Theory constructs a solution. Gradient descent finds a different one. And a subtle property of training data decides whether it finds anything at all.
The formal language expressivity of transformers has been characterized through circuit complexity. Hahn (2020) and Hao et al. (2022) establish that transformers with hard attention — distributions concentrating all mass on a single position — can only recognize languages within AC⁰, the class of problems solvable by constant-depth, polynomial-size circuits with AND/OR gates. Since MAJORITY lies outside AC⁰ (Furst et al., 1981), hard-attention transformers cannot solve it.
Merrill et al. (2022) extend this analysis to saturated attention, which distributes mass uniformly over the score-maximizing positions: given attention scores a ∈ ℝⁿ, position i receives weight 1/|M(a)| if i ∈ M(a) and 0 otherwise, where M(a) = argmaxᵢ aᵢ.
This recovers hard attention when |M(a)| = 1 and uniform attention when |M(a)| = n. Merrill et al. prove that saturated transformers over floating-point values fall within TC⁰, strictly above AC⁰. Their lower-bound proof constructs a 1-layer transformer recognizing MAJORITY via a single uniform attention head with all query and key parameters set to zero. The resulting head computes the mean token value; a linear classifier thresholds at 0.5.
We train a 1-layer transformer with a single attention head, token embeddings, mean pooling, and a linear classifier. The architecture has no positional encodings, no feedforward sublayer, no layer normalization, and no residual connections. The absence of positional encodings follows Merrill et al.'s empirical finding that position-free models generalize best on MAJORITY.
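For concreteness, the forward pass of this stripped-down model can be sketched in a few lines of NumPy. This is an illustrative reimplementation under the architecture described above, not the code used in the experiments; all names are ours.

```python
import numpy as np

def forward(tokens, e, wq, wk, w_out, b_out):
    """Minimal 1-layer attention model (a sketch, not the authors' exact code).

    tokens: binary sequence, shape (n,); e: embedding table, shape (2, d);
    wq, wk: query/key matrices, shape (d, d); w_out, b_out: linear classifier.
    No positional encodings, feedforward sublayer, layer norm, or residuals.
    """
    x = e[tokens]                                    # (n, d) token embeddings
    scores = (x @ wq) @ (x @ wk).T                   # (n, n) attention logits
    a = np.exp(scores - scores.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)                # row-wise softmax
    h = a @ x                                        # attended values (values = embeddings)
    pooled = h.mean(axis=0)                          # mean pooling over positions
    return pooled @ w_out + b_out                    # scalar logit; classify by sign
```

With zero query/key weights the softmax rows are uniform and the model reduces to thresholding the mean token value, which is the reference construction discussed below.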
Unless otherwise stated, we fix embedding dimension d = 1 throughout. At this dimension, all weight matrices collapse to scalars and the attention score between any two token embeddings ei, ej ∈ ℝ reduces to q(ei) · k(ej) = (wq ei)(wk ej) = wq wk ei ej, where wq and wk are the scalar query and key weights.
Since the vocabulary is {0, 1}, every attention interaction is one of two types. We define the same-token score ssame = q(e0) · k(e0) and the cross-token score sdiff = q(e0) · k(e1). The clustering gap is Δ = ssame − sdiff.
When Δ > 0, tokens attend preferentially to same-value tokens. When Δ = 0, attention is uniform — the theoretical construction of Merrill et al. When Δ < 0, tokens attend preferentially to opposite-value tokens. Every trained model is fully characterized by the pair (ssame, sdiff), making the learned solution fully transparent.
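In code, the gap is a one-liner over the four learned scalars (the function name and signature are illustrative):

```python
def clustering_gap(e0, e1, wq, wk):
    """At d = 1 every weight is a scalar, so the attention score between
    embeddings ei, ej is (wq*ei)*(wk*ej).  The gap Delta = s_same - s_diff
    tells us whether tokens prefer same-value tokens (Delta > 0), attend
    uniformly (Delta = 0), or prefer opposite-value tokens (Delta < 0)."""
    score = lambda ei, ej: (wq * ei) * (wk * ej)
    s_same = score(e0, e0)   # score between two 0-tokens
    s_diff = score(e0, e1)   # score between a 0-token and a 1-token
    return s_same - s_diff
```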
MAJORITY is defined for w ∈ {0, 1}ⁿ as MAJ(w) = 1 if Σi wi > n/2 and MAJ(w) = 0 if Σi wi < n/2.
This is a partial function. For even-length sequences with Σi wi = n/2, the output is undefined. Training data assigns these inputs label 0 by convention — an arbitrary choice with consequences examined in Finding 3.
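The labeling convention used to build training data can be stated as a small helper (a sketch; the function name is ours):

```python
def majority_label(w):
    """Training label under the paper's convention: balanced sequences
    (sum == n/2, possible only for even n) are outside MAJORITY's domain
    but are assigned label 0 in the training data."""
    n, s = len(w), sum(w)
    if 2 * s == n:
        return 0                 # undefined input, conventional label
    return int(s > n / 2)
```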
Unless otherwise stated, all experiments use Adam (lr = 0.01), 20 training epochs, 1,000 training sequences, and 200 test sequences per evaluation. We run 10 independent seeds per condition. Training and test sets are generated independently. Accuracy is reported under both softmax and saturated inference on the trained weights.
Adam with d = 1 learns MAJORITY and generalizes perfectly to sequences 27× longer than training: 100% accuracy at test length 301 when trained at length 11, across all 10 seeds. SGD with the same architecture and embedding dimension performs at chance.
Increasing embedding dimension does not compensate. Even at d = 1024, SGD plateaus at approximately 95% and fails to generalize to length 301; no tested capacity closes the gap.
The picture for Adam is more nuanced. At low dimension (d ≤ 52), Adam achieves perfect generalization across all seeds. But as capacity grows, reliability degrades: at d = 112 the mean drops to ~97.6% with emerging variance; at d = 512 three seeds collapse near chance, pulling the mean to ~88.3%; and at d = 1024 six seeds fail outright (mean ~73.5%). Excess capacity appears to destabilize Adam training on this task: the minimum-capacity solution at d = 1 is not just sufficient, it may be optimal.
A possible explanation for Adam's reliability at minimum capacity is that MAJORITY generates heterogeneous gradient signals: same-value token pairs must push ssame positive while cross-value pairs push sdiff negative. Adam's per-parameter adaptive learning rates handle this structure naturally; SGD applies uniform scaling and fails to orient the embeddings correctly at low dimension. Why this advantage inverts at high capacity — where excess dimensions appear to destabilize Adam rather than help — remains an open question.
Adam recovers MAJORITY at d = 1 and generalizes perfectly to all tested lengths. SGD requires d ≥ 15 to begin learning and plateaus at ~95% regardless of further capacity increases. Model capacity does not substitute for optimizer adaptivity on this task. Notably, excess capacity hurts Adam: reliability degrades above d = 52 and collapses at d = 1024, suggesting the minimum-capacity solution is not just sufficient but optimal.
Merrill et al.'s hand-crafted construction sets all query and key parameters to zero, yielding uniform attention with Δ = 0. Every token contributes equally; the head computes the mean embedding and a linear classifier thresholds at 0.5. We use this construction as a reference point — a landmark in the (ssame, sdiff) plane — to characterize how far gradient descent lands from what theory exhibits by hand.
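The construction is easy to verify directly. Below is a minimal sketch of the idea (not Merrill et al.'s code; we embed token 0 as 0.0 and token 1 as 1.0 for readability):

```python
import numpy as np

def handcrafted_majority(w):
    """Reference solution in the style of Merrill et al.: zero query/key
    parameters give uniform attention (Delta = 0), the head outputs the
    mean token value, and a linear classifier thresholds at 0.5."""
    x = np.asarray(w, dtype=float)                   # embed 0 -> 0.0, 1 -> 1.0
    attn = np.full((len(x), len(x)), 1.0 / len(x))   # uniform attention rows
    head_out = attn @ x                              # every position sees the mean
    return int(head_out.mean() > 0.5)                # threshold at 0.5
```

By construction this is correct for every odd-length input, independent of sequence length.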
To measure what gradient descent actually finds, we extract the four learned scalars — e0, e1, wq, wk — directly from the trained model weights after each run and compute ssame and sdiff as defined in §2. Because d = 1, these four scalars fully determine the model's attention behavior; the entire learned mechanism reduces to a single point in the (ssame, sdiff) plane. Figure 2 plots this point for every seed across all training conditions.
Gradient descent never recovers the hand-crafted construction. Across all 10 seeds under softmax training on odd-length sequences, the extracted weights place every model in the bottom-right quadrant: ssame ≈ +0.55, sdiff ≈ −0.55, Δ ≈ 1.10. Same-value tokens attend preferentially to each other; cross-value tokens are suppressed. The hand-crafted construction (Δ = 0) is never approached — all 10 seeds land far from the diagonal, visible in the top-left panel of Figure 2 as a tight cluster in the bottom-right quadrant.
Training directly with saturated attention produces qualitatively different behavior, as shown in the bottom panels of Figure 2. Across 10 seeds, only 5/10 recover a correct solution under odd-length training and 4/10 under even-length training. The failed seeds do not form a consistent alternative attractor — they scatter across all four quadrants: strong positive clustering, near-zero weights, and inverted clustering (Δ < 0) all appear. When saturated training succeeds, it does so through two distinct routes: clustering similar to softmax (Δ ≈ 1.0–1.5), or in isolated cases extreme clustering — seed 4 reaches Δ ≈ 13.47 under even-length training and Δ ≈ 11.96 under odd-length training. Seed 8 is the sole exception, converging near the hand-crafted construction (Δ ≈ 0.09) and succeeding — the only instance where a trained model approaches uniform attention. There is no reliable path to the hand-crafted construction, and the successful mechanism varies widely across runs.
That said, the ~50% success rate reflects a low-data regime. With substantially more training samples (10,000–30,000), saturated training solves MAJORITY reliably — converging to clustering solutions of varying magnitudes, but never to the hand-crafted uniform-attention construction.
To visualize how these weight configurations manifest in practice, Figure 3 shows the attention matrices produced by a representative input sequence under each regime. For the softmax case we take a randomly selected successful seed, as all runs converge to the same clustering solution. For the saturated case we show seed 8 specifically — the only run whose weights land near the hand-crafted construction (Δ ≈ 0.09), and thus the closest any trained model comes to the uniform-attention reference point.
Gradient descent never recovers the hand-crafted uniform-attention construction (Δ = 0). Softmax training converges consistently to a token-clustering solution (Δ ≈ 1.10). Saturated training finds clustering solutions as well — reliably only at larger sample counts — and never the hand-crafted construction.
The initial observation was straightforward: models trained on odd-length sequences generalize reliably, while those trained on even-length sequences degrade. The natural hypothesis was some form of parity sensitivity — an inductive bias in the learned mechanism that favors odd-length training. Extended runs and controlled ablations show this hypothesis is wrong.
The actual cause has nothing to do with parity. It is a property of the training data: even-length sequences can be balanced, and balanced inputs (Σi wi = n/2) sit outside MAJORITY's domain, receiving label 0 by convention. Odd-length sequences contain no such inputs. Among even-length random sequences, the balanced fraction is Θ(1/√n): about 22% at length 12.
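The balanced fraction follows directly from the binomial coefficient, and the 22% figure is a quick check away (the helper name is ours):

```python
from math import comb

def tie_fraction(n):
    """Fraction of uniform random length-n binary sequences that are
    balanced: C(n, n/2) / 2^n for even n (zero for odd n), which decays
    like 1/sqrt(n) by Stirling's approximation."""
    return comb(n, n // 2) / 2 ** n if n % 2 == 0 else 0.0
```

At n = 12 this gives 924/4096 ≈ 0.2256, matching the ~22% quoted above.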
To isolate the causal factor, we run five controlled training conditions evaluated at test length 100. Conditions differ only in data generation:
| Condition | Ties | Class balance | Mean acc ± std | Min |
|---|---|---|---|---|
| A: odd, random | 0% | 50/50 | 95.5 ± 4.7% | 88.0% |
| B: even, random | ~22% | ~61/39 | 70.9 ± 2.7% | 65.0% |
| C: even, no ties, balanced | 0% | 50/50 | 95.8 ± 4.5% | 88.0% |
| D: even, 30% ties, balanced | 30% | 50/50 | 54.5 ± 3.0% | 48.5% |
| E: even, no ties, imbalanced | 0% | 35/65 | 95.8 ± 4.7% | 87.5% |
Conditions A and C are statistically indistinguishable: 95.5% and 95.8%. Condition E (even-length, class-imbalanced, no ties) also matches at 95.8%. Condition D (30% ties, balanced classes) collapses to 54.5%. Length parity and class imbalance have no independent effect; ties are the sole causal factor.
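Condition C, for example, can be generated by rejection sampling: draw uniform random bits and discard exact ties. This is a sketch of the idea only; it omits the 50/50 class balancing used in the actual condition, and all names are illustrative.

```python
import random

def sample_no_ties(n, num_seqs, seed=0):
    """Sample even-length sequences with no ties (condition C style):
    draw uniform random bits and reject balanced sequences, so every
    kept input lies inside MAJORITY's domain."""
    rng = random.Random(seed)
    data = []
    while len(data) < num_seqs:
        w = [rng.randint(0, 1) for _ in range(n)]
        if sum(w) * 2 != n:                       # reject exact ties
            data.append((w, int(sum(w) > n / 2)))
    return data
```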
The mechanism is as follows. The clustering solution requires oriented gradient signal: same-value pairs must push Δ positive. Balanced inputs provide equal same-value and cross-value interactions, contributing zero net gradient to the clustering direction. The conventional label 0 additionally applies a classification loss inconsistent with any clustering solution, since a balanced sequence has no majority to cluster toward. Even 22–30% of such inputs prevents the clustering attractor from forming.
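The cancellation can be made concrete by counting interactions. Counting each token as same-valued with itself, every query position in a balanced sequence sees exactly as many same-value as opposite-value attention targets. The snippet below is a toy illustration of this counting claim, not a gradient computation:

```python
def interaction_counts(w):
    """Count same-value vs opposite-value (query, target) attention pairs,
    self included.  On a balanced sequence the two counts are equal, so
    the net push on the clustering gap Delta cancels out."""
    n = len(w)
    same = sum(1 for i in range(n) for j in range(n) if w[i] == w[j])
    diff = n * n - same
    return same, diff
```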
As a control, we train the same architecture on OR: OR(w) = 1[∃i : wi = 1]. OR is a total function — defined on every binary string, including the all-zeros sequence — with no undefined inputs regardless of sequence length. Training on even-length sequences produces the same accuracy as odd-length training across all seeds. The degradation observed for MAJORITY does not appear, confirming that the effect is specific to the presence of undefined inputs and not a general property of even-length training.
You can verify this by running the OR experiment provided in the repository.
Generalization failure on even-length training is caused entirely by balanced inputs on which MAJORITY is undefined. Removing these inputs restores performance to match odd-length training. Class imbalance has no independent effect. A positive control on OR — a total function with no undefined inputs — confirms that even-length training succeeds when the target function is defined on all training examples.
Zhou et al. (2024) implicitly acknowledge this issue in their treatment of the MODE task, perturbing balanced inputs before training to ensure unique labels. The present experiments show that omitting this preprocessing step is sufficient to eliminate learning, and that the damage is disproportionate to the fraction of undefined inputs in the data.
Three experiments on a single Boolean function surface something broader: the gap between what transformers can compute and what gradient descent actually finds is non-trivial. We believe careful examination of the latter can substantially reshape our understanding of models and the solutions they prefer during training. The findings raise questions we think are worth pursuing.
Optimizer geometry and discrete solutions
Adam reliably recovers a solution to MAJORITY at minimum capacity while SGD cannot. Future work on how optimizer geometry shapes which discrete solutions are reachable in such settings would be valuable.
A theory of token-clustering convergence
Transformers consistently converge to a token-clustering approach rather than the theoretically predicted threshold circuit. The conditions under which this attractor forms — and whether it generalizes to other functions — lack a formal account. A rigorous theory here would substantially clarify what transformers actually learn.
Mechanism diversity in saturated transformers
Saturated training finds clustering solutions rather than the hand-crafted uniform-attention construction, and does so unreliably at low sample counts. With more data it becomes reliable, converging to clustering comparable to softmax. Why the hand-crafted construction is not a gradient descent attractor — even when the architecture is expressive enough to represent it — and whether sufficient data or regularization could steer learning toward it, remains an open question.
Verification at scale
These results are established on small, controlled models. Whether undefined inputs corrupt training proportionally at scale — in large language models or on richer function classes — is an open empirical question with direct implications for data curation.
Furst, M., Saxe, J. B., & Sipser, M. (1981). Parity, circuits, and the polynomial-time hierarchy. 22nd Annual Symposium on Foundations of Computer Science, 260–270.
Hahn, M. (2020). Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8, 156–171.
Hao, Y., Angluin, D., & Frank, R. (2022). Formal language recognition by hard attention transformers. Transactions of the Association for Computational Linguistics, 10, 800–810.
Merrill, W., Sabharwal, A., & Smith, N. A. (2022). Saturated transformers are constant-depth threshold circuits. Transactions of the Association for Computational Linguistics, 10, 843–856.
Zhou, H., Bradley, A., Littwin, E., Razin, N., Saremi, O., Sohl-Dickstein, J., & Nakkiran, P. (2024). What algorithms can transformers learn? A study in length generalization. International Conference on Learning Representations (ICLR).
Please cite this work as:
Elimadi, Adam. "What Transformers Learn When They Solve MAJORITY", GitHub Pages: Blog-1, Feb 2026. https://brokttv.github.io/Blog-1/
Or use the BibTeX citation:
@misc{Elimadi2026majority,
  author       = {Adam Elimadi},
  title        = {What Transformers Learn When They Solve {MAJORITY}},
  year         = {2026},
  howpublished = {\url{https://brokttv.github.io/Blog-1/}},
  note         = {GitHub Pages: Blog-1}
}