Theory constructs a solution. Gradient descent finds a different one. And a subtle property of training data decides whether it finds anything at all.
The formal language expressivity of transformers has been characterized through circuit complexity. Hahn (2020) and Hao et al. (2022) establish that transformers with hard attention — distributions concentrating all mass on a single position — can only recognize languages within AC⁰, the class of problems solvable by constant-depth, polynomial-size circuits with AND/OR gates. Since MAJORITY lies outside AC⁰ (Furst et al., 1981), hard-attention transformers cannot solve it.
Merrill et al. (2022) extend this analysis to saturated attention, which distributes mass uniformly over the score-maximizing positions: given attention scores a ∈ ℝⁿ, position i receives weight 1/|M(a)| if i ∈ M(a) and 0 otherwise, where M(a) = argmaxᵢ aᵢ.
This recovers hard attention when |M(a)| = 1 and uniform attention when |M(a)| = n. Merrill et al. prove that saturated transformers over floating-point values fall within TC⁰, strictly above AC⁰. Their lower-bound proof constructs a 1-layer transformer recognizing MAJORITY via a single uniform attention head with all query and key parameters set to zero. The resulting head computes the mean token value; a linear classifier thresholds at 0.5.
We train a 1-layer transformer with a single attention head, token embeddings, mean pooling, and a linear classifier. The architecture has no positional encodings, no feedforward sublayer, no layer normalization, and no residual connections. The absence of positional encodings follows Merrill et al.'s empirical finding that position-free models generalize best on MAJORITY.
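For concreteness, the forward pass of this stripped-down model can be sketched in a few lines of NumPy. This is an illustrative reimplementation under the architecture described above, not the code used in the experiments; all names are ours.

```python
import numpy as np

def forward(tokens, e, wq, wk, w_out, b_out):
    """Minimal 1-layer attention model (a sketch, not the authors' exact code).

    tokens: binary sequence, shape (n,); e: embedding table, shape (2, d);
    wq, wk: query/key matrices, shape (d, d); w_out, b_out: linear classifier.
    No positional encodings, feedforward sublayer, layer norm, or residuals.
    """
    x = e[tokens]                                    # (n, d) token embeddings
    scores = (x @ wq) @ (x @ wk).T                   # (n, n) attention logits
    a = np.exp(scores - scores.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)                # row-wise softmax
    h = a @ x                                        # attended values (values = embeddings)
    pooled = h.mean(axis=0)                          # mean pooling over positions
    return pooled @ w_out + b_out                    # scalar logit; classify by sign
```

With zero query/key weights the softmax rows are uniform and the model reduces to thresholding the mean token value, which is the reference construction discussed below.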
Unless otherwise stated, we fix embedding dimension d = 1 throughout. At this dimension, all weight matrices collapse to scalars and the attention score between any two token embeddings ei, ej ∈ ℝ reduces to q(ei) · k(ej) = (wq ei)(wk ej) = wq wk ei ej, where wq and wk are the scalar query and key weights.
Since the vocabulary is {0, 1}, every attention interaction is one of two types. We define the same-token score ssame = q(e0) · k(e0) and the cross-token score sdiff = q(e0) · k(e1). The clustering gap is Δ = ssame − sdiff.
When Δ > 0, tokens attend preferentially to same-value tokens. When Δ = 0, attention is uniform — the theoretical construction of Merrill et al. When Δ < 0, tokens attend preferentially to opposite-value tokens. Every trained model is fully characterized by the pair (ssame, sdiff), making the learned solution fully transparent.
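In code, the gap is a one-liner over the four learned scalars (the function name and signature are illustrative):

```python
def clustering_gap(e0, e1, wq, wk):
    """At d = 1 every weight is a scalar, so the attention score between
    embeddings ei, ej is (wq*ei)*(wk*ej).  The gap Delta = s_same - s_diff
    tells us whether tokens prefer same-value tokens (Delta > 0), attend
    uniformly (Delta = 0), or prefer opposite-value tokens (Delta < 0)."""
    score = lambda ei, ej: (wq * ei) * (wk * ej)
    s_same = score(e0, e0)   # score between two 0-tokens
    s_diff = score(e0, e1)   # score between a 0-token and a 1-token
    return s_same - s_diff
```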
MAJORITY is defined for w ∈ {0, 1}ⁿ as MAJ(w) = 1 if Σi wi > n/2 and MAJ(w) = 0 if Σi wi < n/2.
This is a partial function. For even-length sequences with Σi wi = n/2, the output is undefined. Training data assigns these inputs label 0 by convention — an arbitrary choice with consequences examined in Finding 3.
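The labeling convention used to build training data can be stated as a small helper (a sketch; the function name is ours):

```python
def majority_label(w):
    """Training label under the paper's convention: balanced sequences
    (sum == n/2, possible only for even n) are outside MAJORITY's domain
    but are assigned label 0 in the training data."""
    n, s = len(w), sum(w)
    if 2 * s == n:
        return 0                 # undefined input, conventional label
    return int(s > n / 2)
```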
Unless otherwise stated, all experiments use Adam (lr = 0.01), 20 training epochs, 1,000 training sequences, and 200 test sequences per evaluation. We run 10 independent seeds per condition. Training and test sets are generated independently. Accuracy is reported under both softmax and saturated inference on the trained weights.
Adam with d = 1 learns MAJORITY and generalizes perfectly to sequences 27× longer than training: 100% accuracy at test length 301 when trained at length 11, across all 10 seeds. SGD with the same architecture and embedding dimension performs at chance.
Increasing embedding dimension does not compensate. Even at d = 1024, SGD plateaus at approximately 95% and fails to generalize to length 301; no tested capacity closes the gap.
The picture for Adam is more nuanced. At low dimension (d ≤ 52), Adam achieves perfect generalization across all seeds. But as capacity grows, reliability degrades: at d = 112 the mean drops to ~97.6% with emerging variance; at d = 512 three seeds collapse near chance, pulling the mean to ~88.3%; and at d = 1024 six seeds fail outright (mean ~73.5%). Excess capacity appears to destabilize Adam training on this task: the minimum-capacity solution at d = 1 is not just sufficient, it may be optimal.
A possible explanation for Adam's reliability at minimum capacity is that MAJORITY generates heterogeneous gradient signals: same-value token pairs must push ssame positive while cross-value pairs push sdiff negative. Adam's per-parameter adaptive learning rates handle this structure naturally; SGD applies uniform scaling and fails to orient the embeddings correctly at low dimension. Why this advantage inverts at high capacity — where excess dimensions appear to destabilize Adam rather than help — remains an open question.
Adam recovers MAJORITY at d = 1 and generalizes perfectly to all tested lengths. SGD requires d ≥ 15 to begin learning and plateaus at ~95% regardless of further capacity increases. Model capacity does not substitute for optimizer adaptivity on this task. Notably, excess capacity hurts Adam: reliability degrades above d = 52 and collapses at d = 1024, suggesting the minimum-capacity solution is not just sufficient but optimal.
Merrill et al.'s hand-crafted construction sets all query and key parameters to zero, yielding uniform attention with Δ = 0. Every token contributes equally; the head computes the mean embedding and a linear classifier thresholds at 0.5. We use this construction as a reference point — a landmark in the (ssame, sdiff) plane — to characterize how far gradient descent lands from what theory exhibits by hand.
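The construction is easy to verify directly. Below is a minimal sketch of the idea (not Merrill et al.'s code; we embed token 0 as 0.0 and token 1 as 1.0 for readability):

```python
import numpy as np

def handcrafted_majority(w):
    """Reference solution in the style of Merrill et al.: zero query/key
    parameters give uniform attention (Delta = 0), the head outputs the
    mean token value, and a linear classifier thresholds at 0.5."""
    x = np.asarray(w, dtype=float)                   # embed 0 -> 0.0, 1 -> 1.0
    attn = np.full((len(x), len(x)), 1.0 / len(x))   # uniform attention rows
    head_out = attn @ x                              # every position sees the mean
    return int(head_out.mean() > 0.5)                # threshold at 0.5
```

By construction this is correct for every odd-length input, independent of sequence length.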
To measure what gradient descent actually finds, we extract the four learned scalars — e0, e1, wq, wk — directly from the trained model weights after each run and compute ssame and sdiff as defined in §2. Because d = 1, these four scalars fully determine the model's attention behavior; the entire learned mechanism reduces to a single point in the (ssame, sdiff) plane. Figure 2 plots this point for every seed across all training conditions.
Gradient descent never recovers the hand-crafted construction. Across all 10 seeds under softmax training on odd-length sequences, the extracted weights place every model in the bottom-right quadrant: ssame ≈ +0.55, sdiff ≈ −0.55, Δ ≈ 1.10. Same-value tokens attend preferentially to each other; cross-value tokens are suppressed. The hand-crafted construction (Δ = 0) is never approached — all 10 seeds land far from the diagonal, visible in the top-left panel of Figure 2 as a tight cluster in the bottom-right quadrant.
Training directly with saturated attention produces qualitatively different behavior, as shown in the bottom panels of Figure 2. Across 10 seeds, only 5/10 recover a correct solution under odd-length training and 4/10 under even-length training. The failed seeds do not form a consistent alternative attractor — they scatter across all four quadrants: strong positive clustering, near-zero weights, and inverted clustering (Δ < 0) all appear. When saturated training succeeds, it does so through two distinct routes: clustering similar to softmax (Δ ≈ 1.0–1.5), or in isolated cases extreme clustering — seed 4 reaches Δ ≈ 13.47 under even-length training and Δ ≈ 11.96 under odd-length training. Seed 8 is the sole exception, converging near the hand-crafted construction (Δ ≈ 0.09) and succeeding — the only instance where a trained model approaches uniform attention. There is no reliable path to the hand-crafted construction, and the successful mechanism varies widely across runs.
That said, the ~50% success rate reflects a low-data regime. With substantially more training samples (10,000–30,000), saturated training solves MAJORITY reliably — converging to clustering solutions of varying magnitudes, but never to the hand-crafted uniform-attention construction.
To visualize how these weight configurations manifest in practice, Figure 3 shows the attention matrices produced by a representative input sequence under each regime. For the softmax case we take a randomly selected successful seed, as all runs converge to the same clustering solution. For the saturated case we show seed 8 specifically — the only run whose weights land near the hand-crafted construction (Δ ≈ 0.09), and thus the closest any trained model comes to the uniform-attention reference point.
Gradient descent never recovers the hand-crafted uniform-attention construction (Δ = 0). Softmax training converges consistently to a token-clustering solution (Δ ≈ 1.10). Saturated training finds clustering solutions as well — reliably only at larger sample counts — and never the hand-crafted construction.
The initial observation was straightforward: models trained on odd-length sequences generalize reliably, while those trained on even-length sequences degrade. The natural hypothesis was some form of parity sensitivity — an inductive bias in the learned mechanism that favors odd-length training. Extended runs and controlled ablations show this hypothesis is wrong.
The actual cause has nothing to do with parity. It is a property of the training data: even-length sequences can be balanced, and balanced inputs (Σi wi = n/2) sit outside MAJORITY's domain, receiving label 0 by convention. Odd-length sequences contain no such inputs. Among even-length random sequences, the balanced fraction is Θ(1/√n): about 22% at length 12.
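The balanced fraction follows directly from the binomial coefficient, and the 22% figure is a quick check away (the helper name is ours):

```python
from math import comb

def tie_fraction(n):
    """Fraction of uniform random length-n binary sequences that are
    balanced: C(n, n/2) / 2^n for even n (zero for odd n), which decays
    like 1/sqrt(n) by Stirling's approximation."""
    return comb(n, n // 2) / 2 ** n if n % 2 == 0 else 0.0
```

At n = 12 this gives 924/4096 ≈ 0.2256, matching the ~22% quoted above.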
To isolate the causal factor, we run five controlled training conditions evaluated at test length 100. Conditions differ only in data generation:
| Condition | Ties | Class balance | Mean acc ± std | Min |
|---|---|---|---|---|
| A: odd, random | 0% | 50/50 | 95.5 ± 4.7% | 88.0% |
| B: even, random | ~22% | ~61/39 | 70.9 ± 2.7% | 65.0% |
| C: even, no ties, balanced | 0% | 50/50 | 95.8 ± 4.5% | 88.0% |
| D: even, 30% ties, balanced | 30% | 50/50 | 54.5 ± 3.0% | 48.5% |
| E: even, no ties, imbalanced | 0% | 35/65 | 95.8 ± 4.7% | 87.5% |
Conditions A and C are statistically indistinguishable: 95.5% and 95.8%. Condition E (even-length, class-imbalanced, no ties) also matches at 95.8%. Condition D (30% ties, balanced classes) collapses to 54.5%. Length parity and class imbalance have no independent effect; ties are the sole causal factor.
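Condition C, for example, can be generated by rejection sampling: draw uniform random bits and discard exact ties. This is a sketch of the idea only; it omits the 50/50 class balancing used in the actual condition, and all names are illustrative.

```python
import random

def sample_no_ties(n, num_seqs, seed=0):
    """Sample even-length sequences with no ties (condition C style):
    draw uniform random bits and reject balanced sequences, so every
    kept input lies inside MAJORITY's domain."""
    rng = random.Random(seed)
    data = []
    while len(data) < num_seqs:
        w = [rng.randint(0, 1) for _ in range(n)]
        if sum(w) * 2 != n:                       # reject exact ties
            data.append((w, int(sum(w) > n / 2)))
    return data
```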
The mechanism is as follows. The clustering solution requires oriented gradient signal: same-value pairs must push Δ positive. Balanced inputs provide equal same-value and cross-value interactions, contributing zero net gradient to the clustering direction. The conventional label 0 additionally applies a classification loss inconsistent with any clustering solution, since a balanced sequence has no majority to cluster toward. Even 22–30% of such inputs prevents the clustering attractor from forming.
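The cancellation can be made concrete by counting interactions. Counting each token as same-valued with itself, every query position in a balanced sequence sees exactly as many same-value as opposite-value attention targets. The snippet below is a toy illustration of this counting claim, not a gradient computation:

```python
def interaction_counts(w):
    """Count same-value vs opposite-value (query, target) attention pairs,
    self included.  On a balanced sequence the two counts are equal, so
    the net push on the clustering gap Delta cancels out."""
    n = len(w)
    same = sum(1 for i in range(n) for j in range(n) if w[i] == w[j])
    diff = n * n - same
    return same, diff
```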
As a control, we train the same architecture on OR: OR(w) = 1[∃i : wi = 1]. OR is a total function — defined on every binary string, including the all-zeros sequence — with no undefined inputs regardless of sequence length. Training on even-length sequences produces the same accuracy as odd-length training across all seeds. The degradation observed for MAJORITY does not appear, confirming that the effect is specific to the presence of undefined inputs and not a general property of even-length training.
You can verify this by running the OR experiment provided in the repository.
Generalization failure on even-length training is caused entirely by balanced inputs on which MAJORITY is undefined. Removing these inputs restores performance to match odd-length training. Class imbalance has no independent effect. A positive control on OR — a total function with no undefined inputs — confirms that even-length training succeeds when the target function is defined on all training examples.
Zhou et al. (2024) implicitly acknowledge this issue in their treatment of the MODE task, perturbing balanced inputs before training to ensure unique labels. The present experiments show that omitting this preprocessing step is sufficient to eliminate learning, and that the damage is disproportionate to the fraction of undefined inputs in the data.
Three experiments on a single Boolean function surface something broader: the gap between what transformers can compute and what gradient descent actually finds is non-trivial. We believe careful examination of the latter can substantially reshape our understanding of models and the solutions they prefer during training. The findings raise questions we think are worth pursuing.
Optimizer geometry and discrete solutions
Adam reliably recovers a solution to MAJORITY at minimum capacity while SGD cannot. Future work on how optimizer geometry shapes which discrete solutions are reachable in such settings would be valuable.
A theory of token-clustering convergence
Transformers consistently converge to a token-clustering approach rather than the theoretically predicted threshold circuit. The conditions under which this attractor forms — and whether it generalizes to other functions — lack a formal account. A rigorous theory here would substantially clarify what transformers actually learn.
Mechanism diversity in saturated transformers
Saturated training finds clustering solutions rather than the hand-crafted uniform-attention construction, and does so unreliably at low sample counts. With more data it becomes reliable, converging to clustering comparable to softmax. Why the hand-crafted construction is not a gradient descent attractor — even when the architecture is expressive enough to represent it — and whether sufficient data or regularization could steer learning toward it, remains an open question.
Verification at scale
These results are established on small, controlled models. Whether undefined inputs corrupt training proportionally at scale — in large language models or on richer function classes — is an open empirical question with direct implications for data curation.
Furst, M., Saxe, J. B., & Sipser, M. (1981). Parity, circuits, and the polynomial-time hierarchy. 22nd Annual Symposium on Foundations of Computer Science, 260–270.
Hahn, M. (2020). Theoretical limitations of self-attention in neural sequence models. Transactions of the Association for Computational Linguistics, 8, 156–171.
Hao, Y., Angluin, D., & Frank, R. (2022). Formal language recognition by hard attention transformers. Transactions of the Association for Computational Linguistics, 10, 800–810.
Merrill, W., Sabharwal, A., & Smith, N. A. (2022). Saturated transformers are constant-depth threshold circuits. Transactions of the Association for Computational Linguistics, 10, 843–856.
Zhou, H., Bradley, A., Littwin, E., Razin, N., Saremi, O., Sohl-Dickstein, J., & Nakkiran, P. (2024). What algorithms can transformers learn? A study in length generalization. International Conference on Learning Representations (ICLR).
Please cite this work as:
Elimadi, Adam. "What Transformers Learn When They Solve MAJORITY", GitHub Pages: Blog-1, Feb 2026. https://brokttv.github.io/Blog-1/
Or use the BibTeX citation:
@misc{Elimadi2026majority,
  author       = {Adam Elimadi},
  title        = {What Transformers Learn When They Solve {MAJORITY}},
  year         = {2026},
  howpublished = {\url{https://brokttv.github.io/Blog-1/}},
  note         = {GitHub Pages: Blog-1}
}