Decoding the Layers of 1B-Class Language Models

Recent research traces the developmental trajectory of attention-head circuits in three 1B-class models: Pythia 1B, OLMo 1B-0724-hf, and OLMoE 1B-7B-0924. Together these models span two architecture families and different pretraining corpora, which makes it possible to compare how attention-head circuits form and evolve under different conditions.

Distinct Emergence Patterns

The study finds that layers 0 and 1 produce zero BOS-classified heads (heads classified as attending mainly to the beginning-of-sequence token) in every model and every revision examined. The study describes this as an architectural characteristic rather than a learned outcome. This structural zero-BOS floor is the baseline against which the later emergence of BOS-attractor heads is measured.
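To make the per-layer count concrete, here is a minimal sketch of how BOS-classified heads might be tallied from attention weights. The classification rule (mean attention mass on position 0 above a threshold of 0.5), the function names, and the toy attention patterns are all assumptions for illustration; the study's actual criterion is not given here.

```python
from typing import List

# Assumed layout: attn[layer][head][query][key], each query row summing to 1.
Attention = List[List[List[List[float]]]]


def bos_mass(head_attn: List[List[float]]) -> float:
    """Mean attention weight on key position 0, over queries after the first.

    Query 0 is skipped because under causal masking it can only attend to itself.
    """
    rows = head_attn[1:]
    if not rows:
        return 0.0
    return sum(row[0] for row in rows) / len(rows)


def bos_heads_per_layer(attn: Attention, threshold: float = 0.5) -> List[int]:
    """Count heads in each layer whose BOS mass exceeds the (assumed) threshold."""
    return [sum(1 for head in layer if bos_mass(head) > threshold) for layer in attn]


def bos_fraction(attn: Attention, threshold: float = 0.5) -> float:
    """Whole-model fraction of heads classified as BOS attractors."""
    total = sum(len(layer) for layer in attn)
    if total == 0:
        return 0.0
    return sum(bos_heads_per_layer(attn, threshold)) / total


# Toy 2-layer, 2-head, 3-token example (made-up weights, not measured data).
uniform = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [1 / 3, 1 / 3, 1 / 3]]
diagonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sink = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.1, 0.1]]
toy = [[uniform, diagonal], [sink, uniform]]

print(bos_heads_per_layer(toy))  # per-layer counts of BOS-classified heads
print(bos_fraction(toy))         # whole-model BOS-attractor fraction
```

Under this rule, a structural zero-BOS floor would show up as the first entries of the per-layer list staying at zero for every checkpoint.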

Where the models differ is in the shape of the whole-model BOS-attractor fraction over the course of training. Pythia 1B shows a gradual ramp. OLMo 1B instead undergoes a sharp phase transition, jumping from 7% to 70% between checkpoints. OLMoE 1B-7B follows a gradual ramp similar to Pythia's.
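One simple way to separate a gradual ramp from a phase transition is to look at the largest change in BOS-attractor fraction between adjacent checkpoints. The sketch below assumes that heuristic and an arbitrary cutoff of 0.5; neither comes from the study. Only the 0.07 to 0.70 jump is taken from the text above, and every other value in the two series is a placeholder.

```python
from typing import Sequence


def largest_jump(fractions: Sequence[float]) -> float:
    """Largest increase in fraction between two adjacent checkpoints."""
    return max((b - a for a, b in zip(fractions, fractions[1:])), default=0.0)


def emergence_shape(fractions: Sequence[float], jump_threshold: float = 0.5) -> str:
    """Label a trajectory by its largest adjacent jump (assumed heuristic)."""
    if largest_jump(fractions) >= jump_threshold:
        return "phase transition"
    return "gradual ramp"


# Placeholder series: only the 0.07 -> 0.70 step reflects a reported number.
sharp = [0.02, 0.07, 0.70, 0.74]
gradual = [0.05, 0.20, 0.35, 0.50, 0.65]

print(emergence_shape(sharp))
print(emergence_shape(gradual))
```

The heuristic depends on checkpoint spacing: a ramp sampled at sparse checkpoints can look like a jump, so the cutoff would need to be chosen with the checkpoint schedule in mind.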

The Timing of Capabilities

The timing of induction-circuit formation is also notable, particularly in models trained on the DCLM corpus. There, induction circuits form 10-20x earlier, measured in training tokens, than BOS-attractor heads do. The two transitions therefore do not coincide: the capability-circuit transition and the attention-sink transition are distinct events.
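The 10-20x separation can be expressed as a ratio of token counts and, equivalently, as a gap in orders of magnitude. The sketch below shows the arithmetic; the two token counts are hypothetical values chosen to fall inside the reported range, not measurements from any of the models.

```python
import math


def separation_factor(induction_tokens: float, bos_tokens: float) -> float:
    """How many times more tokens the BOS transition needs than the induction one."""
    if induction_tokens <= 0 or bos_tokens <= 0:
        raise ValueError("token counts must be positive")
    return bos_tokens / induction_tokens


def orders_of_magnitude(factor: float) -> float:
    """Express a multiplicative factor as a base-10 logarithm."""
    return math.log10(factor)


# Hypothetical token counts, for illustration only.
induction_at = 2e9   # tokens seen when the induction circuit has formed
bos_at = 3e10        # tokens seen when BOS-attractor heads have formed

factor = separation_factor(induction_at, bos_at)
print(factor, orders_of_magnitude(factor))
```

A factor between 10 and 20 corresponds to roughly 1.0 to 1.3 orders of magnitude, which is why the gap is described as about an order of magnitude in tokens.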

A deeper question follows: why does the capability-specific screen converge to the final induction circuit within only 0.3-2% of total training tokens? Such early convergence suggests that the circuit can be identified from early checkpoints, so a fully trained model may not be needed to study its functional structure.
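One way to measure when a screen has converged is to compare the set of heads it selects at each checkpoint with the set it selects at the final checkpoint. The sketch below assumes Jaccard overlap as the similarity measure and a cutoff of 0.8; both are illustrative choices, and the head identifiers and token counts in the toy run are invented.

```python
from typing import List, Optional, Set, Tuple

Head = Tuple[int, int]               # (layer, head index)
Checkpoint = Tuple[int, Set[Head]]   # (tokens seen, heads selected by the screen)


def jaccard(a: Set[Head], b: Set[Head]) -> float:
    """Intersection over union; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def convergence_point(checkpoints: List[Checkpoint], min_overlap: float = 0.8) -> Optional[int]:
    """Tokens seen at the earliest checkpoint from which the selected set stays
    within min_overlap of the final set. Checkpoints must be sorted by tokens."""
    if not checkpoints:
        return None
    final = checkpoints[-1][1]
    converged_at = None
    for tokens, heads in checkpoints:
        if jaccard(heads, final) >= min_overlap:
            if converged_at is None:
                converged_at = tokens
        else:
            converged_at = None  # overlap dropped again, so restart the search
    return converged_at


# Invented toy run: head sets and token counts are placeholders.
toy_run = [
    (1_000_000_000, set()),
    (5_000_000_000, {(4, 2), (5, 7)}),
    (20_000_000_000, {(4, 2), (5, 7), (6, 1)}),
    (1_000_000_000_000, {(4, 2), (5, 7), (6, 1)}),
]

point = convergence_point(toy_run)
print(point, point / toy_run[-1][0])  # convergence point and its share of total tokens
```

Dividing the convergence point by the total token count gives the kind of fraction quoted above; in this toy run it lands at the upper end of the 0.3-2% range by construction.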

Implications for Model Training

The results refine the picture of the induction phase transition and also raise questions about training strategy. Because the induction and attention-sink transitions are separated by an order of magnitude in tokens, they may respond to different training interventions. Whether training protocols should be designed or sequenced around these developmental patterns remains an open question.

Knowing when and how attention-head circuits emerge could help researchers and developers tune models for better performance and efficiency, and it may inform future work on language model architecture.
