Unpacking Binary Routing in Transformers: A New Perspective

MLP layers in transformers reveal a binary routing mechanism for continuous signals. The finding challenges traditional views, offering a fresh lens on how neural networks process information.
In transformer language models, particularly GPT-2 Small with its 124 million parameters, an intriguing mechanism unfolds within the MLP layers: binary routing directs continuous signals through a seemingly simplified, yet effective, decision-making structure.
Binary Decisions in Neural Networks
The discovery centers on binary neuron activations. While the signals themselves are continuous, whether a token undergoes nonlinear processing is largely determined by binary decisions. The result is a consensus architecture: seven 'default-ON' neurons plus one standout exception handler, N2123 in Layer 11. Notably, these neurons fire with 93-98% mutual exclusivity, effectively acting as a binary switch.
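A minimal sketch of how such mutual exclusivity could be measured, using synthetic activations in place of real GPT-2 neuron values (the anti-correlated pattern below is an assumption for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated pre-activation values for two neurons over 10,000 tokens;
# the anti-correlated offsets mimic a default-ON / exception-handler pair.
signal = rng.normal(size=10_000)
neuron_a = signal + 0.1 * rng.normal(size=10_000)   # "default-ON" neuron
neuron_b = -signal + 0.1 * rng.normal(size=10_000)  # exception handler

# Binarize: a neuron counts as "active" when its value exceeds zero.
active_a = neuron_a > 0
active_b = neuron_b > 0

# Mutual exclusivity: fraction of tokens where exactly one of the two fires.
exclusivity = np.mean(active_a ^ active_b)
print(f"mutual exclusivity: {exclusivity:.1%}")
```

With strongly anti-correlated activations, the exclusivity score approaches 1; independent neurons would sit near 0.5.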
Visualize this: as data traverses the network, early layers (L1-3) employ single gateway neurons for routing exceptions, bypassing consensus. Middle layers (L4-6) exhibit a more diffuse approach, lacking both gateway and consensus structures. It's not until the later layers (L7-11) that a full consensus architecture crystallizes, expanding from one to three, then to seven consensus neurons.
Functional Validation and Implications
Causal validation underscores the functionality of this binary routing. Disrupting the MLP at the consensus breakdown causes a 43.3% increase in perplexity, while removing it at full consensus results in just a 10.1% change. This stark difference highlights the structure's efficiency.
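The ablation test behind those numbers can be sketched as follows. The tiny residual network and all weights below are toy stand-ins for GPT-2, illustrating only the mechanics: zero out one MLP block's contribution to the residual stream and compare perplexity before and after.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy dimensions: hidden size, vocabulary, number of tokens.
d, vocab, n = 16, 32, 200
W1, W2 = rng.normal(size=(2, d, d)) * 0.3    # per-block MLP weights
W_out = rng.normal(size=(d, vocab)) * 0.3    # unembedding
x = rng.normal(size=(n, d))
targets = rng.integers(0, vocab, size=n)

def forward(x, ablate_block=None):
    h = x
    for i, W in enumerate((W1, W2)):
        mlp_out = np.maximum(h @ W, 0.0)     # ReLU MLP block
        if i != ablate_block:                # zero-ablate the chosen block
            h = h + mlp_out                  # residual stream update
    probs = softmax(h @ W_out)
    nll = -np.log(probs[np.arange(n), targets]).mean()
    return np.exp(nll)                       # perplexity = exp(mean NLL)

base = forward(x)
for blk in (0, 1):
    ppl = forward(x, ablate_block=blk)
    print(f"ablating block {blk}: perplexity {base:.2f} -> {ppl:.2f}")
```

The reported finding is that this kind of intervention hurts far more (43.3%) where the consensus structure breaks down than where it is fully formed (10.1%).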
But why should we care? These binary decisions are so effective that binarization loses almost no information, maintaining 79.2% accuracy against 78.8% with continuous features. Continuous activations do, however, carry additional magnitude information, with R² values of 0.36 compared to 0.22. So, are we overlooking the simplicity of binary routing amid our fascination with complex models?
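A toy version of that binarization comparison, assuming synthetic features in which class identity rides mostly on which units fire; the nearest-centroid probe here is a simplification of whatever classifier the original analysis used:

```python
import numpy as np

rng = np.random.default_rng(2)

n, d = 2000, 8
y = rng.integers(0, 2, size=n)
# Each class flips a fixed sign pattern; magnitudes are mostly noise.
signs = rng.choice([1.0, -1.0], size=d)
pattern = np.where(y[:, None] == 1, 1.0, -1.0) * signs
X = pattern * np.abs(rng.normal(1.0, 0.5, size=(n, d))) \
    + 0.3 * rng.normal(size=(n, d))

def centroid_accuracy(F, y, n_train=1000):
    # Fit class centroids on the first n_train rows, score on the rest.
    mu0 = F[:n_train][y[:n_train] == 0].mean(axis=0)
    mu1 = F[:n_train][y[:n_train] == 1].mean(axis=0)
    d0 = ((F[n_train:] - mu0) ** 2).sum(axis=1)
    d1 = ((F[n_train:] - mu1) ** 2).sum(axis=1)
    return np.mean((d1 < d0) == (y[n_train:] == 1))

acc_cont = centroid_accuracy(X, y)
acc_bin = centroid_accuracy((X > 0).astype(float), y)
print(f"continuous: {acc_cont:.1%}, binarized: {acc_bin:.1%}")
```

When the discriminative signal lives in firing patterns rather than magnitudes, thresholding the features costs essentially nothing, mirroring the 79.2% vs. 78.8% result.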
Rethinking Neural Network Structures
The binary routing structure challenges traditional views on deep networks. Smooth polynomial approximations fail to capture the complexity of highly nonlinear layers, with cross-validated fits never exceeding an R² of 0.06. Instead, this routing offers a fresh perspective, showing that along the natural data manifold, piecewise boundaries implement critical binary decisions for signal processing.
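A toy illustration of why a smooth fit can post near-zero held-out R² on a piecewise map (synthetic 1-D data; the segment count and cubic degree are arbitrary choices, not taken from the original analysis):

```python
import numpy as np

rng = np.random.default_rng(3)

x = rng.uniform(-1, 1, size=2000)
levels = rng.normal(size=20)                        # one level per segment
segment = ((x + 1) * 10).astype(int).clip(0, 19)    # which segment x falls in
y = levels[segment]                                 # piecewise-constant map

# Fit a smooth cubic on half the data, evaluate on the held-out half.
train, test = slice(0, 1000), slice(1000, None)
coeffs = np.polyfit(x[train], y[train], deg=3)
pred = np.polyval(coeffs, x[test])

ss_res = np.sum((y[test] - pred) ** 2)
ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"held-out R^2 of cubic fit: {r2:.3f}")
```

A low-degree polynomial cannot track many independent piecewise segments, so its cross-validated R² stays low even though a piecewise model would fit the data exactly.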
The takeaway: binary routing isn't just a technical nuance. It's a fundamental shift, offering a complementary view to the established piecewise-affine characterizations of deep networks. As the AI field continues to evolve, this insight demands attention, urging us to reconsider the foundational processes of neural computation. Are we ready to embrace this shift and what it means for the future of AI?
Key Terms Explained
Attention: A mechanism that lets neural networks focus on the most relevant parts of their input when producing output.
GPT: Generative Pre-trained Transformer.
Neural network: A computing system loosely inspired by biological brains, consisting of interconnected nodes (neurons) organized in layers.
Perplexity: A measurement of how well a language model predicts text.