Mixing Activations: A New Era for Transformer Networks

Feedforward network (FFN) layers have long been the backbone of Transformer-based large language models, accounting for a significant portion of their parameter count and nonlinear expressivity. Typically, these layers rely on a fixed activation function such as ReLU, or on a gated variant such as SwiGLU. A new approach, Mixture of Activations (MoA), challenges that convention with a dynamic, token-adaptive design.

MoA: A Major Shift for FFNs?

The MoA framework introduces a dictionary of activation functions that are mixed using lightweight, input-dependent gates. The linear projections stay shared; only the mix of activations varies from token to token, which potentially offers greater expressivity without a hefty computational cost. Learnable activations (LA) are the input-independent counterpart: they form a learned linear combination of activation functions whose weights do not depend on the input. Both MoA and LA are defined for ReLU-type and SwiGLU-type FFNs.
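
To make the difference between the three designs concrete, here is a minimal NumPy sketch of a ReLU-type FFN in fixed, LA, and MoA form. Everything specific in it is an assumption made for illustration, not taken from the paper: the three-entry activation dictionary, the softmax gate with one scalar weight per activation per token, and the names `fixed_ffn`, `la_ffn`, `moa_ffn`, `w_gate`, and `b_gate`.

```python
import numpy as np

# Hypothetical activation dictionary; the article does not say which
# functions the paper actually uses.
def relu(x):
    return np.maximum(x, 0.0)

def silu(x):
    return x / (1.0 + np.exp(-x))

def gelu_tanh(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

ACTIVATIONS = (relu, silu, gelu_tanh)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fixed_ffn(x, w_in, w_out, act=relu):
    # Standard FFN: one fixed activation between two linear projections.
    return act(x @ w_in) @ w_out

def la_ffn(x, w_in, w_out, alpha):
    # Learnable activation: one learned weight per dictionary entry,
    # identical for every token (input-independent).
    h = x @ w_in
    mixed = sum(a * f(h) for a, f in zip(alpha, ACTIVATIONS))
    return mixed @ w_out

def moa_ffn(x, w_in, w_out, w_gate, b_gate):
    # Mixture of activations: the mixing weights come from a small gate
    # evaluated on each token (assumed here to be a softmax over K logits).
    h = x @ w_in                                            # (tokens, d_ff)
    gates = softmax(x @ w_gate + b_gate)                    # (tokens, K)
    acts = np.stack([f(h) for f in ACTIVATIONS], axis=-1)   # (tokens, d_ff, K)
    mixed = (acts * gates[:, None, :]).sum(axis=-1)         # (tokens, d_ff)
    return mixed @ w_out

rng = np.random.default_rng(0)
tokens, d_model, d_ff, k = 4, 8, 16, len(ACTIVATIONS)
x = rng.normal(size=(tokens, d_model))
w_in = rng.normal(size=(d_model, d_ff))
w_out = rng.normal(size=(d_ff, d_model))
w_gate = rng.normal(size=(d_model, k))
b_gate = np.zeros(k)

y = moa_ffn(x, w_in, w_out, w_gate, b_gate)  # same shape as x: (tokens, d_model)
```

In this sketch the gate adds only `d_model * k + k` parameters on top of the two shared projections, which is the sense in which the gating is lightweight. Setting `w_gate` to zero makes the gates constant and reduces `moa_ffn` to `la_ffn` with `alpha = softmax(b_gate)`, and a one-hot `alpha` reduces `la_ffn` to `fixed_ffn`.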

The theoretical analysis shows that LA strictly contains fixed-activation FFNs, and that MoA in turn exceeds LA, thanks to its input-dependent mixing of nonlinearities. This expressivity hierarchy positions MoA as a useful tool for model designers.
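
One way to write this hierarchy down for the ReLU-type case is sketched below. The notation is assumed here for illustration and is not taken from the paper: $W_1$ and $W_2$ are the shared projections, $\{\sigma_k\}_{k=1}^{K}$ is the activation dictionary, $\alpha_k$ are learned scalars, and $g_k(x)$ are input-dependent gates (written as one scalar per activation, which is itself an assumption).

```latex
\begin{aligned}
\text{Fixed:}\quad & f(x) = W_2\,\sigma(W_1 x) \\
\text{LA:}\quad    & f(x) = W_2 \sum_{k=1}^{K} \alpha_k\,\sigma_k(W_1 x) \\
\text{MoA:}\quad   & f(x) = W_2 \sum_{k=1}^{K} g_k(x)\,\sigma_k(W_1 x)
\end{aligned}
```

If $\sigma$ is one of the $\sigma_k$, choosing a one-hot $\alpha$ recovers the fixed case, and holding each $g_k$ constant in $x$ recovers LA. That explains why each family contains the previous one; the strictness of the containments is the paper's claim and is not derived here.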

Impressive Benchmarks

The paper, published in Japanese, reports strong empirical results. MoA consistently achieves lower terminal loss and more favorable scaling behavior than well-tuned baselines. These results come from extensive pre-training experiments on dense and mixture-of-experts (MoE) models, with parameter counts ranging from 0.12B to 2B.

Why does this matter? As language models continue to expand in size and complexity, using parameters efficiently becomes increasingly important. MoA offers a way to increase expressivity without significantly increasing computational overhead. This could lead to more efficient models that run with lower resource requirements, potentially democratizing access to advanced language technologies.

The Future of Activation Functions

Western coverage has largely overlooked this development, focusing instead on incremental improvements in parameter size and training data. But the question remains: could MoA be the key to unlocking even greater capabilities in language models? As AI researchers continue to push the envelope, innovations like MoA might just redefine what's possible in natural language processing.

In an industry where bigger usually means better, MoA's approach suggests that smarter, more efficient designs can also lead to significant advancements. This isn't just a technical tweak; it's a strategic shift that could influence how AI models are built in the future. The reported results support its potential, and as more researchers take note, we might see broader adoption of adaptive activation mixing in mainstream AI models.
