Latest AI News

arXiv cs.AI•about 9 hours ago·6 min read

Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

arXiv:2605.26895v1 (announce type: cross)

Abstract: Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector. While the normalization operation has been extensively studied, the scale vector remains poorly understood despite its ubiquitous use. In this work, we present a systematic study of scale vectors in LLMs from the perspectives of expressivity, optimization, and architectural structure. First, we show empirically that although scale vectors constitute only a negligible fraction of model parameters, removing them substantially degrades LLM pre-training. Our theory further shows that, in Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Second, we investigate the role of weight decay for scale vectors. By distinguishing Input-Norm and Output-Norm layers, we theoretically show that weight decay is beneficial for the former but harmful for the latter, due to their distinct roles in optimization and expressivity. Third, motivated by this understanding, we propose three lightweight and complementary improvements to scale vectors: branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization. Both theory and experiments show that each improvement yields consistent gains. Finally, we combine these improvements into a unified scale-vector strategy and evaluate it through extensive LLM pre-training experiments on dense and mixture-of-experts models ranging from 0.12B to 2B parameters, across multiple optimizers and learning rate schedules, under industrial-scale token budgets. The unified strategy consistently achieves lower terminal loss than well-tuned baselines and exhibits more favorable scaling behavior, while adding negligible parameter and computational overhead.
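The claim that scale vectors add no expressivity in Pre-Norm architectures can be illustrated with a small NumPy sketch. It is not the paper's code. It assumes an RMSNorm-style normalization followed directly by a bias-free linear map, and all names (`rms_normalize`, `norm_then_linear`, `fold_scale_into_weights`, `gamma`, `W`) are hypothetical. Under those assumptions the scale vector can be folded into the columns of the next weight matrix without changing the output, so any effect it has must come from how training proceeds rather than from which functions the layer can represent.

```python
import numpy as np


def rms_normalize(x, eps=1e-6):
    """Deterministic RMS normalization over the last axis (no learnable scale)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


def norm_then_linear(x, gamma, W):
    """Normalize x, multiply elementwise by the scale vector gamma,
    then apply a bias-free linear map with weights W of shape (out, in)."""
    return (rms_normalize(x) * gamma) @ W.T


def fold_scale_into_weights(gamma, W):
    """Return W @ diag(gamma): column j of W is scaled by gamma[j]."""
    return W * gamma  # gamma broadcasts across the rows of W


# Illustrative shapes only (assumption): batch of 4, width 8, output width 16.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
gamma = rng.normal(size=8)
W = rng.normal(size=(16, 8))

# Path A: explicit scale vector before the linear map.
with_scale = norm_then_linear(x, gamma, W)

# Path B: scale vector fixed to ones, with gamma absorbed into the weights.
folded = norm_then_linear(x, np.ones(8), fold_scale_into_weights(gamma, W))

# The two paths agree up to floating-point rounding.
max_abs_diff = float(np.max(np.abs(with_scale - folded)))
print(f"max |with_scale - folded| = {max_abs_diff:.2e}")
```

The two parameterizations compute the same function but are not equivalent under gradient descent, which is consistent with the abstract's framing of the scale vector as a preconditioner on the subsequent linear mapping. The sketch does not attempt to reproduce that optimization effect.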

Latest News

Linear and Neural Dueling Bandits with Delayed Feedback

DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding

SL-BiLEM: Structured Learnable Behavior-in-the-Loop Epidemic Modeling for Forecasting and Policy Evaluation

Measuring Prediction Uncertainty in Neural Cellular Automata

Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training

JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search

More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations

Adversarial Training for Robust Coverage Network under Worst-case Facility Losses

L2Rec: Towards Dual-View Understanding of LLMs for Personalized Recommendation

MatFormBench: A Benchmarking Evaluation Framework for Target-Driven Materials Formulation

Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets

Rotation-Invariant Spherical Watermarking via Third-Order SO(3) Representation Coupling

Generative artificial intelligence and the marginalization of minoritized knowledges in higher education: the case of disability

Innovation: An Almost Characterization of Hallucination

The Kalman Evolve: Closing the Gap in Kalman Filtering via Interpretable Algorithm Discovery

Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

Generative Animations: A Multi-Model Pipeline for Prompt-Driven Motion Synthesis

Periodic Topological Deep Learning for Polymer Design and Discovery

Beyond the Data Mesh Illusion: Designing Modern AI-augmented Lakehouses to Bridge the Gap Between Theory and Practice

Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models