Rethinking SGD's Role in Deep Learning: Noise, Curvature, and Beyond

By Signe Eriksen, June 9, 2026

New research challenges the assumptions about Stochastic Gradient Descent (SGD) noise in deep learning. The findings reveal a nuanced relationship between noise and curvature, offering insights into optimization strategies.

Stochastic Gradient Descent (SGD) has long been a cornerstone in training deep neural networks, but recent findings suggest we may have oversimplified its behavior. By introducing anisotropic noise, SGD biases optimization toward flat minima, but the underlying assumptions about noise and curvature need reevaluation.

Unpacking Assumptions

Traditionally, the Fisher Information Matrix has been equated with the Hessian for negative log-likelihood losses. This led to the belief that the noise covariance in SGD, denoted C, mirrors the Hessian H. However, this assumption is tenuous at best and often fails in the complex terrain of deep neural networks.
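To make the traditional assumption concrete, here is a minimal sketch of the textbook case in which it does hold: a linear model with a squared loss (a Gaussian negative log-likelihood up to a constant), evaluated at the weights that generated the data. The model, sample size, and all variable names are illustrative assumptions and are not taken from the paper.

```python
import random

def outer(u, v):
    """Outer product of two vectors as a nested list."""
    return [[ui * vj for vj in v] for ui in u]

def add_scaled(acc, mat, scale):
    """In-place acc += scale * mat for small nested-list matrices."""
    for i, row in enumerate(mat):
        for j, val in enumerate(row):
            acc[i][j] += scale * val

rng = random.Random(0)
w = [1.5, -0.5]   # hypothetical weights, used both to generate and to evaluate
n = 20000         # number of samples (assumption)

hessian = [[0.0, 0.0], [0.0, 0.0]]             # H = mean of x x^T
grad_second_moment = [[0.0, 0.0], [0.0, 0.0]]  # mean of g_p g_p^T
mean_grad = [0.0, 0.0]                         # full-batch gradient

for _ in range(n):
    # Inputs x ~ N(0, I); labels carry unit-variance Gaussian noise.
    x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    y = w[0] * x[0] + w[1] * x[1] + rng.gauss(0.0, 1.0)
    # Per-sample loss 0.5 * (w.x - y)^2 has gradient residual * x and Hessian x x^T.
    residual = w[0] * x[0] + w[1] * x[1] - y
    grad = [residual * x[0], residual * x[1]]
    add_scaled(hessian, outer(x, x), 1.0 / n)
    add_scaled(grad_second_moment, outer(grad, grad), 1.0 / n)
    mean_grad[0] += grad[0] / n
    mean_grad[1] += grad[1] / n

# Per-sample gradient covariance: C = E[g g^T] - E[g] E[g]^T.
mean_outer = outer(mean_grad, mean_grad)
noise_cov = [[grad_second_moment[i][j] - mean_outer[i][j] for j in range(2)]
             for i in range(2)]

max_gap = max(abs(noise_cov[i][j] - hessian[i][j])
              for i in range(2) for j in range(2))
print("largest |C_ij - H_ij|:", max_gap)
```

In this well-specified setting the two matrices agree up to sampling error. The sketch says nothing about deep networks, where the article notes that the equality often breaks down.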

The paper's key contribution is to use the Activity-Weight Duality to establish a more general relationship. Instead of tying C directly to H, it relates C to the expected square of the per-sample Hessian h_p: C ∝ E_p[h_p²]. This suggests that C and H approximately commute, rather than align perfectly.
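The difference between commuting and aligning can be checked numerically. The sketch below assumes the idealized case in which every per-sample Hessian equals H, so that E_p[h_p²] reduces to H². The helper `relative_commutator` and the example matrices are hypothetical, chosen only for illustration.

```python
import math

def matmul(a, b):
    """Product of two small matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def frobenius(a):
    """Frobenius norm of a nested-list matrix."""
    return math.sqrt(sum(v * v for row in a for v in row))

def relative_commutator(c, h):
    """||CH - HC||_F / (||C||_F ||H||_F); zero exactly when C and H commute."""
    ch, hc = matmul(c, h), matmul(h, c)
    diff = [[ch[i][j] - hc[i][j] for j in range(len(ch[0]))]
            for i in range(len(ch))]
    return frobenius(diff) / (frobenius(c) * frobenius(h))

h = [[2.0, 1.0], [1.0, 3.0]]        # a symmetric stand-in for the Hessian
c_squared = matmul(h, h)            # idealized C proportional to H^2
c_other = [[1.0, 0.0], [0.0, 4.0]]  # a covariance with a different eigenbasis

print(relative_commutator(c_squared, h))  # 0.0
print(relative_commutator(c_other, h))    # clearly nonzero
```

Here `c_squared` is not a scalar multiple of `h`, yet the two commute because they share eigenvectors. That is the weaker sense in which C and H can be related without being equal.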

Deep Dive into Layer Dynamics

What does this mean for deep learning practitioners? The authors' experiments across different datasets, architectures, and loss functions reveal that, within fully connected layers, the diagonal elements of C and H satisfy empirical power laws. Specifically, C_ii ∝ H_ii^γ holds, with exponents γ ranging between 1 and 2.
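One common way to estimate such an exponent is a straight-line fit in log-log space. The sketch below uses synthetic diagonals generated with a known exponent. It is not the paper's data or fitting procedure, and `fit_power_law` is a hypothetical helper that assumes all diagonal entries are strictly positive.

```python
import math

def fit_power_law(h_diag, c_diag):
    """Least-squares fit of log C_ii = gamma * log H_ii + log(prefactor)."""
    xs = [math.log(h) for h in h_diag]
    ys = [math.log(c) for c in c_diag]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx
    prefactor = math.exp(mean_y - gamma * mean_x)
    return gamma, prefactor

# Synthetic diagonals spanning about three decades, built with gamma = 1.5.
h_diag = [10.0 ** (-k / 4) for k in range(12)]
c_diag = [0.3 * h ** 1.5 for h in h_diag]

gamma, prefactor = fit_power_law(h_diag, c_diag)
print(f"gamma = {gamma:.3f}, prefactor = {prefactor:.3f}")
```

With real measurements the points scatter around the line, so it is worth plotting the log-log data per layer before trusting a single fitted exponent.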

These findings provide a unified view of the noise-curvature interplay, essential for those fine-tuning network layers. Isn't it worth reconsidering our optimization strategies in light of this evidence? This layerwise characterization could redefine how we approach network architecture and training.

Why It Matters

In a field driven by rapid innovation, understanding the nuances of optimization is critical. The study challenges previous dogmas, urging a fresh look at SGD's role. While some might argue that the technicalities seem minor, they hold significant implications for achieving state-of-the-art (SOTA) performance.

Given the data-centric approach of modern AI, isn't it essential to ensure our foundational assumptions are sound? By peeling back the layers of noise and curvature, this research not only informs better practices but sparks a dialogue about the very models we rely on. The ablation study reveals the depth of these insights, furthering the potential of SGD in future AI endeavors.
