Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition

Abstract

Most first-order optimizers treat matrix-valued parameters as vectors, ignoring the intrinsic geometry of hidden-layer weights in neural networks. Muon addresses this mismatch by updating along the polar factor of a momentum matrix, but its theoretical understanding has lagged behind practice. In particular, practical implementations incorporate Nesterov momentum, compute the polar factor only approximately, and operate with stochastic gradients that may be heavy-tailed. We close this gap by developing a convergence theory for Muon with Nesterov momentum and inexact polar decomposition in non-convex matrix optimization under heavy-tailed noise. Our analysis builds on a unified framework for inexact polar decomposition that captures practical iterative approximations such as Newton-Schulz and quantifies how their errors propagate through the optimization dynamics. Under this framework, we establish an optimal iteration and sample complexity of $O\left(\varepsilon^{-\frac{3\alpha-2}{\alpha-1}}\right)$ for finding an $\varepsilon$-stationary point, where $\alpha\in(1,2]$ denotes the heavy-tail index. For the inexact-polar setting with $\sigma_1=0$, we also provide guarantees that do not require prior knowledge of $\alpha$. We analyze a randomized low-rank polar decomposition that is substantially more efficient than full-space methods while remaining compatible with our theory. Numerical experiments further demonstrate the effectiveness of the proposed inexact and randomized variants.
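To make the ingredients named in the abstract concrete, the following is a minimal NumPy sketch of one Muon-style update with Nesterov momentum and an inexact polar factor. It is an illustration under stated assumptions, not the paper's algorithm: the function names, the step size `lr`, the momentum weight `beta`, the convex-combination form of the momentum, and the use of the classical cubic Newton-Schulz iteration are all assumptions made here (practical Muon implementations often use a tuned higher-order polynomial instead).

```python
import numpy as np


def newton_schulz_polar(M, steps=5, eps=1e-8):
    """Approximate the polar factor U V^T of M = U S V^T.

    Uses the classical cubic Newton-Schulz iteration (an assumption of this
    sketch). With few steps the result is only an inexact polar factor.
    """
    # Frobenius normalization keeps all singular values in (0, 1],
    # inside the convergence region of the iteration.
    X = M / (np.linalg.norm(M) + eps)
    for _ in range(steps):
        # Each singular value s is mapped to 1.5*s - 0.5*s**3, which pushes it toward 1.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


def muon_nesterov_step(W, M, G, lr=0.02, beta=0.9, ns_steps=5):
    """One hypothetical Muon-style step; returns (new weights, new momentum).

    W: weight matrix, M: momentum buffer, G: stochastic gradient (same shape).
    """
    # Exponential moving average of gradients (assumed momentum form).
    M_new = beta * M + (1.0 - beta) * G
    # Nesterov look-ahead: blend the fresh gradient with the updated momentum.
    N = beta * M_new + (1.0 - beta) * G
    # Move along the (inexact) polar factor of the look-ahead matrix.
    W_new = W - lr * newton_schulz_polar(N, steps=ns_steps)
    return W_new, M_new
```

In this sketch, `ns_steps` controls the inexactness: a small value leaves the smaller singular values of the update below 1, while a large value approaches the exact polar factor that a full SVD would give.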

Publication
arXiv preprint arXiv:2605.06884
Sen Na
Assistant Professor in ISyE

Sen Na is an Assistant Professor in the School of Industrial and Systems Engineering at Georgia Tech. Prior to joining ISyE, he was a postdoctoral researcher in the statistics department and ICSI at UC Berkeley. His research interests broadly lie in the mathematical foundations of data science, with topics including high-dimensional statistics, graphical models, semiparametric models, optimal control, and large-scale and stochastic nonlinear optimization. He is also interested in applying machine learning methods to problems in biology, neuroscience, and engineering.

Mladen Kolar
Professor in Data Sciences and Operations

Mladen Kolar is a professor in the Department of Data Sciences and Operations at the USC Marshall School of Business. His research focuses on high-dimensional statistical methods, probabilistic graphical models, and scalable optimization methods, driven by the need to uncover interesting and scientifically meaningful structures from observational data. Mladen was selected as a recipient of the 2024 Junior Leo Breiman Award for his outstanding contributions to these areas.