(ICLR 2024 spotlight) How Over-Parameterization Slows Down Gradient Descent in Matrix Sensing: The Curses of Symmetry and Initialization

Published in International Conference on Learning Representations, 2024

This paper rigorously shows how over-parameterization dramatically changes the convergence behaviors of gradient descent (GD) for the matrix sensing problem, where the goal is to recover an unknown low-rank ground-truth matrix from near-isotropic linear measurements. First, we consider the symmetric setting with the symmetric parameterization, where $M^* \in \mathbb{R}^{n \times n}$ is a positive semi-definite unknown matrix of rank $r \ll n$, and one uses a symmetric parameterization $XX^\top$ to learn $M^*$. Here $X \in \mathbb{R}^{n \times k}$ with $k > r$ is the factor matrix. We give a novel $\Omega(1/T^2)$ lower bound for randomly initialized GD in the over-parameterized case ($k > r$), where $T$ is the number of iterations. This is in stark contrast to the exact-parameterization scenario ($k = r$), where the convergence rate is $\exp(-\Omega(T))$. Next, we study the asymmetric setting, where $M^* \in \mathbb{R}^{n_1 \times n_2}$ is the unknown matrix of rank $r \ll \min\{n_1, n_2\}$, and one uses an asymmetric parameterization $FG^\top$ to learn $M^*$, where $F \in \mathbb{R}^{n_1 \times k}$ and $G \in \mathbb{R}^{n_2 \times k}$. Building on prior work, we give a global exact convergence result for randomly initialized GD in the exact-parameterization case ($k = r$) with an $\exp(-\Omega(T))$ rate. Furthermore, we give the first global exact convergence result for the over-parameterization case ($k > r$) with an $\exp(-\Omega(\alpha^2 T))$ rate, where $\alpha$ is the initialization scale. This linear convergence result in the over-parameterization case is especially significant because one can apply the asymmetric parameterization to the symmetric setting and thereby speed up from $\Omega(1/T^2)$ to linear convergence. Therefore, we identify a surprising phenomenon: asymmetric parameterization can exponentially speed up convergence. Equally surprising, our analysis highlights the importance of the imbalance between $F$ and $G$, in sharp contrast to prior works which emphasize balance. We further give an example showing that the dependence on $\alpha$ in the convergence rate is unavoidable in the worst case. On the other hand, we propose a novel method that modifies only one step of GD and obtains a convergence rate independent of $\alpha$, recovering the rate of the exact-parameterization case. We provide empirical studies to verify our theoretical findings.

Download Paper Here
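
To make the two parameterizations concrete, below is a minimal NumPy sketch (not the paper's experimental code) that runs GD on the idealized full-observation loss, a special case of matrix sensing where the measurement operator is the identity, comparing the over-parameterized symmetric factorization $XX^\top$ with the asymmetric factorization $FG^\top$ under a small random initialization of scale $\alpha$. The problem sizes, step size, and iteration count are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's settings).
n, r, k = 50, 3, 5                   # ambient dimension, true rank, over-parameterized rank (k > r)
T, eta, alpha = 20_000, 0.1, 1e-3    # iterations, step size, small initialization scale

# Rank-r PSD ground truth M* = U U^T, normalized to unit spectral norm.
U = rng.standard_normal((n, r))
M_star = U @ U.T
M_star /= np.linalg.norm(M_star, 2)

def err(M_hat):
    """Squared Frobenius distance to the ground truth."""
    return np.linalg.norm(M_hat - M_star, "fro") ** 2

# Symmetric over-parameterization: GD on (1/4) * ||X X^T - M*||_F^2.
X = alpha * rng.standard_normal((n, k))
for _ in range(T):
    X = X - eta * ((X @ X.T - M_star) @ X)      # gradient of the symmetric loss

# Asymmetric over-parameterization: GD on (1/2) * ||F G^T - M*||_F^2.
F = alpha * rng.standard_normal((n, k))
G = alpha * rng.standard_normal((n, k))
for _ in range(T):
    R = F @ G.T - M_star                        # residual
    F, G = F - eta * (R @ G), G - eta * (R.T @ F)

print(f"symmetric  XX^T:  error = {err(X @ X.T):.3e}")
print(f"asymmetric FG^T:  error = {err(F @ G.T):.3e}")
```

Per the paper's theory, the symmetric over-parameterized run should be slowed to a polynomial rate (consistent with the $\Omega(1/T^2)$ lower bound), while the asymmetric run should keep converging linearly, with a rate constant that degrades with $\alpha^2$; the paper's proposed one-step modification of GD removes that dependence on $\alpha$.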