Random Matrices, Free Probability, and Deep Neural Networks
2024-09-09 · explainer
Tomohiro Hayase, Cluster Metaverse Lab, Japan.
Talk at Workshop: Bridges between Machine Learning and Quantum Information Science (CIRM)
2024/9/09-10
Based on T.H. and Ryo Karakida, "Understanding MLP-Mixer as a Wide and Sparse MLP", in ICML 2024,
and T.H. and Benoit Collins, "Asymptotic freeness of layerwise Jacobians caused by invariance of multilayer perceptron", in Comm. Math. Phys. (2023), https://link.springer.com/article/10.1007/s00220-022-04441-7.
Table of Contents
1. Overview: Deep neural networks and Gaussian processes
2. Jacobian: Stability of DNNs and random matrices
3. NTK: Training dynamics and random matrices
4. Asymptotic freeness
5. Main theorem: asymptotic freeness of Jacobians
Consider two inputs $x, x'$ and the corresponding hidden units $x_\ell, x_\ell'$ and $h_\ell, h_\ell'$ in an MLP.
Taking the infinite-width limit at initialization, the network behaves as a Gaussian process [Lee+, ICLR 2018].
[Lee et al., Deep Neural Networks as Gaussian Processes, ICLR 2018]
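As a rough numerical illustration of this Gaussian-process limit (the setup below, a small tanh MLP evaluated at two fixed inputs, is my own sketch rather than the talk's):

```python
# Sketch (my assumptions): sample the scalar output of a freshly initialized
# 3-layer tanh MLP at two fixed inputs over many random initializations, and
# watch the empirical output covariance stabilize as the width grows -- the
# NNGP picture of [Lee+, ICLR 2018].
import numpy as np

rng = np.random.default_rng(0)

def mlp_outputs(X, widths):
    """Forward the columns of X through a freshly sampled random tanh MLP."""
    h = X
    d_prev = X.shape[0]
    for d in widths:
        W = rng.normal(0.0, 1.0 / np.sqrt(d_prev), size=(d, d_prev))
        h = np.tanh(W @ h)
        d_prev = d
    w_out = rng.normal(0.0, 1.0 / np.sqrt(d_prev), size=d_prev)
    return w_out @ h                      # one scalar output per input column

d_in = 10
X = rng.normal(size=(d_in, 2))            # two fixed inputs x and x'
for width in [64, 256, 512]:
    outs = np.array([mlp_outputs(X, [width] * 3) for _ in range(500)])
    # Empirical 2x2 covariance over random initializations: it stabilizes as
    # the width grows, consistent with a Gaussian process at initialization.
    print(width, np.cov(outs.T).round(3))
```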
2. Jacobian
Vanishing/Exploding Gradients
Optimizing a DNN requires the derivatives with respect to its parameters.
Since a DNN is a composition of functions, the chain rule computes these parameter derivatives. The input-output Jacobian is defined as $J = \partial x_L / \partial x_0 = \prod_{\ell=L}^{1} D_\ell W_\ell$, where $W_\ell$ is the $\ell$-th weight matrix and $D_\ell$ is the diagonal matrix of activation derivatives.
A DNN is said to achieve dynamical isometry if the eigenvalue distribution of $JJ^\top$ is concentrated around one.
Dynamical isometry prevents exploding/vanishing gradients.
[Pennington+, AISTATS 2018; CH, CMP 2022] If we set the initialization of parameters to be Haar orthogonal and choose an appropriate activation function, then we can make the DNN achieve dynamical isometry.
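A minimal numerical sketch of this contrast (the width, depth, input scale, and hard-tanh activation below are my own choices, not the talk's exact setup):

```python
# Compare the singular values of the input-output Jacobian J = D_L W_L ... D_1 W_1
# of a deep hard-tanh MLP under Haar-orthogonal vs. Gaussian initialization.
# With orthogonal weights and activations kept in their linear region, the
# spectrum concentrates near 1 (dynamical isometry); with Gaussian weights it
# spreads over orders of magnitude as depth grows.
import numpy as np
from scipy.stats import ortho_group

rng = np.random.default_rng(0)
N, L = 400, 30                      # width and depth

def jacobian_singular_values(orthogonal):
    x = 0.1 * rng.normal(size=N)    # small inputs keep hard-tanh nearly linear
    J = np.eye(N)
    for _ in range(L):
        W = ortho_group.rvs(N, random_state=rng) if orthogonal \
            else rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))
        h = W @ x
        d = (np.abs(h) <= 1.0).astype(float)   # derivative of hard-tanh
        J = np.diag(d) @ W @ J                 # layerwise Jacobian D_l W_l
        x = np.clip(h, -1.0, 1.0)
    return np.linalg.svd(J, compute_uv=False)

for flag, name in [(True, "orthogonal"), (False, "gaussian")]:
    s = jacobian_singular_values(flag)
    print(f"{name:10s}  min={s.min():.3e}  median={np.median(s):.3f}  max={s.max():.3f}")
```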
Let $\mu_L$ and $\nu$ be the limiting spectral distributions of $JJ^\top$ and $D^2$, respectively, in the wide limit.
Under the assumption of asymptotic freeness of the Jacobian factors, $\mu_L$ can be computed layer by layer, as sketched below.
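A minimal sketch of that computation (my assumptions: the weights are orthogonal, so $S_{W_\ell W_\ell^\top} = 1$, and every $D_\ell^2$ has the common limit $\nu$): since the S-transform is multiplicative over free elements,

$$
S_{\mu_L}(z) \;=\; \prod_{\ell=1}^{L} S_{D_\ell^2}(z)\, S_{W_\ell W_\ell^\top}(z) \;=\; S_{\nu}(z)^{L},
\qquad \text{i.e. } \mu_L = \nu^{\boxtimes L},
$$

the $L$-fold free multiplicative convolution of $\nu$.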
Moreover, the NTK is applicable to attention: Infinite Attention: NNGP and NTK for Deep Attention Networks [https://arxiv.org/abs/2006.10540]
Eigenvalue Spectrum of NTK
"Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks", Z. Fan & Z. Wang, https://arxiv.org/abs/2005.11879
They treat the standard formulation (Gaussian initialization × multiple samples × small output dimension) and derive a recursive equation for the limiting spectral distribution of the NTK.
Figures: red lines are the theoretical predictions.
One-sample NTK
T.H. & R. Karakida, "The Spectrum of Fisher Information of Deep Networks Achieving Dynamical Isometry", https://arxiv.org/abs/2006.07814, in AISTATS 2020.
When the DNN achieves dynamical isometry, the spectrum of the (one-sample × high-dimensional output) NTK concentrates around its maximal value, and the maximal value is O(L).
(Sketch) Under an assumption of asymptotic freeness, we have the following recursive equations:
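The recursion itself is not reproduced here; as background, a standard one-sample NTK decomposition it can build on (my notation, biases omitted) writes the NTK as a sum of partial Jacobian Gram matrices,

$$
\Theta \;=\; \sum_{\ell=1}^{L} \frac{\partial f}{\partial W_\ell}\left(\frac{\partial f}{\partial W_\ell}\right)^{\!\top}
\;=\; \sum_{\ell=1}^{L} \lVert x_{\ell-1} \rVert^{2}\, J_\ell J_\ell^{\top},
\qquad J_\ell = \frac{\partial f}{\partial h_\ell},
$$

so asymptotic freeness lets one track the spectra of the partial Jacobian products $J_\ell J_\ell^{\top}$ layer by layer.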
The spectrum (eigenvalues) of the NTK plays a vital role in tuning the learning dynamics.
E.g. $\eta > 1/\lambda_{\max}(\Theta)$ $\implies$ the learning dynamics do not converge.
E.g. the condition number $c = \lambda_{\min}/\lambda_{\max}$ determines the convergence speed.
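A toy illustration of both points (mine, not the talk's experiment): gradient descent on a quadratic loss written in the eigenbasis of a matrix standing in for the NTK.

```python
# Gradient descent on the quadratic loss 0.5 * r^T K r, in the eigenbasis of K
# (K stands in for the NTK). Too large a learning rate relative to
# 1/lambda_max(K) makes the iteration diverge, and the condition number
# lambda_min/lambda_max governs how fast the residual shrinks.
import numpy as np

def run_gd(eigvals, eta, steps=100):
    """Residual norm after `steps` of the per-mode update r <- (1 - eta*lambda) r."""
    r = np.ones_like(eigvals)
    for _ in range(steps):
        r = r - eta * eigvals * r
    return np.linalg.norm(r)

well_conditioned = np.linspace(0.9, 1.0, 50)    # lambda_min/lambda_max ~ 0.9
ill_conditioned  = np.linspace(0.01, 1.0, 50)   # lambda_min/lambda_max ~ 0.01

for name, lam in [("well-conditioned", well_conditioned),
                  ("ill-conditioned", ill_conditioned)]:
    for eta in [0.5, 1.0, 2.5]:                  # 2.5 is past the stability threshold
        print(f"{name:17s} eta={eta:>4}: residual after 100 steps = {run_gd(lam, eta):.2e}")
```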
Red line (the borderline of exploding gradients):
This line is predicted by our theory!
G. Yang extended the NTK to a more general parametrization: the maximal update parametrization (muP)!
4. Asymptotic Freeness
Asymptotic Freeness and Free Probability Theory
Definition (Asymptotic freeness, C*-version) [Voiculescu '85]
Let $(A_j(n), A_j(n)^*)_{j \in J}$ be a family of $n \times n$ random matrices and their adjoints.
The family is said to be asymptotically free almost surely if there exist C*-probability spaces $(\mathcal{A}_j, \varphi_j)_{j \in J}$ and elements $(a_j \in \mathcal{A}_j)_{j \in J}$ such that for any $Q \in \mathbb{C}\langle X_j, X_j^* \mid j \in J \rangle$, the following holds almost surely:
$$\lim_{n \to \infty} \frac{1}{n}\operatorname{Tr}\, Q\big((A_j(n), A_j(n)^*)_{j \in J}\big) \;=\; \big(*_{j \in J}\, \varphi_j\big)\big(Q\big((a_j, a_j^*)_{j \in J}\big)\big),$$
where $*_{j \in J}\varphi_j$ is the free product of the tracial states $\varphi_j$.
Example
For $N \in \mathbb{N}$, let
$W(N)$ be a Ginibre or Haar orthogonal random matrix,
$D(N)$ be a constant diagonal matrix with a limiting distribution as $N \to \infty$.
Then
$(W, W^*)$ and $D$ are a.s. asymptotically free as $N \to \infty$.
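A quick numerical sanity check of this example (the script and its parameters are my own):

```python
# For a Haar orthogonal W and a centered constant diagonal
# D = diag(+1, ..., +1, -1, ..., -1), asymptotic freeness of (W, W^T) and D
# predicts that normalized traces of alternating centered words such as
# D W D W^T vanish as N grows.
import numpy as np
from scipy.stats import ortho_group

rng = np.random.default_rng(0)
for N in [100, 400, 800]:
    D = np.diag(np.concatenate([np.ones(N // 2), -np.ones(N // 2)]))  # tr D = 0
    vals = []
    for _ in range(10):
        W = ortho_group.rvs(N, random_state=rng)          # Haar orthogonal matrix
        vals.append(np.trace(D @ W @ D @ W.T) / N)        # (1/N) Tr(D W D W^T)
    print(f"N={N:4d}  mean |(1/N) Tr(D W D W^T)| over 10 draws: {np.mean(np.abs(vals)):.4f}")
```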
Asymptotic Freeness of Jacobians
Let $W_\ell$ and $D_\ell$ $(\ell = 1, 2, \ldots, L)$ be the weight matrices and the diagonal matrices of activation derivatives of an MLP, with the $W_\ell$ scaled Haar orthogonal random matrices. (The Gaussian case is treated by [B. Hanin and M. Nica], [L. Pastur], [G. Yang].)
Theorem [CH22]
Assume that $D_1, \ldots, D_{L-1}$ have limiting joint moments. Then the layerwise Jacobians are asymptotically free almost surely in the wide limit.
We also define an inverse operation to recover the matrix representation.
There exists a well-known identity relating the vectorization operation and the tensor (or Kronecker) product, denoted by $\otimes$:
$$\mathrm{vec}(WXV) = (V^\top \otimes W)\,\mathrm{vec}(X),$$
for $W \in \mathbb{R}^{S \times S}$ and $V \in \mathbb{R}^{C \times C}$.
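A quick NumPy check of this identity (shapes and variable names are my own, using column-major vectorization):

```python
# Verify vec(W X V) = (V^T kron W) vec(X) with column-major ("Fortran") vec(.).
import numpy as np

rng = np.random.default_rng(0)
S, C = 4, 3
W = rng.normal(size=(S, S))
V = rng.normal(size=(C, C))
X = rng.normal(size=(S, C))

vec = lambda A: A.reshape(-1, order="F")      # column-stacking vectorization

lhs = vec(W @ X @ V)
rhs = np.kron(V.T, W) @ vec(X)
print(np.allclose(lhs, rhs))                   # True
```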
As discussed later, the aforementioned equation corresponds to the vectorization of an MLP-Mixer block with a linear activation function.
The vectorization of the feature matrix $WXV$ is equivalent to a fully connected layer of width $m = SC$ with weight matrix $V^\top \otimes W$. We refer to this $m$ as the *effective width* of mixing layers.
Under vectorization of feature matrices,
the channel-mixing layer is converted into:
$$\mathrm{vec}(X) \mapsto (I_C \otimes W)\,\mathrm{vec}(X),$$
and the token-mixing layer is converted into:
$$\mathrm{vec}(X) \mapsto (V^\top \otimes I_S)\,\mathrm{vec}(X).$$
In the MLP-Mixer, when we treat each $S \times C$ feature matrix $X$ as an $SC$-dimensional vector $\mathrm{vec}(X)$, the right multiplication by a $C \times C$ weight $V$ and the left multiplication by an $S \times S$ weight $W$ are represented by the Kronecker-product maps above.
Static mask: consider a matrix $M$ with entries 0 or 1 and replace the weight $W$ in each layer of the MLP by the Hadamard product $M \odot W$:
$$y = \varphi((M \odot W)x), \qquad M \sim \mathrm{Bernoulli}(p).$$
The mask matrix $M$ is fixed during training.
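A minimal sketch of such a statically masked layer (the class name, weight scaling, and tanh activation are my own choices, not the paper's code):

```python
# Static-mask sparse layer: a Bernoulli(p) 0/1 mask is sampled once at
# construction and kept fixed; every forward pass uses the elementwise
# product M * W, i.e. y = phi((M ⊙ W) x).
import numpy as np

class StaticMaskLayer:
    def __init__(self, d_in, d_out, p, rng):
        # Scaling by 1/sqrt(p * d_in) (my choice) keeps the pre-activation
        # variance roughly constant despite the sparsity.
        self.W = rng.normal(0.0, 1.0 / np.sqrt(p * d_in), size=(d_out, d_in))
        self.M = (rng.random((d_out, d_in)) < p).astype(float)  # fixed mask
        self.b = np.zeros(d_out)

    def __call__(self, x):
        return np.tanh((self.M * self.W) @ x + self.b)

rng = np.random.default_rng(0)
layer = StaticMaskLayer(d_in=512, d_out=512, p=1 / 64, rng=rng)
y = layer(rng.normal(size=512))
print(y.shape, float(layer.M.mean()))   # (512,) and a mask density close to p
```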
Hidden features and test accuracy
To validate the similarity of networks in a robust and scalable way, we look at the similarity of hidden features of MLPs with sparse weights and MLP-Mixers based on the centered kernel alignment (CKA) [Nguyen, Raghu, Kornblith].
In practice, we computed the mini-batch CKA (Section 3.1, Nguyen et al. 2021) among features of trained networks.
Each experiment is done on CIFAR-10 with four random seeds.
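For reference, a minimal linear-CKA computation between two feature matrices; this is the standard linear CKA of Kornblith et al., not necessarily the exact mini-batch variant used in the experiments:

```python
# Linear CKA between feature matrices X (n x d1) and Y (n x d2), rows = examples.
import numpy as np

def linear_cka(X, Y):
    X = X - X.mean(axis=0, keepdims=True)     # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(256, 64))           # e.g. hidden features of one network
feats_b = feats_a @ rng.normal(size=(64, 64))  # a linearly related representation
print(linear_cka(feats_a, feats_b), linear_cka(feats_a, rng.normal(size=(256, 64))))
```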
(a) Average of diagonal entries of CKA between a trained MLP-Mixer (S=C=64, 32) and MLPs with a static mask at different sparsities p (= ratio of 1s among the mask entries). Sparser MLPs were more similar to the Mixer.
(b) CKA between MLP-Mixer (S=C=64) and MLP with the corresponding sparsity 1/64, and
(c) CKA between the Mixer and a dense MLP.
(d) Test accuracy of MLPs with sparse weights and MLP-Mixers with different widths under $\Omega = 2^{19}$.