Asymptotic Freeness of MLP and Related Topics
Free Probability Theory (FPT) provides rich knowledge for handling mathematical difficulties caused by random matrices appearing in research related to deep neural networks (DNNs). However, the critical assumption of asymptotic freeness of the layerwise Jacobians had not been proven mathematically so far. In this work, we prove asymptotic freeness of the layerwise Jacobians of a multilayer perceptron (MLP) in the case of Haar orthogonal weight initialization. A key to the proof is an invariance of the MLP.
Asymptotic freeness in multilayer perceptron and related topics
Tomohiro Hayase
Affiliation: Cluster Metaverse Lab.
Talk at Random Matrix and Applications, RIMS, Kyoto, June 5–9, 2023.
Based on T.H. and Benoit Collins, "Asymptotic freeness of layerwise Jacobians caused by invariance of multilayer perceptron", Comm. Math. Phys. (2023), https://link.springer.com/article/10.1007/s00220-022-04441-7. Supported by JST ACT-X and JSPS Sakura Program.
Table of Contents
- Overview: Deep neural network and Gaussian process
- Jacobian: Stability of DNN and random matrices
- NTK: Training dynamics and random matrices
- Asymptotic Freeness: Main theorem: asymptotic freeness of Jacobians
- Summary & In Progress
Overview
Multilayer Perceptron
[Figure: https://www.javatpoint.com/multi-layer-perceptron-in-tensorflow]
Let the parameters be weight matrices $W^\ell \in \mathbb{R}^{N \times N}$ and bias vectors $b^\ell \in \mathbb{R}^{N}$ for $\ell = 1, \dots, L$.
Forward propagation: for an input $x$, set $x^0 = x$ and inductively
$$h^\ell = W^\ell x^{\ell-1} + b^\ell, \qquad x^\ell = \varphi(h^\ell), \qquad \ell = 1, \dots, L.$$
Finally, define the output by $f_\theta(x) = x^L$.
$\varphi$: activation function (applied entrywise).
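To fix the notation, here is a minimal NumPy sketch of this forward propagation (a toy width-$N$, depth-$L$ network with tanh activation and Gaussian weights of variance $\sigma_w^2/N$; all names and hyperparameters are illustrative):

```python
import numpy as np

def mlp_forward(x, weights, biases, phi=np.tanh):
    """Forward propagation: h^l = W^l x^{l-1} + b^l, x^l = phi(h^l)."""
    activations = [x]
    for W, b in zip(weights, biases):
        h = W @ x + b        # pre-activation h^l
        x = phi(h)           # post-activation x^l
        activations.append(x)
    return x, activations

# Toy instance: width N, depth L, Gaussian weights of variance sigma_w^2 / N.
rng = np.random.default_rng(0)
N, L, sigma_w = 256, 5, 1.0
weights = [rng.standard_normal((N, N)) * sigma_w / np.sqrt(N) for _ in range(L)]
biases = [np.zeros(N) for _ in range(L)]
x0 = rng.standard_normal(N)
out, acts = mlp_forward(x0, weights, biases)
```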
Deep Learning
Generally, a standard formulation of supervised deep learning is as follows:
- We are given a finite set of input/output data pairs $(x_i, y_i)_{i=1}^n$.
- We are given a deep neural network (DNN), which is a composition of (parameterized) transformations and maps a real vector to a real vector.
- We are given an objective (loss) function, e.g. the mean squared loss $\mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \| f_\theta(x_i) - y_i \|^2$.
Optimization
We minimize the loss function by gradient descent:
$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t),$$
where $\eta > 0$ is the learning rate.
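In code, the update rule is just the following loop (a schematic sketch; `grad_loss` stands for whatever computes $\nabla_\theta \mathcal{L}$, e.g. backpropagation):

```python
import numpy as np

def gradient_descent(theta, grad_loss, lr=0.1, steps=100):
    """Plain gradient descent: theta_{t+1} = theta_t - lr * grad L(theta_t)."""
    for _ in range(steps):
        theta = theta - lr * grad_loss(theta)
    return theta

# Toy usage: minimize L(theta) = ||theta - target||^2 / 2, whose gradient is theta - target.
target = np.array([1.0, -2.0, 3.0])
theta_hat = gradient_descent(np.zeros(3), grad_loss=lambda th: th - target)
```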
Initialization of Parameters and Random Matrices
e.g. Gaussian (Ginibre) random matrix: the entries of $W^\ell$ are i.i.d. $\mathcal{N}(0, \sigma_w^2 / N)$.
e.g. Haar-distributed orthogonal matrix: $W^\ell = \sigma_w\, O$ with $O$ drawn from the Haar measure on the orthogonal group $O(N)$.
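Both initializations are easy to sample; the Haar orthogonal one is usually obtained from the QR decomposition of a Gaussian matrix with a sign correction. A sketch (function names are illustrative):

```python
import numpy as np

def ginibre(n, rng, sigma=1.0):
    """Ginibre-type initialization: i.i.d. Gaussian entries of variance sigma^2 / n."""
    return rng.standard_normal((n, n)) * sigma / np.sqrt(n)

def haar_orthogonal(n, rng):
    """Haar-distributed orthogonal matrix via QR of a square Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Multiplying each column by the sign of the corresponding diagonal entry of R
    # removes the sign ambiguity of QR and makes the law exactly Haar.
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(1)
W = haar_orthogonal(512, rng)
print(np.allclose(W.T @ W, np.eye(512)))  # orthogonality check: True
```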
The Infinite-dimensional Limit is Gaussian
[Figure: https://ai.googleblog.com/2020/03/fast-and-easy-infinitely-wide-networks.html]
Neural Network Gaussian Process (NNGP)
Consider two inputs $x, x'$ and the corresponding hidden units $h^\ell(x)$ and $h^\ell(x')$ in the MLP. Taking the infinite-dimensional limit at the initial state, we have [Lee+ ICLR2018]
$$\big(h^\ell_i(x), h^\ell_i(x')\big) \;\to\; \mathcal{N}(0, \Lambda^\ell), \qquad \Lambda^\ell = \begin{pmatrix} K^\ell(x, x) & K^\ell(x, x') \\ K^\ell(x', x) & K^\ell(x', x') \end{pmatrix},$$
where $K^\ell$ is the NNGP kernel at depth $\ell$.
We have the following kernel propagation:
$$K^{\ell+1}(x, x') = \sigma_b^2 + \sigma_w^2 \, \mathbb{E}_{(u, v) \sim \mathcal{N}(0, \Lambda^\ell)}\big[\varphi(u)\,\varphi(v)\big].$$
- For some activation functions, we can compute the integral explicitly.
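For example, for $\varphi = \mathrm{erf}$ the Gaussian expectation has the closed form $\mathbb{E}[\mathrm{erf}(u)\,\mathrm{erf}(v)] = \tfrac{2}{\pi} \arcsin\!\big(2\Lambda_{12}/\sqrt{(1+2\Lambda_{11})(1+2\Lambda_{22})}\big)$ (Williams' arcsine kernel), so the kernel propagation can be iterated exactly. A sketch with illustrative $\sigma_w, \sigma_b$:

```python
import numpy as np

def erf_kernel_step(k_xx, k_xy, k_yy, sigma_w=1.0, sigma_b=0.0):
    """One step of NNGP kernel propagation for phi = erf, using the closed-form
    Gaussian expectation E[erf(u) erf(v)] = (2/pi) arcsin(2 K_xy / sqrt((1+2K_xx)(1+2K_yy)))."""
    e = (2.0 / np.pi) * np.arcsin(2.0 * k_xy / np.sqrt((1.0 + 2.0 * k_xx) * (1.0 + 2.0 * k_yy)))
    return sigma_b**2 + sigma_w**2 * e

# Propagate the kernel of two inputs x, y through 5 layers, starting from K^0 = <x, y> / dim.
x, y = np.array([1.0, 0.0]), np.array([0.6, 0.8])
k_xx, k_xy, k_yy = x @ x / 2, x @ y / 2, y @ y / 2
for _ in range(5):
    k_xx_new = erf_kernel_step(k_xx, k_xx, k_xx)
    k_yy_new = erf_kernel_step(k_yy, k_yy, k_yy)
    k_xy = erf_kernel_step(k_xx, k_xy, k_yy)   # uses the previous-layer diagonal entries
    k_xx, k_yy = k_xx_new, k_yy_new
print(k_xx, k_xy, k_yy)
```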
Application: NNGP Estimation
Generally, consider $n$ samples. Let $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$ be the input/output samples.
Then the posterior mean/variance is given by the following: for a new input $x_*$,
$$\mu(x_*) = K(x_*, X)\, K(X, X)^{-1} Y, \qquad \sigma^2(x_*) = K(x_*, x_*) - K(x_*, X)\, K(X, X)^{-1} K(X, x_*).$$
[Lee et al., Deep Neural Networks as Gaussian Processes, ICLR 2018]
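These are the standard Gaussian-process regression formulas with the NNGP kernel plugged in. A minimal sketch, assuming some kernel function `kern(x, y)` (e.g. the erf recursion above) and a small jitter for numerical stability:

```python
import numpy as np

def nngp_posterior(kern, X_train, y_train, x_star, jitter=1e-6):
    """GP posterior with kernel `kern`:
    mean = K(x*, X) K(X, X)^{-1} Y,  var = K(x*, x*) - K(x*, X) K(X, X)^{-1} K(X, x*)."""
    K = np.array([[kern(a, b) for b in X_train] for a in X_train])
    k_star = np.array([kern(x_star, a) for a in X_train])
    K_inv = np.linalg.inv(K + jitter * np.eye(len(X_train)))
    mean = k_star @ K_inv @ y_train
    var = kern(x_star, x_star) - k_star @ K_inv @ k_star
    return mean, var
```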
Jacobian
Vanishing/Exploding Gradients
The optimization of a DNN needs its parameter derivatives. Since a DNN is a composition of functions, the derivatives are computed by the chain rule. The input-output Jacobian is defined as
$$J = \frac{\partial x^L}{\partial x^0}.$$
In the case of the MLP, we have
$$J = D_L W^L \cdots D_1 W^1,$$
where $D_\ell = \mathrm{diag}\big(\varphi'(h^\ell)\big)$.
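A sketch of this factorization in NumPy, accumulating the layerwise Jacobians $D_\ell W^\ell$ from the left (here for $\varphi = \tanh$; names are illustrative and match the forward pass sketched above):

```python
import numpy as np

def mlp_jacobian(x, weights, biases, phi=np.tanh, dphi=lambda h: 1.0 - np.tanh(h) ** 2):
    """Input-output Jacobian of an MLP by the chain rule: J = D_L W^L ... D_1 W^1."""
    J = np.eye(x.shape[0])
    for W, b in zip(weights, biases):
        h = W @ x + b
        D = np.diag(dphi(h))   # D_l = diag(phi'(h^l))
        J = D @ W @ J          # multiply the layerwise Jacobian from the left
        x = phi(h)
    return J
```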
Dynamical Isometry
A DNN is said to achieve dynamical isometry if the eigenvalue distribution of $JJ^\top$ is concentrated around one. Dynamical isometry prevents exploding/vanishing gradients.
[Pennington+, AISTATS2018; CH, CMP2022] If we set the initialization of the parameters to be Haar orthogonal and choose an appropriate activation function, then we can make the DNN achieve dynamical isometry.
Let $\mu_J$, $\mu_{D_\ell^2}$, and $\mu_{W_\ell W_\ell^\top}$ be the limit spectral distributions of $JJ^\top$, $D_\ell^2$, and $W^\ell (W^\ell)^\top$ in the wide limit $N \to \infty$, respectively.
Under the assumption of asymptotic freeness of the Jacobians,
$$\mu_J = \mu_{D_L^2} \boxtimes \mu_{W_L W_L^\top} \boxtimes \cdots \boxtimes \mu_{D_1^2} \boxtimes \mu_{W_1 W_1^\top},$$
where $\boxtimes$ is the free multiplicative convolution.
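A rough numerical illustration of the dynamical-isometry picture, reusing `haar_orthogonal` and `mlp_jacobian` sketched above: with Haar orthogonal weights, zero biases, and inputs small enough that $\tanh' \approx 1$, the eigenvalues of $JJ^\top$ stay close to one even for moderately deep networks (all sizes are illustrative):

```python
rng = np.random.default_rng(2)
N, L = 512, 10
weights = [haar_orthogonal(N, rng) for _ in range(L)]
biases = [np.zeros(N) for _ in range(L)]
x0 = 0.1 * rng.standard_normal(N)        # small inputs keep tanh near its linear regime

J = mlp_jacobian(x0, weights, biases)
eigs = np.linalg.eigvalsh(J @ J.T)       # spectrum of J J^T
print(eigs.min(), eigs.max())            # both close to 1: (approximate) dynamical isometry
```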
Distribution of $D_\ell^2$
[Figure: Pennington, Schoenholz, Ganguli, AISTATS 2018]
The Limit Spectral Distribution of $JJ^\top$
[Figure: Pennington, Schoenholz, Ganguli, AISTATS 2018]
Neural Tangent Kernel
Under the continuous-time version of GD (gradient flow), the learning dynamics of the parameters is given by:
$$\frac{d\theta_t}{dt} = -\eta \, \nabla_\theta \mathcal{L}(\theta_t).$$
(* The learning rate $\eta$ is fixed.) Then the learning dynamics of the DNN is given by:
$$\frac{d f_{\theta_t}(x)}{dt} = -\eta \sum_{i=1}^{n} \Theta_t(x, x_i)\, \frac{\partial \mathcal{L}}{\partial f(x_i)},$$
where $\Theta_t(x, x') = \langle \nabla_\theta f_{\theta_t}(x), \nabla_\theta f_{\theta_t}(x') \rangle$.
**Informal [Jacot+ NeurIPS2018, Lee+ NeurIPS2019]** Under the wide limit $N \to \infty$, the learning dynamics of the DNN is approximated by the same equation with the kernel frozen at initialization,
$$\frac{d f_{\theta_t}(x)}{dt} \approx -\eta \sum_{i=1}^{n} \Theta_0(x, x_i)\, \frac{\partial \mathcal{L}}{\partial f(x_i)},$$
where the neural tangent kernel is defined as
$$\Theta_0(x, x') = \langle \nabla_\theta f_{\theta_0}(x), \nabla_\theta f_{\theta_0}(x') \rangle.$$
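A minimal sketch of the empirical NTK for a toy one-hidden-layer network $f(x) = v^\top \varphi(Wx)/\sqrt{N}$, with the parameter gradients written out by hand (the architecture and the $1/\sqrt{N}$ scaling are illustrative choices, not the exact setting of the cited papers):

```python
import numpy as np

def ntk_one_hidden(x1, x2, W, v, phi=np.tanh, dphi=lambda h: 1.0 - np.tanh(h) ** 2):
    """Empirical NTK of f(x) = v^T phi(W x) / sqrt(N):
    the inner product of the parameter gradients at x1 and x2, summed over W and v."""
    N = v.shape[0]
    h1, h2 = W @ x1, W @ x2
    term_v = (phi(h1) @ phi(h2)) / N                              # from the gradient w.r.t. v
    term_W = ((v * dphi(h1)) @ (v * dphi(h2))) * (x1 @ x2) / N    # from the gradient w.r.t. W
    return term_v + term_W

rng = np.random.default_rng(3)
N, d = 1024, 16
W = rng.standard_normal((N, d)) / np.sqrt(d)
v = rng.standard_normal(N)
x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
print(ntk_one_hidden(x1, x2, W, v))
```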
Neural Tangent Kernel is A Surrogate Model of DNN+GD
Based on NTK, we can do Bayesian estimation in the same way as with NNGP. Moreover, with NTK we can simulate the gradient descent at any step for an ensemble of networks. [Figure from Google, "Fast and Easy Infinitely Wide Networks with Neural Tangents"]
Applicable to CNN/ResNet
Figure from [Google, "Fast and Easy Infinitely Wide Networks with Neural Tangents"]
Moreover, NTK is applicable to attention: Infinite attention: NNGP and NTK for deep attention networks [https://arxiv.org/abs/2006.10540]
Eigenvalue Spectrum of NTK
"Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks", Z. Fan & Z. Wang, https://arxiv.org/abs/2005.11879. They treat the standard formulation (Gaussian initialization × multiple samples × small output dimension) and obtain a recurrence equation for the limit spectral distribution of the NTK. Figures: the red lines are the theoretical prediction.
One-sample NTK
T.H. & R. Karakida, "The Spectrum of Fisher Information of Deep Networks Achieving Dynamical Isometry", https://arxiv.org/abs/2006.07814, in AISTATS 2021. When the DNN achieves dynamical isometry, the spectrum of the (one-sample × high-dimensional output) "NTK" concentrates around the maximal value, and the maximal value is O(L). (Sketch) Under an assumption of asymptotic freeness, we obtain recursive equations for the spectrum.
NTK & Learning Rate
The spectrum (eigenvalues) of the NTK plays a vital role in tuning the learning dynamics. E.g., if the learning rate is too large relative to the maximal eigenvalue, the learning dynamics does not converge. E.g., the condition number determines the convergence speed.
Red line (the border line of the exploding gradients): this line is predicted by our theory!
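A toy illustration of why the NTK spectrum controls the admissible learning rate: under the linearized dynamics $f_{t+1} = f_t - \eta\, \Theta (f_t - y)$, the error in each eigendirection is multiplied by $(1 - \eta \lambda)$ per step, so the iteration diverges once $\eta > 2/\lambda_{\max}$, and its speed is governed by the condition number. The kernel below is a random stand-in, not an actual NTK:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
A = rng.standard_normal((n, n))
Theta = A @ A.T / n                         # stand-in positive semi-definite "NTK"
y = rng.standard_normal(n)
lam_max = np.linalg.eigvalsh(Theta).max()

for eta in [0.5 / lam_max, 1.9 / lam_max, 2.1 / lam_max]:
    f = np.zeros(n)
    for _ in range(500):
        f = f - eta * Theta @ (f - y)       # linearized (NTK) gradient-descent dynamics
    print(f"eta * lam_max = {eta * lam_max:.2f}, final error = {np.linalg.norm(f - y):.3e}")
```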
Asymptotic Freeness
Asymptotic Freeness and Free Probability Theory
Definition (Asymptotic freeness, C*-version) [Voiculescu '85] Let $(A_1^{(N)}, \dots, A_k^{(N)})_N$ be a family of random matrices together with their adjoints. The family is said to be asymptotically free almost surely if there exist C*-probability spaces $(\mathcal{A}_j, \tau_j)$ and elements $a_j \in \mathcal{A}_j$ so that for any noncommutative polynomial $p$, the following holds almost surely:
$$\lim_{N \to \infty} \operatorname{tr}_N\!\big[p\big(A_1^{(N)}, \dots, A_k^{(N)}, (A_1^{(N)})^*, \dots, (A_k^{(N)})^*\big)\big] = (\tau_1 * \cdots * \tau_k)\big[p(a_1, \dots, a_k, a_1^*, \dots, a_k^*)\big],$$
where $\tau_1 * \cdots * \tau_k$ is the free product of the tracial states.
Example
For $N \in \mathbb{N}$, let
- $W_N$ be a Ginibre or Haar orthogonal random matrix,
- $D_N$ be a constant (deterministic) diagonal matrix with a limit distribution as $N \to \infty$.
Then $W_N$ and $D_N$ are a.s. asymptotically free as $N \to \infty$.
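A quick numerical illustration of one consequence of this freeness: since $\operatorname{tr}_N W_N \to 0$ and $W_N W_N^\top = I$ for Haar orthogonal $W_N$, asymptotic freeness forces $\operatorname{tr}_N(W_N D_N W_N^\top D_N) \to (\operatorname{tr} D)^2$ in the limit. A self-contained sketch (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
for N in [100, 1000, 4000]:
    q, r = np.linalg.qr(rng.standard_normal((N, N)))
    W = q * np.sign(np.diag(r))                 # Haar orthogonal matrix (QR of a Gaussian)
    d = np.linspace(0.0, 1.0, N)                # diagonal with a limiting (uniform) distribution
    D = np.diag(d)
    lhs = np.trace(W @ D @ W.T @ D) / N         # normalized trace tr_N(W D W^T D)
    rhs = d.mean() ** 2                         # (tr_N D)^2, the prediction from freeness
    print(N, round(lhs, 4), round(rhs, 4))      # the two values agree more closely as N grows
```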
Asymptotic Freeness of Jacobians
Let $W^1, \dots, W^L$ be the weight matrices in the MLP and assume they are scaled Haar orthogonal random matrices. (The Gaussian case is treated by: [B. Hanin and M. Nica], [L. Pastur], [G. Yang].)
Theorem [CH22] Assume that $(D_1, \dots, D_L)$ have limit joint moments. Then the layerwise Jacobians
$$D_1 W^1, \; D_2 W^2, \; \dots, \; D_L W^L$$
are asymptotically free as $N \to \infty$, almost surely.
Difficulty: the entries of $D_\ell$ are not independent; they are determined by the forward propagation and hence correlated with the weights $W^1, \dots, W^\ell$.
(Sketch of Proof) Invariance of MLP + taking submatrices. Construct orthogonal matrices $U_\ell$ fixing the hidden vectors, i.e.
$$U_\ell\, x^{\ell} = x^{\ell},$$
and observe that the joint distribution of the Haar orthogonal weights, and hence of the data propagated through the MLP, is invariant under conjugating the weights by such $U_\ell$. Conditioning on the hidden vectors and restricting to their orthogonal complements, we only need to show the asymptotic freeness of the corresponding submatrices, which are again Haar-distributed (of size $N-1$).
Summary
Considering neural networks with random parameters…
- Tuning initialization and learning rate
- Bayesian Estimation with NNGP
- Understanding Dynamics with NTK
- If we focus on the spectrum, free probability appears in the theory.
In Progress
MLP-like NNs were previously used only for toy models, but are now being applied to real-world 2D images and 3D data, e.g. MLP-Mixer, NeRF, etc. They are tractable both theoretically and practically, which makes them just the right next research target! [Mildenhall et al., "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis", ECCV 2020]
In progress: MLP-Mixer as a wide and sparse MLP
T.H. & R. Karakida, "MLP-Mixer as a wide and sparse MLP", arXiv preprint, https://arxiv.org/abs/2306.01470. Multi-layer perceptron (MLP) is a fundamental component of deep learning that has been extensively employed for various problems. However, recent empirical successes of MLP-based architectures, particularly the progress of the MLP-Mixer, have revealed that there is still hidden potential in improving MLPs to achieve better performance. Excluding auxiliary components, the basic block of MLP-Mixer acts on a token-by-channel matrix $X$ by a token-mixing step and a channel-mixing step.
It uses left and right multiplications of matrices. Here we introduce the vectorization and the Kronecker (conjugation) operator:
$$\operatorname{vec}(A X B) = (B^\top \otimes A)\operatorname{vec}(X).$$
Then $\operatorname{vec}(W X) = (I \otimes W)\operatorname{vec}(X)$ and $\operatorname{vec}(X W) = (W^\top \otimes I)\operatorname{vec}(X)$.
Thus MLP-Mixer is a kind of MLP with sparse weights (i.e. a lot of connections are set to zero).
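A small numerical check of this viewpoint (a sketch; shapes and names are illustrative): after vectorizing $X$, the token-mixing left multiplication becomes the Kronecker matrix $I \otimes W$ and the channel-mixing right multiplication becomes $W^\top \otimes I$, both of which have many structural zeros.

```python
import numpy as np

rng = np.random.default_rng(6)
S, C = 4, 3                                  # tokens x channels (illustrative sizes)
X = rng.standard_normal((S, C))
W_tok = rng.standard_normal((S, S))          # token-mixing weight (left multiplication)
W_ch = rng.standard_normal((C, C))           # channel-mixing weight (right multiplication)

vec = lambda M: M.flatten(order="F")         # column-stacking vectorization

left = np.kron(np.eye(C), W_tok)             # X -> W_tok X acts on vec(X) as I (x) W_tok
right = np.kron(W_ch.T, np.eye(S))           # X -> X W_ch  acts on vec(X) as W_ch^T (x) I
print(np.allclose(left @ vec(X), vec(W_tok @ X)))    # True
print(np.allclose(right @ vec(X), vec(X @ W_ch)))    # True

# Fraction of nonzero entries: 1/C and 1/S, i.e. the equivalent MLP weights are sparse.
print((left != 0).mean(), (right != 0).mean())
```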
Even if we destroy the architecture of MLP-Mixer by replacing its structured connectivity with uniformly distributed random permutations (RP), the accuracy increases as the width increases. We experimentally confirmed that the following hypothesis of Golubeva et al. also holds for MLP-Mixer: an increase in the width while maintaining a fixed number of weight parameters leads to an improvement in test accuracy.
A. Golubeva et al., "Are wider nets better given the same number of parameters?", ICLR 2021.