Free Probability Theory (FPT) provides rich knowledge for handling mathematical difficulties caused by random matrices appearing in research related to deep neural networks (DNNs). However, the critical assumption of asymptotic freeness of the layerwise Jacobian has not been proven mathematically so far. In this work, we prove the asymptotic freeness of layerwise Jacobians of multilayer perceptrons (MLPs). A key to the proof is an invariance property of the MLP.
Asymptotic Freeness of Layerwise Jacobians Caused by Invariance of Multilayer Perceptron
Commun. Math. Phys., 2022, https://link.springer.com/article/10.1007/s00220-022-04441-7.
Joint work with Benoit Collins.
Supported by JST ACT-X and JSPS Sakura Program.
Tomohiro Hayase.
Affiliation: Cluster, Inc., Metaverse Lab.
Talk at the Japan-China International Conference on Matrix Theory with Applications, Dec. 16-18, 2022.
Table of Contents
- Overview: deep neural network and Gaussian process
- Jacobian: stability of DNNs and random matrices
- NTK: training dynamics and random matrices
- Asymptotic Freeness: main theorem, asymptotic freeness of Jacobians
- Summary
Overview
Multilayer Perceptron
[Figure: https://www.javatpoint.com/multi-layer-perceptron-in-tensorflow]
Let $n_0, n_1, \dots, n_L \in \mathbb{N}$.
Parameters:
$$\theta = (W_\ell, b_\ell)_{\ell=1,\dots,L}, \quad W_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}, \quad b_\ell \in \mathbb{R}^{n_\ell}.$$
Forward propagation: for $x \in \mathbb{R}^{n_0}$, set $x_0 = x$ and inductively
$$h_\ell = W_\ell x_{\ell-1} + b_\ell, \quad x_\ell = \varphi(h_\ell) := \left(\varphi(h_{\ell,i})\right)_{i \in [n_\ell]}.$$
Finally, define the output by $f_\theta(x) = h_L$.
$\varphi$: activation function.
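As a concrete reference, here is a minimal NumPy sketch of the forward propagation above. The widths, the tanh activation, and the Gaussian initialization are illustrative choices, not taken from the slides:

```python
import numpy as np

def mlp_forward(x, weights, biases, phi=np.tanh):
    """Forward propagation: h_l = W_l x_{l-1} + b_l, x_l = phi(h_l)."""
    for l, (W, b) in enumerate(zip(weights, biases), start=1):
        h = W @ x + b
        # The output f_theta(x) = h_L is the last pre-activation.
        x = phi(h) if l < len(weights) else h
    return x

# Illustrative widths n_0, ..., n_L and Gaussian initialization.
rng = np.random.default_rng(0)
dims = [4, 100, 100, 3]
weights = [rng.normal(0, 1 / np.sqrt(m), size=(n, m))
           for m, n in zip(dims[:-1], dims[1:])]
biases = [np.zeros(n) for n in dims[1:]]
out = mlp_forward(rng.normal(size=dims[0]), weights, biases)
print(out.shape)  # (3,)
```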
Deep Learning
Generally, a standard formulation of supervised deep learning is as follows:
- We are given a finite set $D$ of input/output data pairs.
- We are given a deep neural network (DNN), satisfying (one of) the following conditions:
  - It is a parameterized family of transformations, mapping a real vector to a real vector.
  - It is a composition of simple parametrized transformations (e.g. linear transformations and non-linear elementwise functions).
- We are given an objective function, e.g. the mean squared loss
$$\mathcal{L}(x, y, \theta) = \frac{1}{2B} \sum_j \left(f_\theta(x)_j - y_j\right)^2.$$
Optimization
Gradient descent
$$\theta_{t+1} = \theta_t - \eta_t \nabla_\theta \mathcal{L}(x, y, \theta_t)$$
Initialization of Parameters
e.g. Gaussian (Ginibre) random matrix:
$$(W_\ell)_{i,j} \sim \mathcal{N}(0, \sigma_w^2 / N), \quad \text{i.i.d.}$$
e.g. Haar-distributed orthogonal matrix:
$$W_\ell = \sigma_w O, \quad O \sim \text{Haar orthogonal distribution}.$$
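A sketch of sampling the two initializations in NumPy ($\sigma_w$ and $N$ are placeholder values); the QR-with-sign-fix recipe is a standard way to draw an exactly Haar-distributed orthogonal matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma_w = 256, 1.0

# Gaussian (Ginibre) initialization: i.i.d. N(0, sigma_w^2 / N) entries.
W_gauss = rng.normal(0.0, sigma_w / np.sqrt(N), size=(N, N))

# Haar orthogonal initialization: QR of a Ginibre matrix; multiplying the
# columns of Q by sign(diag(R)) makes the law of Q exactly Haar.
Z = rng.normal(size=(N, N))
Q, R = np.linalg.qr(Z)
W_haar = sigma_w * (Q * np.sign(np.diag(R)))
```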
The Infinite-Dimensional Limit is Gaussian
[Figure: https://ai.googleblog.com/2020/03/fast-and-easy-infinitely-wide-networks.html]
Neural Network Gaussian Process (NNGP)
Consider two inputs $x, x'$ and the corresponding hidden units $x_\ell, x'_\ell$ and $h_\ell, h'_\ell$ in the MLP.
Taking the wide limit, we have [Lee+ ICLR 2018]
$$(h_\ell, h'_\ell) \sim \mathcal{N}\left(0,\; \sigma_w^2 K_\ell(x, x') + \sigma_b^2\right)$$
where
$$K_\ell(x, x') := \lim_{n_\ell \to \infty} \frac{1}{n_\ell} \sum_{j=1}^{n_\ell} x_{\ell,j}\, x'_{\ell,j}.$$
We have the following kernel propagation:
$$K_{\ell+1}(x, x') = \int \varphi(z_1)\,\varphi(z_2)\, p_N(z)\, dz$$
where
$$p_N = \mathcal{N}\left(0,\; \sigma_w^2 \begin{pmatrix} K_\ell(x, x) & K_\ell(x, x') \\ K_\ell(x, x') & K_\ell(x', x') \end{pmatrix} + \sigma_b^2\right).$$
- For some activations, we can compute the integral explicitly.
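One way to evaluate the kernel propagation numerically is Monte Carlo integration over the bivariate Gaussian $p_N$. A minimal sketch (erf admits a closed form; sampling works for any activation, and tanh plus the initial kernel values here are illustrative assumptions):

```python
import numpy as np

def next_kernel(K_xx, K_xxp, K_xpxp, sigma_w=1.0, sigma_b=0.0,
                phi=np.tanh, n_samples=200_000, seed=0):
    """Monte Carlo estimate of K_{l+1}(x, x') = E[phi(z1) phi(z2)] with
    (z1, z2) centered Gaussian of covariance sigma_w^2 * K_l + sigma_b^2."""
    rng = np.random.default_rng(seed)
    cov = sigma_w**2 * np.array([[K_xx, K_xxp],
                                 [K_xxp, K_xpxp]]) + sigma_b**2
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n_samples)
    return float(np.mean(phi(z[:, 0]) * phi(z[:, 1])))

# Propagate the kernel through a few layers for two unit-norm inputs
# with initial overlap 0.5 (illustrative values).
K_xx, K_xxp, K_xpxp = 1.0, 0.5, 1.0
for _ in range(3):
    K_xx, K_xxp, K_xpxp = (next_kernel(K_xx, K_xx, K_xx),
                           next_kernel(K_xx, K_xxp, K_xpxp),
                           next_kernel(K_xpxp, K_xpxp, K_xpxp))
    print(K_xx, K_xxp, K_xpxp)
```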
NNGP Estimation
Generally, consider $B$ samples. Let $X = (x^{(a)})_{a=1,\dots,B}$, $Y = (y^{(a)})_{a=1,\dots,B}$ be the input/output samples.
$$K(x^*, X) := \left(K_L(x^*, x^{(a)})\right)_{a=1,\dots,B} \in \mathbb{R}^B$$
$$K(X, X) := \left(K_L(x^{(a)}, x^{(b)})\right)_{a,b} \in M_B(\mathbb{R})$$
Posterior mean/variance: for a new input $x^*$,
$$m(y^*) = K(x^*, X)\, K(X, X)^{-1}\, Y$$
$$v(y^*) = K(x^*, x^*) - K(x^*, X)\, K(X, X)^{-1}\, K(x^*, X)^\top$$
[Lee et al., "Deep Neural Networks as Gaussian Processes", ICLR 2018]
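These are the standard GP regression equations; a minimal sketch with a generic kernel function `k` standing in for the limit kernel $K_L$ (the RBF kernel and toy data below are hypothetical, just to make the snippet runnable):

```python
import numpy as np

def nngp_posterior(x_star, X, Y, k, jitter=1e-8):
    """GP posterior mean/variance at x_star for the limit kernel k = K_L.
    X: (B, d) inputs, Y: (B, out_dim) outputs, k(a, b): scalar kernel."""
    B = len(X)
    K_sX = np.array([k(x_star, x) for x in X])          # K(x*, X), shape (B,)
    K_XX = np.array([[k(a, b) for b in X] for a in X])  # K(X, X), shape (B, B)
    alpha = np.linalg.solve(K_XX + jitter * np.eye(B), K_sX)
    mean = alpha @ Y                                    # K(x*,X) K(X,X)^{-1} Y
    var = k(x_star, x_star) - K_sX @ alpha
    return mean, var

# Hypothetical usage with an RBF kernel as a stand-in for K_L.
rng = np.random.default_rng(0)
k = lambda a, b: np.exp(-np.sum((a - b)**2) / 2)
X = rng.normal(size=(5, 3))
Y = X @ np.ones((3, 1))
print(nngp_posterior(X[0], X, Y, k))
```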
Jacobian
Vanishing/Exploding Gradients
The optimization of a DNN needs derivatives with respect to its parameters.
Since a DNN is a composition of functions, these derivatives are computed by the chain rule. The input-output Jacobian is defined as
$$J = \frac{\partial f_\theta(x)}{\partial x} = \frac{\partial h_L}{\partial x}.$$
In the case of the MLP, we have
$$J = W_L D_{L-1} \cdots W_2 D_1 W_1,$$
where
$$D_\ell = \frac{\partial x_\ell}{\partial h_\ell} = \mathrm{diag}\left(\varphi'(h_{\ell,1}), \dots, \varphi'(h_{\ell,n_\ell})\right).$$
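A sketch computing the input-output Jacobian directly from this product formula (biases omitted and tanh used for illustration; the widths are placeholders):

```python
import numpy as np

def input_output_jacobian(x, weights, phi=np.tanh,
                          dphi=lambda h: 1.0 - np.tanh(h)**2):
    """J = W_L D_{L-1} ... W_2 D_1 W_1 with D_l = diag(phi'(h_l)).
    Biases are omitted for brevity."""
    Ds = []
    for W in weights[:-1]:          # hidden layers l = 1, ..., L-1
        h = W @ x
        Ds.append(np.diag(dphi(h)))
        x = phi(h)
    J = weights[0]                  # start from W_1 ...
    for D, W in zip(Ds, weights[1:]):
        J = W @ (D @ J)             # ... then multiply D_l, W_{l+1}, etc.
    return J

rng = np.random.default_rng(0)
dims = [4, 50, 50, 3]
weights = [rng.normal(0, 1 / np.sqrt(m), size=(n, m))
           for m, n in zip(dims[:-1], dims[1:])]
print(input_output_jacobian(rng.normal(size=dims[0]), weights).shape)  # (3, 4)
```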
Dynamical Isometry
A DNN is said to achieve dynamical isometry if the eigenvalue distribution of $JJ^\top$ is concentrated around one.
Dynamical isometry prevents exploding/vanishing gradients.
[Pennington+, AISTATS 2018; CH, CMP 2022] If we initialize the parameters as Haar orthogonal matrices and choose an appropriate activation function, then the DNN can achieve dynamical isometry.
Let $\mu_L, \nu$ be the limit spectral distributions of $JJ^\top$ and $D^2$ in the wide limit, respectively.
Under the assumption of asymptotic freeness of the Jacobians,
$$\mu_L = \left[(\sigma_w^2\,\cdot\,)_*\, \nu\right]^{\boxtimes L},$$
where $\boxtimes$ is the free multiplicative convolution and $(\sigma_w^2\,\cdot\,)_*$ denotes the pushforward by the dilation $t \mapsto \sigma_w^2 t$.
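A quick empirical check of this picture, under assumptions chosen for the demo: Haar orthogonal weights with $\sigma_w = 1$ and the hardtanh activation, with a small input so pre-activations stay in the linear regime:

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 512, 10

def haar_orthogonal(n):
    Z = rng.normal(size=(n, n))
    Q, R = np.linalg.qr(Z)
    return Q * np.sign(np.diag(R))  # exact Haar distribution

# hardtanh: phi'(h) = 1 on [-1, 1] and 0 outside, so small pre-activations
# keep every D_l close to the identity.
phi = lambda h: np.clip(h, -1.0, 1.0)
dphi = lambda h: (np.abs(h) <= 1.0).astype(float)

x = 0.1 * rng.normal(size=N)        # small input: stay in the linear regime
J = np.eye(N)
for _ in range(L - 1):
    W = haar_orthogonal(N)
    h = W @ x
    J = np.diag(dphi(h)) @ (W @ J)
    x = phi(h)
J = haar_orthogonal(N) @ J          # last layer: no activation after h_L
eigs = np.linalg.eigvalsh(J @ J.T)
print(eigs.min(), eigs.max())       # both close to 1: dynamical isometry
```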
Distribution of $D^2$
[Figure: Pennington, Schoenholz, Ganguli, AISTATS 2018]
The Limit Spectral Distribution $\mu_L$ of $JJ^\top$
[Figure: Pennington, Schoenholz, Ganguli, AISTATS 2018]
Neural Tangent Kernel
Under the continuous-time version of GD (gradient flow), the learning dynamics of the parameters is given by:
$$\frac{d\theta_t}{dt} = \eta\, (\nabla_\theta f_{\theta_t})^\top (y - f_{\theta_t})$$
(* The learning rate $\eta$ is fixed.) Then the learning dynamics of the DNN is given by:
$$\frac{d f_{\theta_t}}{dt} = \eta\, \Theta_t\, (y - f_{\theta_t})$$
where
$$\Theta_t = \nabla_\theta f_{\theta_t}\, (\nabla_\theta f_{\theta_t})^\top.$$
**Informal [Jacot+ NeurIPS 2018, Lee+ NeurIPS 2019]** Under the wide limit $n \to \infty$, the learning dynamics of the DNN is approximated by
$$\frac{d f_{\theta_t}}{dt} = \eta\, \Theta\, (y - f_{\theta_t})$$
where the neural tangent kernel is defined as
$$\Theta := \lim_{n_2, \dots, n_{L-1} \to \infty} \Theta_0.$$
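For intuition, the one-sample empirical NTK $\Theta_0$ of a small network can be approximated with finite-difference parameter gradients. A rough sketch (all sizes, the tanh activation, and the step size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
dims = [3, 16, 16, 2]               # small illustrative widths
theta = [rng.normal(0, 1 / np.sqrt(m), size=(n, m))
         for m, n in zip(dims[:-1], dims[1:])]

def f(theta, x):
    for W in theta[:-1]:
        x = np.tanh(W @ x)
    return theta[-1] @ x

def param_jacobian(theta, x, eps=1e-5):
    """Central finite differences of f w.r.t. every weight entry."""
    cols = []
    for W in theta:
        for idx in np.ndindex(W.shape):
            W[idx] += eps
            fp = f(theta, x)
            W[idx] -= 2 * eps
            fm = f(theta, x)
            W[idx] += eps           # restore the original value
            cols.append((fp - fm) / (2 * eps))
    return np.stack(cols, axis=1)   # shape (out_dim, #params)

x = rng.normal(size=dims[0])
G = param_jacobian(theta, x)
Theta_0 = G @ G.T                   # one-sample empirical NTK
print(np.linalg.eigvalsh(Theta_0))
```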
The Neural Tangent Kernel is a Surrogate Model of DNN + GD
Applicable to CNN/ResNet
Figure from [Google, "Fast and Easy Infinitely Wide Networks with Neural Tangents"]
Moreover, the NTK is applicable to attention: "Infinite attention: NNGP and NTK for deep attention networks" [https://arxiv.org/abs/2006.10540]
Eigenvalue Spectrum of NTK
"Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks", https://arxiv.org/abs/2005.11879
They treat the standard formulation (Gaussian initialization × multiple samples × small output dimension) and obtain:
One-sample NTK
"The Spectrum of Fisher Information of Deep Networks Achieving Dynamical Isometry", https://arxiv.org/abs/2006.07814 [HK20]
When the DNN achieves dynamical isometry, the spectrum of the (one-sample × high-dimensional output) "NTK" concentrates around the maximal value, and the maximal value is $O(L)$.
Under an assumption of asymptotic freeness, we have the following recursive equations:
$$\Theta_{\ell+1} = q_\ell + W_{\ell+1} D_\ell \Theta_\ell D_\ell W_{\ell+1}^\top$$
$$\mu_{\ell+1} = (q_\ell + \sigma_{\ell+1}^2\,\cdot\,)_*\,\left(\nu_\ell \boxtimes \mu_\ell\right)$$
NTK & Learning Rate
The spectrum (eigenvalues) of the NTK plays a vital role in tuning the learning dynamics.
e.g. $\eta > 1/\lambda_{\max}(\Theta)$ $\Longrightarrow$ the learning dynamics does not converge.
e.g. The condition number $c = \lambda_{\min}/\lambda_{\max}$ determines the convergence speed.
Red line (the border line of exploding gradients):
This line is predicted by our theory!
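A sketch of the linearized NTK dynamics illustrating how $\lambda_{\max}(\Theta)$ sets the scale of the largest stable learning rate. Note the hedge: for this particular Euler discretization the residual along an eigenvector scales by $1 - \eta\lambda$ per step, so the hard divergence boundary is $2/\lambda_{\max}$; the exact constant depends on the convention. The random PSD matrix below is a stand-in for an NTK:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 8
A = rng.normal(size=(k, k))
Theta = A @ A.T                       # PSD stand-in for a fixed NTK
lam_max = np.linalg.eigvalsh(Theta).max()
y = rng.normal(size=k)

def final_residual(eta, steps=200):
    f = np.zeros(k)
    for _ in range(steps):
        # Euler discretization of df/dt = eta * Theta (y - f); the residual
        # component along an eigenvector of Theta scales by (1 - eta*lambda).
        f = f + eta * Theta @ (y - f)
    return np.linalg.norm(y - f)

print(final_residual(0.5 / lam_max))  # converges (rate set by conditioning)
print(final_residual(2.5 / lam_max))  # |1 - eta*lam_max| > 1: blows up
```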
Asymptotic Freeness
Asymptotic Freeness and Free Probability Theory
Definition (Asymptotic freeness, $C^*$-version) [Voiculescu '85]
Let $(A_j(n), A_j(n)^*)_{j \in J}$ be a family of $n \times n$ random matrices and their adjoints.
The family is said to be asymptotically free almost surely if there exist $C^*$-probability spaces $(\mathcal{A}_j, \tau_j)_{j \in J}$ and elements $(a_j \in \mathcal{A}_j)_{j \in J}$ such that for any $Q \in \mathbb{C}\langle X_j, X_j^* \mid j \in J \rangle$, the following holds:
$$\lim_{n \to \infty} \mathrm{tr}_n\left[Q\left(A_j(n), A_j(n)^* \mid j \in J\right)\right] = \left(\ast_{j \in J}\, \tau_j\right)\left[Q\left(a_j, a_j^* \mid j \in J\right)\right]$$
Example
For $N \in \mathbb{N}$, let
- $W(N)$ be a Ginibre or Haar orthogonal random matrix,
- $D(N)$ be a constant diagonal matrix with a limit distribution as $N \to \infty$.
Then
$(W, W^*)$ and $D$ are almost surely asymptotically free as $N \to \infty$.
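Asymptotic freeness can be probed numerically through mixed moments. For instance, if $(W, W^*)$ and $D$ are free and $\mathrm{tr}(W) \to 0$ (as for Ginibre or Haar orthogonal), the freeness rules give $\mathrm{tr}(W D W^* D) \to \mathrm{tr}(WW^*)\,\mathrm{tr}(D)^2$. A sketch with illustrative choices of $W$ and $D$:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000
tr = lambda A: np.trace(A) / N            # normalized trace tr_N

W = rng.normal(0, 1 / np.sqrt(N), size=(N, N))          # Ginibre
D = np.diag(np.where(np.arange(N) < N // 2, 1.0, 2.0))  # tr(D) = 1.5

lhs = tr(W @ D @ W.T @ D)                 # mixed moment tr_N(W D W^T D)
rhs = tr(W @ W.T) * tr(D)**2              # freeness prediction (tr W -> 0)
print(lhs, rhs)                           # both close to 2.25 for large N
```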
Asymptotic Freeness of Jacobians
Let $W_\ell, D_\ell$ ($\ell = 1, 2, \dots, L$) be the random matrices determined by the MLP.
Theorem [CH22]
Assume that $D_1, \dots, D_{L-1}$ have limit joint moments. Then
$$(W_1, W_1^\top), \dots, (W_L, W_L^\top), (D_1, \dots, D_{L-1})$$
are asymptotically free as $n \to \infty$, almost surely.
Difficulty: the entries of $D_\ell$ and $W_\ell$ are not independent, since
$$D_\ell = \mathrm{diag}(\varphi'(h_\ell)), \quad h_\ell = W_\ell x_{\ell-1}.$$
(Sketch of Proof)
Invariance of MLP + Taking submatrix
Construct an orthogonal matrix $U_\ell$ fixing $x_\ell$, i.e.
$$U_\ell x_\ell = x_\ell,$$
and
$$U_\ell \big|_{(x_\ell)^\perp} \sim (N-1) \times (N-1) \text{ Haar orthogonal},$$
with
$$(U_0, \dots, U_\ell) \perp\!\!\!\perp (W_{\ell+1}, \dots, W_L)$$
for $\ell = 0, \dots, L-1$. Then we only need to show the asymptotic freeness of the $(N-1) \times (N-1)$ submatrices of
$$(U_0 W_1, \dots, U_{L-1} W_L), \quad (D_1, \dots, D_{L-1}).$$
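Such a $U$ can be sampled directly. A sketch of one standard construction (not necessarily the one used in the paper): conjugate $\mathrm{diag}(1, O)$, with $O$ Haar on the $(N-1)$-dimensional complement, by any orthonormal basis whose first vector is $x/\|x\|$:

```python
import numpy as np

rng = np.random.default_rng(0)

def haar_orthogonal(n):
    Z = rng.normal(size=(n, n))
    Q, R = np.linalg.qr(Z)
    return Q * np.sign(np.diag(R))

def fixing_haar(x):
    """Orthogonal U with U x = x, Haar-distributed on the complement of x."""
    N = len(x)
    # Orthonormal basis B whose first column is x / ||x||.
    M = np.column_stack([x, rng.normal(size=(N, N - 1))])
    B, _ = np.linalg.qr(M)
    B[:, 0] = x / np.linalg.norm(x)   # fix the sign of the first column
    # U = B diag(1, O) B^T with O Haar on the (N-1)-dim complement.
    core = np.zeros((N, N))
    core[0, 0] = 1.0
    core[1:, 1:] = haar_orthogonal(N - 1)
    return B @ core @ B.T

x = rng.normal(size=100)
U = fixing_haar(x)
print(np.linalg.norm(U @ x - x))               # ~ 0: U fixes x
print(np.linalg.norm(U.T @ U - np.eye(100)))   # ~ 0: U is orthogonal
```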
Summary
Considering neural networks with random parameters leads to:
- Tuning initialization and learning rate
- Bayesian estimation with the NNGP
- Understanding training dynamics with the NTK
Background: Random Matrix Theory and Free Probability Theory
Future Work
MLP-like NNs were previously only used for toy models, but are now being applied to real-world images and 3D data.
e.g. gMLP, NeRF, etc. They are easy to handle both theoretically and computationally: just the right next research target!
[Mildenhall et al., "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis", ECCV 2020]