Asymptotic Freeness of Jacobians of MLP

2023-01-02 explainer


Free Probability Theory (FPT) provides powerful tools for handling the mathematical difficulties caused by random matrices that appear in research on deep neural networks (DNNs). However, the critical assumption that the layerwise Jacobians are asymptotically free had not been proven mathematically. In this work, we prove the asymptotic freeness of the layerwise Jacobians of a multilayer perceptron (MLP). A key to the proof is an invariance of the MLP.

Asymptotic Freeness of Layerwise Jacobians Caused by Invariance of Multilayer Perceptron

Comm. In Math. Phys. 2022, https://link.springer.com/article/10.1007/s00220-022-04441-7.

Joint work with Benoit Collins. Supported by JST ACT-X and JSPS Sakura Program.

Tomohiro Hayase.
Affiliation: Cluster, Inc., Metaverse Lab. Talk given December 16–18, 2022, at the Japan-China International Conference on Matrix Theory with Applications.


Table of Contents

  1. Overview
    Deep neural network and Gaussian process
  2. Jacobian
    Stability of DNN and random matrices
  3. NTK
    Training dynamics and random matrices
  4. Asymptotic Freeness
    Main theorem: asymptotic freeness of Jacobians
  5. Summary

Overview


Multilayer Perceptron

[Figure: https://www.javatpoint.com/multi-layer-perceptron-in-tensorflow] Let $n_0, n_1, \dots, n_L \in \mathbb{N}$. Parameters:

\theta = (W_\ell, b_\ell)_{\ell=1,\dots,L}, \quad W_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}}, \quad b_\ell \in \mathbb{R}^{n_\ell}.

Forward propagation: for $x \in \mathbb{R}^{n_0}$, set $x_0 = x$ and inductively

h_\ell = W_\ell x_{\ell-1} + b_\ell, \quad x_\ell = \varphi(h_\ell) := (\varphi(h_{\ell,i}))_{i \in [n_\ell]}.

Finally, define the output by $f_\theta(x) = h_L$.

$\varphi$: Activation Function
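To make the notation concrete, here is a minimal NumPy sketch of this forward propagation; the widths, the tanh activation, and the Gaussian initialization are illustrative choices, not anything prescribed above.

```python
import numpy as np

def init_mlp(widths, sigma_w=1.0, sigma_b=0.1, rng=np.random.default_rng(0)):
    """Gaussian initialization: (W_ell)_{i,j} ~ N(0, sigma_w^2 / n_{ell-1}), b_ell ~ N(0, sigma_b^2)."""
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(0.0, sigma_w / np.sqrt(n_in), size=(n_out, n_in))
        b = rng.normal(0.0, sigma_b, size=n_out)
        params.append((W, b))
    return params

def forward(params, x, phi=np.tanh):
    """h_ell = W_ell x_{ell-1} + b_ell, x_ell = phi(h_ell); the output is f_theta(x) = h_L."""
    for ell, (W, b) in enumerate(params, start=1):
        h = W @ x + b
        x = phi(h) if ell < len(params) else h   # no activation after the last layer
    return x

widths = [3, 100, 100, 2]                        # n_0, n_1, n_2, n_3 (illustrative)
params = init_mlp(widths)
print(forward(params, np.ones(3)))
```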


Deep Learning

Generally, a standard formulation of supervised deep learning is as follows:

  1. We are given a finite set of input/output data pairs $\mathcal{D}$.
  2. We are given a deep neural network (DNN), which satisfies the following conditions:
    1. It is a parameterized family of transformations, each mapping a real vector to a real vector.
    2. It is a composition of basic parametrized transformations (e.g. linear transformations and elementwise non-linear functions).
  3. We are given an objective function, e.g. the mean squared loss
L(x,y, \theta) = \frac{1}{2B}\sum_{j} ( f_\theta(x)_j - y_j)^2

Optimization

Gradient descent

\theta_{t+1} = \theta_t - \eta_t \frac{\partial}{\partial \theta} L(x, y, \theta)
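A minimal, self-contained sketch of this update rule on a toy linear model (the data, model, and learning rate below are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(8, 3)), rng.normal(size=(8, 2))   # B = 8 toy input/output pairs
theta = rng.normal(size=(2, 3))                           # parameters of a linear model f_theta(x) = theta x

def loss(theta):
    """Mean squared loss (1 / 2B) * sum_j (f_theta(x)_j - y_j)^2, summed over the batch."""
    return 0.5 * np.mean(np.sum((X @ theta.T - Y) ** 2, axis=1))

eta = 0.1                                                 # learning rate eta_t (kept constant here)
for t in range(100):
    grad = (X @ theta.T - Y).T @ X / len(X)               # dL/dtheta for this linear model
    theta = theta - eta * grad                            # theta_{t+1} = theta_t - eta_t * dL/dtheta
print(loss(theta))
```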

Initialization of Parameters

e.g. Gaussian (Ginibre) random matrix

(W_\ell)_{i,j} \sim \mathcal{N}(0, \sigma_w^2/N), \ \mathrm{i.i.d.}

e.g. Haar distributed orthogonal matrix

W_\ell = \sigma_w O, \quad O \sim \mathrm{Haar\ Orthogonal}.
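One standard way to sample a Haar orthogonal matrix for this initialization is to QR-decompose a Ginibre matrix and correct the signs using the diagonal of R; a sketch (the scale sigma_w is an illustrative choice):

```python
import numpy as np

def haar_orthogonal(n, rng=np.random.default_rng(0)):
    """Sample an n x n orthogonal matrix distributed according to the Haar measure on O(n)."""
    G = rng.normal(size=(n, n))              # Ginibre matrix: i.i.d. Gaussian entries
    Q, R = np.linalg.qr(G)
    return Q * np.sign(np.diag(R))           # sign correction makes the law exactly Haar

sigma_w = 1.2                                # illustrative scale
W = sigma_w * haar_orthogonal(500)           # W_ell = sigma_w * O
print(np.allclose(W.T @ W, sigma_w**2 * np.eye(500)))
```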

The Infinite-dimensional Limit is Gaussian

[Figure: https://ai.googleblog.com/2020/03/fast-and-easy-infinitely-wide-networks.html]


Neural Network Gaussian Process (NNGP)

Consider two inputs $x, x^\prime$ and the corresponding hidden units $x_\ell, x^\prime_\ell$ and $h_\ell, h^\prime_\ell$ in the MLP. Taking the wide limit, we have [Lee+ ICLR2018]

(h_\ell, h_\ell^\prime) \sim \mathcal{N}(0, \sigma_w^2 K_\ell(x, x^\prime) + \sigma_b^2)

where

K_\ell(x, x^\prime) := \lim_{n_\ell \to \infty} \frac{1}{n_\ell} \sum_{j=1}^{n_\ell} x_{\ell,j} x^\prime_{\ell,j}.

We have the following Kernel Propagation:

K_{\ell+1}(x, x^\prime) = \int \varphi(z_1)\varphi(z_2)\, p_\mathcal{N}(z)\, dz

where,

p_\mathcal{N} = \mathcal{N}\!\left( 0,\ \sigma_w^2 \begin{pmatrix} K_\ell(x,x) & K_\ell(x,x^\prime) \\ K_\ell(x,x^\prime) & K_\ell(x^\prime,x^\prime) \end{pmatrix} + \sigma_b^2 \right)
  • For some activation, we can compute the integral explicitly.

NNGP Estimation

Generally, consider $B$ samples. Let $X = (x(a))_{a=1,\dots,B}$ and $Y = (y(a))_{a=1,\dots,B}$ be the input/output samples.

K(x^*, X) := (K_L(x^*, x(a)))_{a=1, \dots, B} \in \mathbb{R}^B, \quad K(X,X) := (K_L(x(a), x(b)))_{a,b} \in M_B(\mathbb{R})

Posterior mean/variance: for a new input $x^*$,

m(y^*) = K(x^*, X) K(X,X)^{-1} Y, \quad v(y^*) = K(x^*, x^*) - K(x^*, X) K(X,X)^{-1} K(x^*, X)^\top

[Lee et al., "Deep Neural Networks as Gaussian Processes", ICLR 2018]
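These posterior formulas are the standard Gaussian-process regression formulas; the minimal sketch below uses an RBF kernel as a stand-in for the NNGP kernel $K_L$ (which would come from the kernel propagation above), so all choices here are illustrative.

```python
import numpy as np

def k(a, b, scale=1.0):
    """Placeholder RBF kernel; in the NNGP setting this would be K_L."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * scale**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))                    # B = 10 training inputs
Y = np.sin(X[:, 0])                             # toy scalar targets
x_star = rng.normal(size=3)                     # new input x*

K_XX = np.array([[k(a, b) for b in X] for a in X])          # K(X, X)
k_star = np.array([k(x_star, a) for a in X])                # K(x*, X)
K_inv = np.linalg.inv(K_XX + 1e-8 * np.eye(len(X)))         # small jitter for numerical stability

mean = k_star @ K_inv @ Y                                   # m(y*) = K(x*, X) K(X, X)^{-1} Y
var = k(x_star, x_star) - k_star @ K_inv @ k_star           # v(y*) = K(x*, x*) - K(x*, X) K(X, X)^{-1} K(X, x*)
print(mean, var)
```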

Jacobian


Vanishing/Exploding Gradients

The optimization of a DNN needs its parameter derivatives. Since a DNN is a composition of functions, these derivatives are computed by the chain rule. The input-output Jacobian is defined as

J = \frac{\partial f_\theta(x)}{\partial x} = \frac{\partial h_L}{\partial x}

In the case of MLP, we have

J = W_L D_{L-1} \dots W_2 D_1 W_1,

where

D_\ell = \frac{\partial x_\ell}{\partial h_\ell} = \mathrm{diag}(\varphi^\prime(h_{\ell,1}), \dots, \varphi^\prime(h_{\ell,n_\ell}))
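A short sketch that builds J from these chain-rule factors and checks one column against a finite-difference derivative (the widths, activation, and initialization are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [50, 50, 50, 50]                               # n_0, ..., n_L (illustrative)
phi, dphi = np.tanh, lambda t: 1.0 - np.tanh(t) ** 2
Ws = [rng.normal(0, 1 / np.sqrt(m), size=(n, m)) for m, n in zip(widths[:-1], widths[1:])]

def f(x):
    for W in Ws[:-1]:
        x = phi(W @ x)
    return Ws[-1] @ x                                   # f_theta(x) = h_L

# J = W_L D_{L-1} ... W_2 D_1 W_1, with D_ell = diag(phi'(h_ell))
x0 = rng.normal(size=widths[0])
x, J, h = x0, None, None
for W in Ws:
    J = W if J is None else W @ np.diag(dphi(h)) @ J
    h = W @ x
    x = phi(h)

# check column j of J against a central finite difference
j, eps = 7, 1e-6
e = np.zeros(widths[0]); e[j] = eps
print(np.allclose(J[:, j], (f(x0 + e) - f(x0 - e)) / (2 * eps), atol=1e-5))
```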

Dynamical Isometry

A DNN is said to achieve dynamical isometry if the eigenvalue distribution of $JJ^\top$ is concentrated around one. Dynamical isometry prevents exploding/vanishing gradients.

[Pennington+, AISTATS2018; CH, CIMP2022] If we set the initialization of the parameters to be Haar orthogonal and choose an appropriate activation function, then we can make the DNN achieve dynamical isometry.

Set μL,ν\mu_L, \nu be limit spectral distributions of JJ⊀,D2JJ^\top, D^2 as wide limits respectively.

Under the assumption of asymptotic freeness of Jacobians,

\mu_L = [(\sigma^2 \cdot)_* \nu]^{\boxtimes L}

where $\boxtimes$ is the free multiplicative convolution and $(\sigma^2 \cdot)_* \nu$ denotes the pushforward of $\nu$ by multiplication with $\sigma^2$.
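To get a feel for the spectrum of $JJ^\top$, one can sample it directly; the sketch below compares Gaussian and Haar orthogonal weights with a tanh activation, purely as an illustration of how the eigenvalue spread depends on the setup (it is not tuned to actually reach dynamical isometry).

```python
import numpy as np

def jacobian_eigs(n=400, L=10, sigma_w=1.0, orthogonal=True, seed=1):
    """Eigenvalues of J J^T for one random MLP at initialization (phi = tanh)."""
    rng = np.random.default_rng(seed)
    phi, dphi = np.tanh, lambda t: 1.0 - np.tanh(t) ** 2
    x, J, h = rng.normal(size=n), None, None
    for _ in range(L):
        if orthogonal:
            Q, R = np.linalg.qr(rng.normal(size=(n, n)))
            W = sigma_w * Q * np.sign(np.diag(R))                     # scaled Haar orthogonal
        else:
            W = rng.normal(0.0, sigma_w / np.sqrt(n), size=(n, n))    # Ginibre
        J = W if J is None else W @ np.diag(dphi(h)) @ J              # J = W_L D_{L-1} ... D_1 W_1
        h = W @ x
        x = phi(h)
    return np.linalg.eigvalsh(J @ J.T)

for flag in (True, False):
    ev = jacobian_eigs(orthogonal=flag)
    print("orthogonal" if flag else "gaussian", float(ev.min()), float(ev.max()))
```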

Distribution of $D^2$

[Figure: Pennington, Schoenholz, Ganguli, AISTATS2018]


The Limit Spectral Distribution $\mu_L$ of $JJ^\top$

[Figure: Pennington, Schoenholz, Ganguli, AISTATS2018]


Neural Tangent Kernel


Neural Tangent Kernel

Under the continuous-time version of gradient descent (gradient flow), the learning dynamics of the parameters is given by:

\frac{d\theta_t}{dt} = \eta (\nabla_\theta f_{\theta_t})^\top (y - f_{\theta_t})

(* The learning rate $\eta$ is fixed.) Then the learning dynamics of the DNN output is given by:

\frac{df_{\theta_t}}{dt} = \eta \Theta_t (y - f_{\theta_t})

where

\Theta_t = \nabla_\theta f_{\theta_t} (\nabla_\theta f_{\theta_t})^\top

**Informal [Jacot+ NeurIPS2018, Lee+ NeurIPS2019]** Under the wide limit $n \to \infty$, the learning dynamics of the DNN is approximated by

\frac{df_{\theta_t}}{dt} = \eta \Theta (y - f_{\theta_t})

where the neural tangent kernel is defined as

\Theta := \lim_{n_2, \dots, n_{L-1} \to \infty} \Theta_0
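As a rough numerical sketch, $\Theta_0 = \nabla_\theta f_{\theta_0}(\nabla_\theta f_{\theta_0})^\top$ can be formed for a tiny network by numerical differentiation; the widths and activation here are illustrative assumptions, and biases are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [4, 30, 30, 3]                         # tiny MLP (illustrative): n_0, ..., n_L
phi = np.tanh
Ws = [rng.normal(0, 1 / np.sqrt(m), size=(n, m)) for m, n in zip(widths[:-1], widths[1:])]
theta0 = np.concatenate([W.ravel() for W in Ws])
x = rng.normal(size=widths[0])

def f(flat_theta):
    """Forward pass with all weights packed into one flat parameter vector (no biases)."""
    out, offset = x, 0
    for k, (m, n) in enumerate(zip(widths[:-1], widths[1:])):
        W = flat_theta[offset:offset + n * m].reshape(n, m)
        offset += n * m
        out = W @ out
        if k < len(widths) - 2:
            out = phi(out)
    return out

eps = 1e-5
grad = np.zeros((widths[-1], theta0.size))      # nabla_theta f, one row per output coordinate
for p in range(theta0.size):
    e = np.zeros_like(theta0); e[p] = eps
    grad[:, p] = (f(theta0 + e) - f(theta0 - e)) / (2 * eps)

Theta0 = grad @ grad.T                          # one-sample NTK: an n_L x n_L matrix
print(np.linalg.eigvalsh(Theta0))
```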

Neural Tangent Kernel is a Surrogate Model of DNN+GD

[Figure from Google β€œFast and Easy Infinitely Wide Networks with Neural Tangents”]

Applicable to CNN/ResNet

Figure from [Google β€œFast and Easy Infinitely Wide Networks with Neural Tangents”]

Moreover, NTK is applicable to attention: "Infinite attention: NNGP and NTK for deep attention networks" [https://arxiv.org/abs/2006.10540]

Eigenvalue Spectrum of NTK

"Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks" https://arxiv.org/abs/2005.11879 They treat the standard formulation (Gaussian initialization × multiple samples × small output dimension) and derive the limiting spectra in that setting.


One-sample NTK

"The Spectrum of Fisher Information of Deep Networks Achieving Dynamical Isometry" https://arxiv.org/abs/2006.07814 [HK20] When the DNN achieves dynamical isometry, the spectrum of the (one-sample × high-dimensional output) "NTK" concentrates around its maximal value, and the maximal value is O(L). Under an assumption of asymptotic freeness, we have the following recursive equations:

\Theta_{\ell+1} = q_\ell + W_{\ell+1} D_\ell \Theta_\ell D_\ell W_{\ell+1}^\top, \quad \mu_{\ell+1} = (q_\ell + \sigma_{\ell+1}^2 \cdot)_* (\nu_\ell \boxtimes \mu_\ell)
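A rough sketch of how the matrix recursion would be iterated, with a made-up constant $q_\ell$, Haar orthogonal weights, a tanh-derivative $D_\ell$, and an illustrative initial condition $\Theta_1 = qI$; this only shows the mechanics of the recursion, not the paper's exact setting.

```python
import numpy as np

rng = np.random.default_rng(0)

def haar(n):
    Q, R = np.linalg.qr(rng.normal(size=(n, n)))
    return Q * np.sign(np.diag(R))

n, L, q = 300, 8, 1.0                           # width, depth, and a made-up constant q_ell
h = rng.normal(size=n)                          # stands in for the first pre-activation h_1
Theta = q * np.eye(n)                           # illustrative initial condition Theta_1
for ell in range(1, L):
    D = np.diag(1.0 - np.tanh(h) ** 2)          # D_ell = diag(phi'(h_ell)), phi = tanh
    W = haar(n)                                 # W_{ell+1}
    Theta = q * np.eye(n) + W @ D @ Theta @ D @ W.T   # Theta_{ell+1} = q_ell + W_{ell+1} D_ell Theta_ell D_ell W_{ell+1}^T
    h = W @ np.tanh(h)                          # propagate to h_{ell+1}

eigs = np.linalg.eigvalsh(Theta)
print(float(eigs.min()), float(eigs.max()))     # how spread out the NTK spectrum is after L layers
```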

NTK & Learning Rate

The spectrum (eigenvalues) of the NTK plays a vital role in tuning the learning dynamics. E.g. $\eta > 1/\lambda_\mathrm{max}(\Theta) \Longrightarrow$ the learning dynamics does not converge. E.g. the condition number $c = \lambda_\mathrm{min}/\lambda_\mathrm{max}$ determines the convergence speed.
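As a small illustration of how these quantities would be read off from an NTK matrix (the matrix below is a random positive semi-definite stand-in, not an actual NTK):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
Theta = A @ A.T / 50                                       # random PSD stand-in for an NTK matrix
lam = np.linalg.eigvalsh(Theta)
lam_min, lam_max = lam[0], lam[-1]
print("learning-rate threshold 1/lambda_max =", 1.0 / lam_max)   # larger eta is expected to diverge
print("condition number lambda_min/lambda_max =", lam_min / lam_max)  # governs the convergence speed
```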

Red line (the border line of exploding gradients): this line is predicted by our theory!


Asymptotic Freeness


Asymptotic Freeness and Free Probability Theory

Definition (Asymptotic freeness, $C^*$-version) [Voiculescu '85]. Let $(A_j(n), A_j(n)^*)_{j \in J}$ be a family of $n \times n$ random matrices and their adjoints. The family is said to be asymptotically free almost surely if there exist $C^*$-probability spaces $(\mathfrak{A}_j, \tau_j)_{j \in J}$ and elements $(a_j \in \mathfrak{A}_j)_{j \in J}$ such that for any $Q \in \mathbb{C}\langle X_j, X_j^* \mid j \in J \rangle$, the following holds:

\lim_{n \to \infty} \mathrm{tr}_n \left[ Q(A_j(n), A_j(n)^* \mid j \in J) \right] = (*_{j \in J} \tau_j) \left[ Q(a_j, a_j^* \mid j \in J) \right]

Example

For $N \in \mathbb{N}$, let

  • W(N)W(N) be Ginibre or Haar orthogonal random matrix,
  • D(N)D(N) be constant diagonal matrix with a limit distirubution as Nβ†’βˆž.N \to \infty. Then (W, W^*) andΒ  D are a.s. asymptotically free as N \to \infty.

Asymptotic Freeness of Jacobians

Let $W_\ell, D_\ell$ ($\ell = 1, 2, \dots, L$) be the random matrices determined by the MLP.

Theorem [CH22]. Assume that $D_1, \dots, D_{L-1}$ have limit joint moments. Then

(W_1, W_1^\top), \dots, (W_L, W_L^\top), (D_1, \dots, D_{L-1})

are asymptotically free as $n \to \infty$ almost surely.

Difficulty: the entries of $D_\ell$ and $W_\ell$ are not independent, since $D_\ell = \mathrm{diag}(\varphi^\prime(h_\ell))$ and $h_\ell = W_\ell x_{\ell-1}$.

(Sketch of proof) Invariance of the MLP + taking a submatrix. Construct an orthogonal matrix $U_\ell$ fixing $x_\ell$, i.e.

U_\ell x_\ell = x_\ell,

and

U_\ell |_{(\mathbb{C} x_\ell)^\bot} \sim (N-1) \times (N-1) \ \text{Haar Orthogonal}

with

(U_0, \dots, U_\ell) \perp\!\!\!\!\perp (W_{\ell+1}, \dots, W_L)

for $\ell = 0, \dots, L-1$. Then we only need to show the asymptotic freeness of the $(N-1) \times (N-1)$ submatrices of

(U_0 W_1, \dots, U_{L-1} W_L), \quad (D_1, \dots, D_{L-1}).
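For intuition about the invariance step, here is one way such a $U_\ell$ could be realized numerically: conjugate the block matrix $\mathrm{diag}(1, \tilde{O})$, with $\tilde{O}$ Haar orthogonal of size $(N-1) \times (N-1)$, by an orthogonal matrix whose first column spans the line through $x_\ell$. This construction is only my illustration of the idea, not the paper's exact proof device.

```python
import numpy as np

def orthogonal_fixing(x, rng=np.random.default_rng(0)):
    """Return an orthogonal U with U x = x whose restriction to the complement of x is Haar."""
    n = x.size
    # Orthonormal basis V whose first column spans the line through x.
    V, _ = np.linalg.qr(np.column_stack([x, rng.normal(size=(n, n - 1))]))
    # Haar orthogonal (N-1) x (N-1) block acting on the orthogonal complement of x.
    Q, R = np.linalg.qr(rng.normal(size=(n - 1, n - 1)))
    block = np.zeros((n, n))
    block[0, 0] = 1.0
    block[1:, 1:] = Q * np.sign(np.diag(R))
    return V @ block @ V.T

x = np.random.default_rng(1).normal(size=200)
U = orthogonal_fixing(x)
print(np.allclose(U @ x, x), np.allclose(U @ U.T, np.eye(200)))
```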

Summary


Summary

Considering neural networks with random parameters… $\Longrightarrow$

  1. Tuning initialization and learning rate
  2. Bayesian Estimation with NNGP
  3. Understanding Dynamics with NTK

Background: Random Matrix Theory and Free Probability Theory


Future Work

MLP-like NNs were previously used only for toy models, but are now being applied to real-world images and 3D data, e.g. gMLP, NeRF, etc. They are tractable both theoretically and practically, which makes them just the right next research target! [Mildenhall et al., "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis", ECCV 2020]