IMI: Random Matrices and Deep Neural Networks

2023-07-12 explainer


Deep neural networks play a central role in machine learning, and random matrices often appear when we try to understand them: in particular, they arise frequently in parameter initialization and in the analysis of learning dynamics. This lecture surveys topics connecting random matrices and deep neural networks, focusing on free probability theory, mean field theory, the NNGP (Neural Network Gaussian Process), and the NTK (Neural Tangent Kernel).

Random Matrices and Deep Neural Networks

Talk at IMI, July 12, 2023
Tomohiro Hayase,
Senior Research Scientist,
Cluster Metaverse Lab.

Table of Contents

  1. Overview
    Deep neural network and Gaussian process
  2. Jacobian
    Stability of DNN and random matrices
  3. NTK
    Training dynamics and random matrices
  4. Asymptotic Freeness
    Main theorem: asymptotic freeness of Jacobians
  5. Summary & In Progress

Overview


Multilayer Perceptron

[Figure: https://www.javatpoint.com/multi-layer-perceptron-in-tensorflow]

Let $n_0, n_1, \dots, n_L \in \N$. Parameters:

$$\theta = (W_\ell, b_\ell)_{\ell=1, \dots, L}, \qquad W_\ell \in \R^{n_\ell \times n_{\ell-1}},\ b_\ell \in \R^{n_\ell}.$$

Forward propagation: for $x \in \R^{n_0}$, set $x_0 = x$ and inductively

$$h_\ell = W_\ell x_{\ell-1} + b_\ell, \qquad x_\ell = \varphi(h_\ell) := (\varphi(h_{\ell,i}))_{i \in [n_\ell]}.$$

Finally, define the output by $f_\theta(x) = h_L$.

$\varphi$: Activation Function
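To fix notation, here is a minimal NumPy sketch of the forward pass above. The function names are ours, and the fan-in scaling $\sigma_w/\sqrt{n_{\ell-1}}$ is one common convention for the slide's $\sigma_w^2/n_\ell$ variance:

```python
import numpy as np

def init_params(widths, sigma_w=1.0, sigma_b=0.0, rng=None):
    """Random parameters theta = (W_l, b_l); W_l entries ~ N(0, sigma_w^2 / fan-in)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(0.0, sigma_w / np.sqrt(n_in), size=(n_out, n_in))
        b = rng.normal(0.0, sigma_b, size=n_out)
        params.append((W, b))
    return params

def forward(params, x, phi=np.tanh):
    """h_l = W_l x_{l-1} + b_l, x_l = phi(h_l); output f_theta(x) = h_L (no last activation)."""
    for l, (W, b) in enumerate(params):
        h = W @ x + b
        x = phi(h) if l < len(params) - 1 else h
    return x

y = forward(init_params([3, 100, 100, 2]), np.ones(3))  # widths n_0=3, n_1=n_2=100, n_L=2
```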


Deep Learning

Generally, a standard formulation of supervised deep learning is as follows:

  1. We are given a finite set of pairs of input/output data $(x, y) \in \mathcal{D}$.
  2. We are given a deep neural network (DNN), a composition of (parameterized) transformations that maps a real vector to a real vector.
  3. We are given an objective function, e.g., the mean squared loss
$$L(x, y, \theta) = \frac{1}{2n_L}\sum_{j=1}^{n_L} \left( f_\theta(x)_j - y_j \right)^2.$$

Optimization

We minimize the loss function by gradient descent:

$$\theta_{t+1} = \theta_t - \eta_t \frac{\partial}{\partial \theta} L(x, y, \theta_t).$$
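A minimal sketch of one GD step, treating $\theta$ as a single flat vector and using a central finite-difference gradient purely for illustration (real frameworks use backpropagation; the names `loss` and `gd_step` are ours):

```python
import numpy as np

def loss(f, x, y, theta):
    """Mean squared loss: 1/(2 n_L) * sum_j (f_theta(x)_j - y_j)^2."""
    return 0.5 * np.mean((f(theta, x) - y) ** 2)

def gd_step(f, x, y, theta, eta=1e-2, eps=1e-6):
    """theta_{t+1} = theta_t - eta * dL/dtheta, gradient by finite differences."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (loss(f, x, y, theta + d) - loss(f, x, y, theta - d)) / (2 * eps)
    return theta - eta * g
```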

Initialization of Parameters and Random Matrices

e.g. Gaussian (Ginibre) random matrix:

$$(W_\ell)_{i,j} \sim \mathcal{N}(0, \sigma_w^2/n_\ell), \ \mathrm{i.i.d.}$$

e.g. Haar-distributed orthogonal matrix:

$$W_\ell = \sigma_w O, \qquad O \sim \mathrm{Haar\ orthogonal\ probability\ measure}.$$
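Both initializations in a few lines of NumPy; the QR-with-sign-fix recipe for sampling from the Haar measure is a standard construction (e.g., Mezzadri's), but treat this as a sketch:

```python
import numpy as np

def ginibre(n_out, n_in, sigma_w, rng):
    """(W)_{ij} ~ N(0, sigma_w^2 / n) i.i.d. (here n = fan-in)."""
    return rng.normal(0.0, sigma_w / np.sqrt(n_in), size=(n_out, n_in))

def haar_orthogonal(n, sigma_w, rng):
    """W = sigma_w * O with O ~ Haar measure on the orthogonal group O(n)."""
    Q, R = np.linalg.qr(rng.normal(size=(n, n)))
    return sigma_w * (Q * np.sign(np.diag(R)))  # sign fix makes Q exactly Haar
```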

The Infinite-dimensional Limit is Gaussian

[Figure: https://ai.googleblog.com/2020/03/fast-and-easy-infinitely-wide-networks.html]


Neural Network Gaussian Process (NNGP)

Consider two inputs $x, x^\prime$ and the corresponding hidden units $x_\ell, x^\prime_\ell$ and $h_\ell, h^\prime_\ell$ in the MLP. Taking an infinite-dimensional limit at the initial state, we have [Lee+ ICLR2018]

$$(h_\ell, h_\ell^\prime) \sim \mathcal{N}(0, \sigma_w^2 K_\ell(x, x^\prime) + \sigma_b^2),$$

where

$$K_\ell(x, x^\prime) := \lim_{n_\ell \to \infty} \frac{1}{n_\ell} \sum_{j=1}^{n_\ell} x_{\ell,j}\, x^\prime_{\ell,j}.$$

We have the following Kernel Propagation:

$$K_{\ell+1}(x, x^\prime) = \int \varphi(z_1)\varphi(z_2)\, p_\mathcal{N}(z)\, dz,$$

where

$$p_\mathcal{N} = \mathcal{N}\!\left(0,\ \sigma_w^2 \begin{pmatrix} K_\ell(x,x) & K_\ell(x,x^\prime) \\ K_\ell(x,x^\prime) & K_\ell(x^\prime,x^\prime) \end{pmatrix} + \sigma_b^2 \right).$$
  • For some activation functions, we can compute the integral explicitly.
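When no closed form is available, the kernel propagation can be checked numerically by Monte Carlo integration over the 2-D Gaussian $p_\mathcal{N}$. A sketch (the function name and the tanh choice are ours):

```python
import numpy as np

def kernel_step(Kxx, Kxy, Kyy, phi=np.tanh, sigma_w=1.0, sigma_b=0.0,
                n_mc=200_000, rng=None):
    """K_{l+1}(x, x') = E[phi(z_1) phi(z_2)], (z_1, z_2) ~ N(0, sigma_w^2 K_l + sigma_b^2)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    cov = sigma_w**2 * np.array([[Kxx, Kxy], [Kxy, Kyy]]) + sigma_b**2
    z = rng.multivariate_normal(np.zeros(2), cov, size=n_mc)
    return float(np.mean(phi(z[:, 0]) * phi(z[:, 1])))

# one propagation step updates all entries of the 2x2 kernel matrix, e.g.:
# K_next_xy = kernel_step(Kxx, Kxy, Kyy); K_next_xx = kernel_step(Kxx, Kxx, Kxx)
```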

Application: NNGP Estimation

Consider $B$ samples. Let $X = (x(a))_{a=1,\dots,B}$ and $Y = (y(a))_{a=1,\dots,B}$ be the input/output samples.

$$K(x^*, X) := \big(K_L(x^*, x(a))\big)_{a=1, \dots, B} \in \R^B, \qquad K(X, X) := \big(K_L(x(a), x(b))\big)_{a,b} \in M_B(\R).$$

Then the posterior mean/variance is given as follows: for a new input $x^*$,

$$m(y^*) = K(x^*, X)\, K(X, X)^{-1} Y, \qquad v(y^*) = K(x^*, x^*) - K(x^*, X)\, K(X, X)^{-1} K(x^*, X).$$

[Lee et al., "Deep Neural Networks as Gaussian Processes", ICLR 2018]
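These are the usual GP regression equations; a sketch using a linear solve instead of an explicit inverse (the jitter term is our numerical safeguard, not part of the theory):

```python
import numpy as np

def nngp_posterior(K_star_X, K_XX, K_star_star, Y, jitter=1e-6):
    """m(y*) = K(x*,X) K(X,X)^{-1} Y ;  v(y*) = K(x*,x*) - K(x*,X) K(X,X)^{-1} K(x*,X)."""
    K = K_XX + jitter * np.eye(K_XX.shape[0])  # jitter for numerical stability
    mean = K_star_X @ np.linalg.solve(K, Y)
    var = K_star_star - K_star_X @ np.linalg.solve(K, K_star_X)
    return mean, var
```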

Jacobian


Vanishing/Exploding Gradients

Optimizing a DNN requires its parameter derivatives. Since a DNN is a composition of functions, these derivatives are computed by the chain rule. The input-output Jacobian is defined as

$$J = \frac{\partial f_\theta(x)}{\partial x} = \frac{\partial h_L}{\partial x}.$$

In the case of MLP, we have

$$J = W_L D_{L-1} \cdots W_2 D_1 W_1,$$

where

$$D_\ell = \frac{\partial x_\ell}{\partial h_\ell} = \mathrm{diag}\big( \varphi^\prime(h_{\ell,1}), \dots, \varphi^\prime(h_{\ell,n_\ell}) \big).$$
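The product can be accumulated along the forward pass; a sketch reusing the `init_params` format from the earlier sketch, with tanh assumed so that $\varphi^\prime = 1 - \tanh^2$:

```python
import numpy as np

def io_jacobian(params, x, phi=np.tanh, dphi=lambda h: 1.0 - np.tanh(h) ** 2):
    """Accumulate J = W_L D_{L-1} ... D_1 W_1 along the forward pass."""
    J = None
    for l, (W, b) in enumerate(params):
        h = W @ x + b
        J = W if J is None else W @ J
        if l < len(params) - 1:        # D_l only for hidden layers
            J = dphi(h)[:, None] * J   # left-multiply by diag(phi'(h_l))
            x = phi(h)
    return J

# eigenvalues of J J^T = squared singular values of J:
# sq = np.linalg.svd(io_jacobian(params, x), compute_uv=False) ** 2
```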

Dynamical Isometry

A DNN is said to achieve dynamical isometry if the eigenvalue distribution of $JJ^\top$ is concentrated around one. Dynamical isometry prevents exploding/vanishing gradients.

[Pennington+, AISTATS2018; CH, CIMP2022] If we initialize the parameters to be Haar orthogonal and choose an appropriate activation function, then the DNN can be made to achieve dynamical isometry.

Let $\mu_L, \nu$ be the limit spectral distributions of $JJ^\top$ and $D^2$ in the wide limit, respectively.

Under the assumption of asymptotic freeness of the Jacobians,

$$\mu_L = \left[ (\sigma^2 \cdot)_* \nu \right]^{\boxtimes L},$$

where $\boxtimes$ is the free multiplicative convolution and $(\sigma^2 \cdot)_* \nu$ is the pushforward of $\nu$ under scaling by $\sigma^2$.

Distribution of $D^2$

[Figure: Pennington, Schoenholz, Ganguli, AISTATS2018]


The Limit Spectral Distribution $\mu_L$ of $JJ^\top$

[Figure: Pennington, Schoenholz, Ganguli, AISTATS2018]


Neural Tangent Kernel


Neural Tangent Kernel

Under the continuous-time version of GD (gradient flow), the learning dynamics of the parameters is given by:

$$\frac{d\theta_t}{dt} = \eta\, (\nabla_\theta f_{\theta_t})^\top (y - f_{\theta_t}).$$

(* The learning rate $\eta$ is fixed.) Then the learning dynamics of the DNN is given by:

$$\frac{d f_{\theta_t}}{dt} = \eta\, \Theta_t (y - f_{\theta_t}),$$

where

$$\Theta_t = \nabla_\theta f_{\theta_t} (\nabla_\theta f_{\theta_t})^\top.$$

**Informal [Jacot+ NeurIPS2018, Lee+ NeurIPS2019]** Under the wide limit $n \to \infty$, the learning dynamics of the DNN is approximated by

$$\frac{d f_{\theta_t}}{dt} = \eta\, \Theta (y - f_{\theta_t}),$$

where the neural tangent kernel is defined as

$$\Theta := \lim_{n_1, \dots, n_{L-1} \to \infty} \Theta_0.$$
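The NTK at initialization can also be estimated empirically: build the parameter Jacobian $\nabla_\theta f$ and take its Gram matrix. A finite-difference sketch (in practice one uses autodiff, e.g., the neural-tangents library; `f(theta, x)` and `empirical_ntk` are our names):

```python
import numpy as np

def empirical_ntk(f, theta, x1, x2, eps=1e-5):
    """Theta_0(x1, x2) = (d f/d theta)(x1) @ (d f/d theta)(x2)^T."""
    def jac(x):
        cols = []
        for i in range(theta.size):
            d = np.zeros_like(theta)
            d[i] = eps
            cols.append((f(theta + d, x) - f(theta - d, x)) / (2 * eps))
        return np.stack(cols, axis=1)  # shape (n_out, n_params)
    return jac(x1) @ jac(x2).T         # shape (n_out, n_out)
```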

The Neural Tangent Kernel Is a Surrogate Model of DNN + GD

Based on the NTK, we can do Bayesian estimation in the same way as with the NNGP. Moreover, with the NTK, we can simulate gradient descent at any step $t$ for an ensemble of networks. [Figure from Google, "Fast and Easy Infinitely Wide Networks with Neural Tangents"]

Applicable to CNN/ResNet

Figure from [Google “Fast and Easy Infinitely Wide Networks with Neural Tangents”]

Moreover, the NTK is applicable to attention: "Infinite attention: NNGP and NTK for deep attention networks" [https://arxiv.org/abs/2006.10540]

Eigenvalue Spectrum of NTK

"Spectra of the Conjugate Kernel and Neural Tangent Kernel for Linear-Width Neural Networks", Z. Fan & Z. Wang, https://arxiv.org/abs/2005.11879. They treat the standard formulation (Gaussian initialization × multiple samples × small output dimension) and obtain a recurrence equation for the limit spectral distribution of the NTK. Figures: red lines are the theoretical prediction.


One-sample NTK

TH & R. Karakida, "The Spectrum of Fisher Information of Deep Networks Achieving Dynamical Isometry", https://arxiv.org/abs/2006.07814, in AISTATS 2021. When the DNN achieves dynamical isometry, the spectrum of the (one-sample × high-dimensional-output) "NTK" concentrates around the maximal value, and the maximal value is $O(L)$. (Sketch) Under an assumption of asymptotic freeness, we have the following recursive equations:

$$\Theta_{\ell+1} = q_\ell + W_{\ell+1} D_\ell \Theta_\ell D_\ell W_{\ell+1}^\top, \qquad \mu_{\ell+1} = (q_\ell + \sigma_{\ell+1}^2 \cdot)_* (\nu_\ell \boxtimes \mu_\ell).$$

NTK & Learning Rate

The spectrum (eigenvalues) of the NTK plays a vital role in tuning the learning dynamics. E.g., $\eta > 1/\lambda_\mathrm{max}(\Theta) \Longrightarrow$ the learning dynamics does not converge. E.g., the condition number $c = \lambda_\mathrm{min}/\lambda_\mathrm{max}$ determines the convergence speed.

Red line (the border line of exploding gradients): this line is predicted by our theory!
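A quick way to see the role of $\lambda_\mathrm{max}$: iterate the discretized linearized dynamics and watch the residual $r_t = f_t - y$. It contracts iff $|1 - \eta\lambda| < 1$ for every eigenvalue $\lambda$ of $\Theta$, which makes the critical learning rate visible. A sketch (the function name is ours):

```python
import numpy as np

def residual_norms(Theta, r0, eta, steps=50):
    """r_{t+1} = (I - eta * Theta) r_t, the discretization of df/dt = eta*Theta*(y - f)."""
    A = np.eye(len(r0)) - eta * Theta
    r, norms = r0.copy(), []
    for _ in range(steps):
        r = A @ r
        norms.append(float(np.linalg.norm(r)))
    return norms  # grows geometrically once eta exceeds the critical value
```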


Asymptotic Freeness


Asymptotic Freeness and Free Probability Theory

Definition (Asymptotic freeness, C*-version) [Voiculescu '85]. Let $(A_j(n), A_j(n)^*)_{j \in J}$ be a family of $n \times n$ random matrices and their adjoints. The family is said to be asymptotically free almost surely if there exist C*-probability spaces $(\mathfrak{A}_j, \tau_j)_{j \in J}$ and elements $(a_j \in \mathfrak{A}_j)_{j \in J}$ such that for any $Q \in \mathbb{C}\langle X_j, X_j^* \mid j \in J \rangle$, the following holds:

$$\lim_{n \to \infty} \mathrm{tr}_n \left[ Q(A_j(n), A_j(n)^* \mid j \in J) \right] = (*_{j \in J}\, \tau_j) \left[ Q(a_j, a_j^* \mid j \in J) \right],$$

where $*_{j \in J}\, \tau_j$ is the free product of the tracial states.

Example

For $N \in \N$, let

  • $W(N)$ be a Ginibre or Haar orthogonal random matrix,
  • $D(N)$ be a constant diagonal matrix with a limit distribution as $N \to \infty$.

Then $(W, W^*)$ and $D$ are a.s. asymptotically free as $N \to \infty$; see the numerical check below.
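For Haar orthogonal $W$ and diagonal $D_1, D_2$, asymptotic freeness predicts the mixed moment $\mathrm{tr}_n(W D_1 W^\top D_2) \to \mathrm{tr}_n(D_1)\,\mathrm{tr}_n(D_2)$, which is easy to check numerically. A sketch (the uniform diagonal entries are an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
Q, R = np.linalg.qr(rng.normal(size=(n, n)))
W = Q * np.sign(np.diag(R))        # Haar orthogonal
d1 = rng.uniform(0.0, 1.0, n)      # diagonals with limit distributions
d2 = rng.uniform(-1.0, 1.0, n)

tr = lambda A: np.trace(A) / n     # normalized trace tr_n
lhs = tr(W @ np.diag(d1) @ W.T @ np.diag(d2))
rhs = d1.mean() * d2.mean()        # freeness prediction tr_n(D1) tr_n(D2)
print(lhs, rhs)                    # close for large n
```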

Asymptotic Freeness of Jacobians

Let $W_\ell$ ($\ell = 1, 2, \dots, L$) be the weight matrices of the MLP, taken to be scaled Haar orthogonal random matrices, and let $D_\ell$ be the diagonal Jacobian matrices defined above. (The Gaussian case is treated by [B. Hanin and M. Nica], [L. Pastur], [G. Yang].)

Theorem [CH22]. Assume that $D_1, \dots, D_{L-1}$ have limit joint moments. Then

$$(W_1, W_1^\top), \dots, (W_L, W_L^\top), (D_1, \dots, D_{L-1})$$

are asymptotically free as $n \to \infty$, almost surely.

Difficulty: the entries of $D_\ell$ and $W_\ell$ are not independent, since

$$D_\ell = \mathrm{diag}(\varphi^\prime(h_\ell)), \qquad h_\ell = W_\ell x_{\ell-1}.$$

(Sketch of Proof) Invariance of the MLP + taking a submatrix. Construct an orthogonal matrix $U_\ell$ fixing $x_\ell$, i.e.,

$$U_\ell x_\ell = x_\ell,$$

and

$$U_\ell \big|_{(\R x_\ell)^\bot} \sim (N-1) \times (N-1) \text{ Haar orthogonal},$$

with

$$(U_0, \dots, U_\ell) \perp\!\!\!\!\perp (W_{\ell+1}, \dots, W_L)$$

for $\ell = 0, \dots, L-1$. Then we only need to show the asymptotic freeness of the $(N-1) \times (N-1)$ submatrices of

$$(U_0 W_1, \dots, U_{L-1} W_L), \qquad (D_1, \dots, D_{L-1}).$$

Summary


Summary

Considering neural networks with random parameters… $\Longrightarrow$

  1. Tuning initialization and learning rate
  2. Bayesian Estimation with NNGP
  3. Understanding Dynamics with NTK
  4. If we focus on the spectrum, free probability appears in the theory.

In Progress: Theoretical Understanding of NeRF

MLP-like NNs were previously used only for toy models, but they are now being applied to real-world 2D images and 3D data, e.g., MLP-Mixer, NeRF, etc. They are comparatively easy to handle both theoretically and practically, which makes them just the right next research target! [Mildenhall et al., "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis", ECCV 2020]

In progress: MLP-Mixer as a wide and sparse MLP

TH & R. Karakida, "MLP-Mixer as a Wide and Sparse MLP", arXiv preprint, https://arxiv.org/abs/2306.01470. The multi-layer perceptron (MLP) is a fundamental component of deep learning that has been extensively employed for various problems. However, recent empirical successes of MLP-based architectures, particularly the progress of the MLP-Mixer, have revealed that there is still hidden potential in improving MLPs to achieve better performance. Excluding auxiliary components, the basic block of the MLP-Mixer is as follows:

$$Y = \phi\big( V \phi( X W^\top) \big).$$

It uses both left and right matrix multiplications. Here we introduce the conjugation (transposition) operator:

$$J: \mathrm{Vec}(X) \mapsto \mathrm{Vec}(X^\top).$$

Then $J\phi = \phi J$ and

$$Y = \phi\big(V \phi(J^\top W J X)\big) = \mathrm{Mat}\Big( \phi\big( (1 \otimes V)\, \phi\big( J^\top (1 \otimes W) J\, \mathrm{Vec}(X) \big) \big) \Big).$$

Thus the MLP-Mixer is a kind of MLP with sparse weights (i.e., many connections are set to zero).
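The identification can be verified numerically: the Mixer block on the matrix $X$ equals an MLP on $\mathrm{Vec}(X)$ whose weights are the sparse Kronecker products $1 \otimes W$ (block diagonal) and $V \otimes 1$ (which is $J^\top (1 \otimes V) J$ in the notation above). A small sketch with row-major vectorization:

```python
import numpy as np

rng = np.random.default_rng(0)
S, C = 4, 3                      # tokens x channels
X = rng.normal(size=(S, C))
W = rng.normal(size=(C, C))      # channel-mixing weights
V = rng.normal(size=(S, S))      # token-mixing weights
phi = np.tanh

Y = phi(V @ phi(X @ W.T))        # Mixer block acting on the matrix X

vec = lambda A: A.reshape(-1)    # row-major Vec
W_big = np.kron(np.eye(S), W)    # 1 (x) W : block diagonal, channel mixing
V_big = np.kron(V, np.eye(C))    # J^T (1 (x) V) J : token mixing
y = phi(V_big @ phi(W_big @ vec(X)))

assert np.allclose(vec(Y), y)    # the Mixer block is a sparse-weight MLP
```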

Even if we destroy the architecture of the MLP-Mixer by replacing $J$ with a uniformly distributed random permutation (RP), the accuracy increases as the width increases. We experimentally confirmed that the following hypothesis by Golubeva et al. also holds for the MLP-Mixer: an increase in the width while maintaining a fixed number of weight parameters leads to an improvement in test accuracy.

A. Golubeva et al., "Are Wider Nets Better Given the Same Number of Parameters?", in ICLR 2021.