Random Matrices, Free Probability, and Deep Neural Networks

2024-09-09 explainer


Random Matrices, Free probability, and Deep Neural Networks

Tomohiro Hayase,
Cluster Metaverse Lab, Japan. Talk at the workshop "Bridges between Machine Learning and Quantum Information Science" (CIRM), 2024/09/09-10. Based on T.H. and Benoit Collins, "Asymptotic freeness of layerwise Jacobians caused by invariance of multilayer perceptron", Comm. Math. Phys. (2023), https://link.springer.com/article/10.1007/s00220-022-04441-7.

Also based on T.H. and Ryo Karakida, "Understanding MLP-Mixer as a Wide and Sparse MLP", in ICML 2024.

Table of Contents

  1. Overview
    Deep neural network and Gaussian process

  2. Jacobian
    Stability of DNN and random matrices

  3. NTK
    Training dynamics and random matrices

  4. Asymptotic Freeness
    Main theorem: asymptotic freeness of Jacobians

  5. MLP-Mixer

    1. Preliminaries
    2. Similarity between MLP and MLP-Mixer
    3. Effective Width
    4. Monarch Matrices
  6. Alternative to MLP with static masks

    1. Random Permuted Mixers
    2. Revisit the similarity in wider cases
  7. Summary and Future Work


1. Overview


Multilayer Perceptron

[Figure: https://www.javatpoint.com/multi-layer-perceptron-in-tensorflow] Let n_0, n_1, \dots, n_L \in \N. Parameters:

\theta = (W_\ell, b_\ell)_{\ell=1, \dots, L},\quad W_\ell \in \R^{n_\ell \times n_{\ell-1}},\ b_\ell \in \R^{n_\ell}.

Forward propagation: for x \in \R^{n_0}, set x_0 = x and inductively

h_\ell = W_\ell x_{\ell-1} + b_\ell,\quad x_\ell = \varphi(h_\ell):= (\varphi(h_{\ell,i}))_{i\in[n_\ell]}.

Finally, define the output by f_\theta(x) = h_L.

\varphi: Activation Function
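As a concrete reference, here is a minimal NumPy sketch of the forward propagation above; the widths, the tanh activation, and the Gaussian initialization are illustrative choices only, not the settings used later in the talk.

```python
import numpy as np

def forward(x, weights, biases, phi=np.tanh):
    """Forward propagation: h_ell = W_ell x_{ell-1} + b_ell, x_ell = phi(h_ell)."""
    hs, x_prev = [], x
    for ell, (W, b) in enumerate(zip(weights, biases), start=1):
        h = W @ x_prev + b              # pre-activation h_ell
        hs.append(h)
        if ell < len(weights):          # the last layer is left linear: f_theta(x) = h_L
            x_prev = phi(h)             # x_ell = phi(h_ell), applied entrywise
    return hs

# Illustrative widths (n_0, n_1, n_2, n_3) = (4, 8, 8, 3) and Gaussian initialization.
rng = np.random.default_rng(0)
ns = [4, 8, 8, 3]
weights = [rng.normal(0, 1 / np.sqrt(ns[l]), size=(ns[l + 1], ns[l])) for l in range(len(ns) - 1)]
biases = [np.zeros(ns[l + 1]) for l in range(len(ns) - 1)]
print(forward(rng.normal(size=ns[0]), weights, biases)[-1])   # f_theta(x) = h_L, a 3-dim vector
```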


Deep Learning

Generally, a standard formulation of supervised deep learning is as follows:

  1. We are given a finite set of pairs of input/output data (x,y) \in \mathcal{D}.
  2. We are given a deep neural network (DNN), which is a composition of (parameterized) transformations that map a real vector to a real vector.
  3. We are given an objective function, e.g. the mean squared loss
L(x,y, \theta) = \frac{1}{2n_L}\sum_{j=1}^{n_L} ( f_\theta(x)_j - y_j)^2.

Optimization

We minimize the loss function by gradient descent:

\theta_{t+1} = \theta_t - \eta_t \frac{\partial}{\partial \theta}L (x,y, \theta_t)

Initialization of Parameters and Random Matrices

e.g. Gaussian (Ginibre) random matrix:

(W_\ell)_{i,j} \sim \mathcal{N}(0, \sigma_w^2/N), \mathrm{\ i.i.d.}

e.g. Haar distributed orthogonal matrix:

W_\ell= \sigma_w O,\quad O\sim \mathrm{Haar \ Orthogonal \ Prob.}
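A minimal sketch of these two initializations (the width and \sigma_w below are illustrative); the Haar orthogonal matrix is sampled by the standard QR trick with a sign correction on the diagonal of R.

```python
import numpy as np

def ginibre_init(n_out, n_in, sigma_w, rng):
    # (W_ell)_{ij} ~ N(0, sigma_w^2 / N) i.i.d., with N = n_in (fan-in)
    return rng.normal(0.0, sigma_w / np.sqrt(n_in), size=(n_out, n_in))

def haar_orthogonal_init(n, sigma_w, rng):
    # Haar orthogonal O via QR of a square Ginibre matrix, with the diagonal
    # sign correction that makes the law exactly Haar; then W_ell = sigma_w * O.
    Z = rng.normal(size=(n, n))
    Q, R = np.linalg.qr(Z)
    return sigma_w * (Q * np.sign(np.diag(R)))

rng = np.random.default_rng(0)
W = haar_orthogonal_init(256, 1.0, rng)
print(np.allclose(W @ W.T, np.eye(256)))   # True: orthogonal (up to the scale sigma_w)
```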

The Infinite-dimensional Limit is Gaussian

[Figure: https://ai.googleblog.com/2020/03/fast-and-easy-infinitely-wide-networks.html]


Neural Network Gaussian Process (NNGP)

Consider two inputs x, x^\prime and the corresponding hidden units x_\ell, x^\prime_\ell and h_\ell, h^\prime_\ell in the MLP. Taking the infinite-width limit at initialization, we have [Lee+ ICLR2018]

(h_\ell, h_\ell^\prime)\sim \mathcal{N}(0, \sigma_w^2 K_\ell(x,x^\prime) + \sigma_b^2)

where

K_\ell(x, x^\prime) := \lim_{n_\ell \to \infty} \frac{1}{n_\ell} \sum_{j=1}^{n_\ell} x_{\ell,j} x^\prime_{\ell,j}.

We have the following Kernel Propagation:

K_{\ell+1}(x, x^\prime) = \int \varphi(z_1)\varphi(z_2)\, p_\mathcal{N}(z)\, dz,

where

p_\mathcal{N}= \mathcal{N}\left( 0,\ \sigma_w^2\begin{pmatrix} K_\ell(x,x) & K_\ell(x,x^\prime)\\ K_\ell(x,x^\prime)& K_\ell(x^\prime,x^\prime)\end{pmatrix} + \sigma_b^2\right).
  • For some activation functions (e.g. ReLU or erf), we can compute the integral explicitly; a sketch for the ReLU case follows below.
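For the ReLU activation the Gaussian integral above has the well-known arc-cosine closed form. A minimal sketch of one step of the kernel propagation in that case (the values of \sigma_w, \sigma_b and the initial kernel are illustrative):

```python
import numpy as np

def relu_kernel_step(Kxx, Kxy, Kyy, sigma_w=1.5, sigma_b=0.1):
    """One step K_ell -> K_{ell+1} for phi = ReLU, via the closed-form Gaussian integral."""
    # Covariance of (z_1, z_2) under p_N, as in the 2x2 matrix above.
    Sxx = sigma_w**2 * Kxx + sigma_b**2
    Syy = sigma_w**2 * Kyy + sigma_b**2
    Sxy = sigma_w**2 * Kxy + sigma_b**2
    c = np.clip(Sxy / np.sqrt(Sxx * Syy), -1.0, 1.0)
    theta = np.arccos(c)
    # E[ReLU(z_1) ReLU(z_2)] for a centered bivariate Gaussian (arc-cosine kernel).
    Kxy_next = np.sqrt(Sxx * Syy) / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))
    return Sxx / 2.0, Kxy_next, Syy / 2.0      # diagonal case: E[ReLU(z)^2] = S/2

# Iterate the recursion for two unit-norm inputs with initial overlap 0.5.
K = (1.0, 0.5, 1.0)
for _ in range(5):
    K = relu_kernel_step(*K)
print(K)
```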

Application: NNGP Estimation

Generally, consider B samples. Let X=(x(a))_{a=1,\dots, B}, Y=(y(a))_{a=1, \dots, B} be input/output samples.

K(x^*, X) := (K_L(x^*, x(a)))_{a=1, \dots, B} \in \R^B, \quad K(X,X):= ( K_L(x(a), x(b)) )_{a,b} \in M_B(\R)

Then the posterior mean/variance is given by the following: for a new input x^*,

m(y^*) = K(x^*,X)K(X,X)^{-1}Y, \quad v(y^*) = K(x^*, x^*) - K(x^*,X)K(X,X)^{-1}K(x^*, X)

[Lee et al., Deep Neural Networks as Gaussian Processes, ICLR 2018]
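A minimal sketch of this posterior computation, assuming K_L has already been evaluated on the training inputs and the test input; the jitter term and the toy RBF kernel standing in for K_L are illustrative additions, not part of the formulas above.

```python
import numpy as np

def nngp_predict(K_XX, K_xX, K_xx, Y, jitter=1e-6):
    """Posterior mean m(y*) and variance v(y*) as in the formulas above.

    K_XX: (B, B) kernel on training inputs, K_xX: (B,) kernel between x* and X,
    K_xx: scalar K_L(x*, x*), Y: (B,) or (B, n_L) training targets.
    """
    A = K_XX + jitter * np.eye(K_XX.shape[0])     # small jitter for numerical stability
    mean = K_xX @ np.linalg.solve(A, Y)           # m(y*) = K(x*,X) K(X,X)^{-1} Y
    var = K_xx - K_xX @ np.linalg.solve(A, K_xX)  # v(y*)
    return mean, var

# Toy usage with an RBF kernel standing in for K_L (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 1)); Y = np.sin(X[:, 0]); x_star = np.array([[0.3]])
k = lambda a, b: np.exp(-0.5 * (a[:, None, 0] - b[None, :, 0]) ** 2)
m, v = nngp_predict(k(X, X), k(x_star, X)[0], 1.0, Y)
print(m, v)
```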

2. Jacobian


Vanishing/Exploding Gradients

The optimization of a DNN requires derivatives with respect to its parameters. Since a DNN is a composition of functions, these derivatives are computed by the chain rule. The input-output Jacobian is defined as

J = \frac{\partial f_\theta(x)}{\partial x} = \frac{\partial h_L}{\partial x}

In the case of the MLP, we have

J = W_L D_{L-1} \dots W_2 D_1 W_1,

where

D_\ell = \frac{\partial x_\ell}{\partial h_\ell} = \mathrm{diag}( \varphi^\prime(h_{\ell,1}), \dots, \varphi^\prime(h_{\ell,n_\ell}))
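A minimal sketch that assembles this Jacobian for a tanh MLP with (approximately Haar) orthogonal weights and inspects its singular values; the width, depth, and \sigma_w are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
L, n, sigma_w = 5, 200, 1.0
# Square orthogonal weights (Haar up to column signs), scaled by sigma_w.
Ws = [sigma_w * np.linalg.qr(rng.normal(size=(n, n)))[0] for _ in range(L)]

x = rng.normal(size=n) / np.sqrt(n)
J = np.eye(n)
for ell, W in enumerate(Ws, start=1):
    h = W @ x                                   # pre-activation h_ell (biases omitted)
    J = W @ J                                   # multiply by W_ell
    if ell < L:
        J = np.diag(1.0 - np.tanh(h) ** 2) @ J  # multiply by D_ell = diag(phi'(h_ell))
        x = np.tanh(h)

sv = np.linalg.svd(J, compute_uv=False)
print(sv.min(), sv.max())                       # spread of the singular values of J
```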

Dynamical Isometry

A DNN is said to achieve dynamical isometry if the eigenvalue distribution of JJ^\top is concentrated around one. Dynamical isometry prevents exploding/vanishing gradients.

[Pennington+, AISTATS 2018; CH, CMP 2022] If we initialize the parameters with Haar orthogonal matrices and choose an appropriate activation function, then the DNN can achieve dynamical isometry.

Let \mu_L, \nu be the limit spectral distributions of JJ^\top and D^2 in the wide limit, respectively.

Under the assumption of the asymptotic freeness of Jacobians,

\mu_L = [(\sigma^2 \cdot )_* \nu ]^{\boxtimes L}

where \boxtimes is the free multiplicative convolution.

Distribution of D^2

[Figure: Pennington, Schoenholz, Ganguli, AISTATS 2018]


The Limit Spectral Distribution \mu_L of JJ^\top

[Figure: Pennington, Schoenholz, Ganguli, AISTATS 2018]


3. Neural Tangent Kernel


Neural Tangent Kernel

Under the continuous-time version of GD, the learning dynamics of the parameters are given by:

\frac{d\theta_t}{dt} = \eta (\nabla_\theta f_{\theta_t})^\top (y - f_{\theta_t})

(* The learning rate \eta is fixed.) Then the learning dynamics of the DNN output are given by:

\frac{df_{\theta}}{dt} = \eta_t \Theta_t(y-f_{\theta_t})

where

\Theta_t = \nabla_\theta f_{\theta_t}( \nabla_\theta f_{\theta_t})^\top

**Informal [Jacot+ NeurIPS2018, Lee+ NeurIPS2019]** Under the wide limit n \to \infty, the learning dynamics of the DNN are approximated by

\frac{df_{\theta}}{dt} = \eta \Theta(y-f_{\theta_t})

where the neural tangent kernel is defined as

\Theta := \lim_{n_2, \dots, n_{L-1} \to \infty} \Theta_0
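To make the definition concrete, here is a sketch of the empirical (finite-width) NTK \Theta_0 for a tiny two-layer network, with \partial f/\partial\theta computed by finite differences rather than backpropagation; the sizes are illustrative and the code is not meant to be efficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, n2 = 3, 16, 2              # illustrative sizes: input, hidden, output

def unpack(theta):
    W1 = theta[:n1 * n0].reshape(n1, n0)
    W2 = theta[n1 * n0:].reshape(n2, n1)
    return W1, W2

def f(theta, x):
    W1, W2 = unpack(theta)
    return W2 @ np.tanh(W1 @ x)     # f_theta(x), biases omitted

theta = rng.normal(size=n1 * n0 + n2 * n1) / np.sqrt(n1)
x = rng.normal(size=n0)

eps = 1e-6
grad = np.zeros((n2, theta.size))   # grad[j, k] ≈ ∂ f_j / ∂ θ_k (central differences)
for k in range(theta.size):
    d = np.zeros_like(theta); d[k] = eps
    grad[:, k] = (f(theta + d, x) - f(theta - d, x)) / (2 * eps)

Theta_0 = grad @ grad.T             # empirical one-sample NTK (n2 x n2)
print(np.linalg.eigvalsh(Theta_0))
```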

The Neural Tangent Kernel is a Surrogate Model of DNN+GD

Based on the NTK, we can do Bayesian estimation in the same way as with the NNGP. Moreover, with the NTK, we can simulate gradient descent at any step t for an ensemble of networks. [Figure from Google, "Fast and Easy Infinitely Wide Networks with Neural Tangents"]

Applicable to CNN/ResNet

Figure from [Google โ€œFast and Easy Infinitely Wide Networks with Neural Tangentsโ€]

Moreover, the NTK is applicable to attention: Infinite attention: NNGP and NTK for deep attention networks [https://arxiv.org/abs/2006.10540]

Eigenvalue Spectrum of NTK

"Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks", Z. Fan & Z. Wang, https://arxiv.org/abs/2005.11879. They treat the standard formulation (Gaussian initialization × multiple samples × small output dimension) and obtain a recurrence equation for the limit spectral distribution of the NTK. Figures: red lines are the theoretical prediction.


One-sample NTK

T.H. & R. Karakida, "The Spectrum of Fisher Information of Deep Networks Achieving Dynamical Isometry", https://arxiv.org/abs/2006.07814, in AISTATS 2020. When the DNN achieves dynamical isometry, the spectrum of the (one-sample × high-dimensional output) "NTK" concentrates around the maximal value, and the maximal value is O(L). (Sketch) Under an assumption of asymptotic freeness, we have the following recursive equations:

\Theta_{\ell+1} = q_\ell + W_{\ell+1} D_\ell \Theta_\ell D_\ell W_{\ell+1}^\top, \quad \mu_{\ell + 1} = (q_\ell + \sigma_{\ell+1}^2 \cdot)_* (\nu_\ell \boxtimes \mu_\ell)

NTK & Learning Rate

The spectrum (eigenvalues) of the NTK plays a vital role in tuning the learning dynamics. E.g. \eta > 1/ \lambda_\mathrm{max} (\Theta) \Longrightarrow the learning dynamics do not converge. E.g. the condition number c = \lambda_\mathrm{min}/ \lambda_\mathrm{max} determines the convergence speed.

Red line (the borderline of exploding gradients): this line is predicted by our theory!


G. Yang expanded NTK to a more general one: maximal update parametrization (muP)!


4. Asymptotic Freeness


Asymptotic Freeness and Free Probability Theory

Definition (Asymptotic freeness, C^*-version) [Voiculescu '85]. Let (A_j(n), A_j(n)^*)_{j \in J} be a family of n \times n random matrices and their adjoints. The family is said to be asymptotically free almost surely if there exist C^*-probability spaces (\mathfrak{A}_j, \tau_j)_{j \in J} and elements (a_j \in \mathfrak{A}_j)_{j \in J} such that for any Q \in \mathbb{C} \langle X_j, X_j^* \mid j \in J \rangle, the following holds:

limโกnโ†’โˆžtrn[Q(Aj(n),Aj(n)โˆ—โˆฃjโˆˆJ)]=(โˆ—jโˆˆJฯ„j)[Q(aj,ajโˆ—โˆฃjโˆˆJ)],\lim_{n \to \infty} \mathrm{tr}_n \left[Q(A_j(n), A_j(n)^* \mid j \in J ) \right] \\ = (*_{j \in J} \tau_j) \left[ Q(a_j, a_j^* \mid j \in J) \right],

where *_{j \in J} \tau_j is the free product of the tracial states.

Example

For N \in \N, let

  • W(N) be a Ginibre or Haar orthogonal random matrix,
  • D(N) be a constant diagonal matrix with a limit distribution as N \to \infty.

Then (W, W^*) and D are a.s. asymptotically free as N \to \infty (see the numerical sketch below).
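A quick numerical illustration of this example (sizes illustrative): for free, centered a and b one has \mathrm{tr}(abab) \to 0, and the Haar conjugation is exactly what produces this.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000

# Haar orthogonal O via QR with the diagonal sign fix.
Q, R = np.linalg.qr(rng.normal(size=(N, N)))
O = Q * np.sign(np.diag(R))

d1 = np.repeat([1.0, -1.0], N // 2)            # two trace-zero diagonal spectra
d2 = np.tile([1.0, -1.0], N // 2)

A = (O * d1) @ O.T                             # a = O D_1 O^T, same spectrum as D_1
B = np.diag(d2)                                # b = D_2

print(np.trace(A @ B @ A @ B) / N)             # ≈ 0: the freeness prediction tr(abab) for centered a, b
print(np.sum(d1 * d2 * d1 * d2) / N)           # = 1: without Haar conjugation, tr(D_1 D_2 D_1 D_2) / N
```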

Asymptotic Freeness of Jacobians

Let W_\ell (\ell=1,2,\dots, L) be the weight matrices of the MLP, taken to be scaled Haar orthogonal random matrices, and D_\ell the diagonal activation-derivative matrices above. (The Gaussian case is treated by [B. Hanin and M. Nica], [L. Pastur], [G. Yang].) Theorem [CH22]. Assume that D_1, \dots, D_{L-1} have limit joint moments. Then

(W_1, W_1^\top), \dots, (W_L, W_L^\top), (D_1, \dots, D_{L-1})

are asymptotically free as n \to \infty almost surely.

Difficulty: the entries of D_\ell and W_\ell are not independent, since D_\ell = \mathrm{diag}(\varphi^\prime(h_\ell)) and h_\ell = W_\ell x_{\ell-1}.

(Sketch of Proof) Invariance of the MLP + taking a submatrix. Construct an orthogonal matrix U_\ell fixing x_{\ell}, i.e.

U_\ell x_\ell = x_\ell,

and

U_\ell |_{(\mathbb{R}x_\ell)^\bot} \sim (N-1) \times (N-1) \text{\ Haar Orthogonal}

with

(U_0, \dots, U_\ell) \perp\!\!\!\!\perp (W_{\ell+1}, \dots, W_L)

for \ell=0,\dots, L-1. Then we only need to show the asymptotic freeness of the (N-1) \times (N-1) submatrices of

(U_0W_1, \dots, U_{L-1}W_L), \quad (D_1, \dots, D_{L-1}).

5. Effective Expression of MLP-Mixer

Understanding Wideness in Practical Models!

Preliminaries

MLP

MLP (multilayer-perceptron) is a composition of transforms in the form of

y = \phi( Wx), \quad x \in \R^m

where W is a parameter matrix (the transforms do not share parameter matrices).

Static mask: consider a matrix M with entries 0 or 1 and replace W by M \odot W:

y = \phi( (M \odot W) x )

MLP-Mixer

[Tolstikhin et al., NeurIPS 2021]

The architecture is less structured than Convolutional Neural Networks or Vision Transformers.

Blocks of MLP-Mixer:

\text{Token-MLP}(X) = W_2 \phi(W_1 X), \quad \text{Channel-MLP}(X)= \phi( X W_3) W_4

where W_1 \in \mathbb{R}^{\gamma S\times S}, W_2 \in \mathbb{R}^{S \times \gamma S}, W_3 \in \mathbb{R}^{C\times \gamma C}, W_4 \in \mathbb{R}^{\gamma C\times C}.

Similarity between MLP-Mixer and MLP via vectorization

Vectorization and effective width

We represent the vectorization of an S \times C matrix X by \text{vec}(X); more precisely,

(\text{vec}(X))_{(j-1)S + i} = X_{ij}, \quad (i = 1, \dots, S,\ j= 1, \dots, C).

In other words, vec is the map

\mathrm{vec}: M_{S,C}(\R) \to L^2(M_{S,C}(\R)).

We also define an inverse operation to recover the matrix representation.
There is a well-known identity relating the vectorization operation and the tensor (or Kronecker) product, denoted by \otimes:

\text{vec}(W X V) = (V^\top \otimes W) \text{vec}(X),

for W \in \mathbb{R}^{S \times S} and V \in \mathbb{R}^{C \times C}.
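A quick numerical check of this identity with column-major vectorization (sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
S, C = 4, 3
X = rng.normal(size=(S, C))
W = rng.normal(size=(S, S))
V = rng.normal(size=(C, C))

vec = lambda M: M.flatten(order="F")           # column-major vec: (vec X)_{(j-1)S+i} = X_{ij}
lhs = vec(W @ X @ V)
rhs = np.kron(V.T, W) @ vec(X)
print(np.allclose(lhs, rhs))                   # True
```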

As discussed later, the aforementioned equation corresponds to the vectorization of an MLP-Mixer block with a linear activation function.
The vectorization of the feature matrix W X V is equivalent to a fully connected layer of width

m=SC

with a weight matrix V^\top \otimes W. We refer to this m as the *effective width* of mixing layers.

Under vectorization of feature matrices

The Channel-Mixing layer is converted into:

\mathrm{vec}(X) \mapsto (I_C \otimes W) \mathrm{vec}(X),

The Token-Mixing layer is converted into:

\mathrm{vec}(X) \mapsto (V^\top \otimes I_S) \mathrm{vec}(X),

In MLP-Mixer, when we treat each S \times C feature matrix X as an SC-dimensional vector \mathrm{vec}(X), right multiplication by a C \times C weight V and left multiplication by an S \times S weight W are represented as

\mathrm{vec}(X) \mapsto (I_C \otimes W)\mathrm{vec}(X), \ \ \mathrm{vec}(X) \mapsto (V^\top \otimes I_S ) \mathrm{vec}(X).

This expression clarifies that the mixing layers work as an MLP whose weight matrices have a special tensor-product structure. As usual,

S, C \sim 10^2 \text{\ to \ }10^3

so the Mixer is equivalent to an extremely wide MLP with

m= SC=10^4 \text{\ to \ } 10^6

Moreover, the ratio of non-zero entries in the weight matrix I_C \otimes W is 1/C and that of V^\top \otimes I_S is 1/S:

\text{ratio of non-zero entries in } I_C \otimes W = 1/C, \quad \text{ratio of non-zero entries in } V^\top \otimes I_S = 1/S

e.g. block-matrix representation:

I_C \otimes W = \begin{pmatrix} W & 0 & \cdots & 0 \\ 0 & W & \cdots & 0 \\ \vdots & & \ddots & \vdots\\ 0 & 0 & \cdots & W \end{pmatrix}


Therefore, the weight of the effective MLP is highly sparse.

Commutation Matrix

Furthermore, to consider only the left multiplication of weights, we introduce commutation matrices:

A commutation matrix J_C is defined by

J_C \mathrm{vec}(X) = \mathrm{vec}(X^\top)

where X is an S \times C matrix. Note that for any entry-wise function \phi,

J_C \phi (x) = \phi (J_C x), \quad x \in \R^m

Note that

V^\top \otimes I_S = J_C^\top (I_S \otimes V)J_C.
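A sketch that constructs J_C explicitly for the column-major vec used here and checks its defining property and its commutation with entry-wise functions (sizes illustrative):

```python
import numpy as np

def commutation_matrix(S, C):
    """J_C with J_C vec(X) = vec(X^T) for S x C matrices (column-major vec)."""
    J = np.zeros((S * C, S * C))
    for i in range(S):
        for j in range(C):
            # vec(X)[j*S + i] = X[i, j]  and  vec(X^T)[i*C + j] = X[i, j]
            J[i * C + j, j * S + i] = 1.0
    return J

rng = np.random.default_rng(0)
S, C = 4, 3
X = rng.normal(size=(S, C))
J = commutation_matrix(S, C)
vec = lambda M: M.flatten(order="F")

print(np.allclose(J @ vec(X), vec(X.T)))                      # defining property: True
print(np.allclose(J @ np.tanh(vec(X)), np.tanh(J @ vec(X))))  # a permutation commutes with entrywise phi
```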

Effective Expression of MLP-Mixer: Channel-MLP Block:

u= \phi (J_C (I_C \otimes W_2) \phi((I_C \otimes W_1) x)),

Token-MLP Block:

y= \phi(J_C^\top ( I_S \otimes W^\top_4 ) \phi(( I_S \otimes W^\top_3 ) u) )

MLP with static-mask

Static mask: consider a matrix M with entries 0 or 1 and replace W in each layer of the MLP by M \odot W:

y = \phi( (M \odot W) x ), \quad M\sim \mathrm{Bernoulli}(p)

  • The mask matrix M is fixed during training (a minimal sketch follows below).
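A minimal sketch of one static-mask layer; the layer sizes and the choice p = 1/64 (to mimic the 1/C sparsity of I_C \otimes W with C = 64) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in, p = 256, 256, 1.0 / 64                   # p mimics the 1/C sparsity for C = 64

M = (rng.random((n_out, n_in)) < p).astype(float)     # 0/1 mask, sampled once and then frozen
W = rng.normal(size=(n_out, n_in)) / np.sqrt(p * n_in)

def masked_layer(x, phi=np.tanh):
    # Only M ⊙ W is used; M stays fixed while W would be trained.
    return phi((M * W) @ x)

y = masked_layer(rng.normal(size=n_in))
print(M.mean(), y.shape)                              # fraction of retained weights ≈ p
```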

Hidden features and test accuracy

To validate the similarity of networks in a robust and scalable way, we look at the similarity of hidden features of MLPs with sparse weights and MLP-Mixers based on the centered kernel alignment (CKA) [Nguyen T., Raghu M., Kornblith S.].

\text{CKA}_\text{minibatch}(X,Y) = { k^{-1}\sum_i \text{HSIC}_1(X_i X_i^\top, Y_iY_i^\top)\over \sqrt{k^{-1}\sum_i \text{HSIC}_1(X_iX_i^\top,X_iX_i^\top)}\sqrt{k^{-1}\sum_i \text{HSIC}_1(Y_iY_i^\top,Y_iY_i^\top)}}

In practice, we computed the mini-batch CKA [Section 3.1, Nguyen et al. 2021] among the features of trained networks.
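A minimal sketch of minibatch CKA with the unbiased HSIC_1 estimator, run here on synthetic activations; the shapes and toy data are illustrative, whereas in the experiments below the inputs are activations of trained networks.

```python
import numpy as np

def hsic1(K, L):
    """Unbiased HSIC_1 estimator applied to two Gram matrices."""
    n = K.shape[0]
    K = K - np.diag(np.diag(K))                 # zero the diagonals
    L = L - np.diag(np.diag(L))
    one = np.ones(n)
    t1 = np.trace(K @ L)
    t2 = (one @ K @ one) * (one @ L @ one) / ((n - 1) * (n - 2))
    t3 = 2.0 / (n - 2) * (one @ K @ L @ one)
    return (t1 + t2 - t3) / (n * (n - 3))

def minibatch_cka(Xs, Ys):
    """Minibatch CKA over lists of (batch x feature) activation matrices."""
    num = np.mean([hsic1(X @ X.T, Y @ Y.T) for X, Y in zip(Xs, Ys)])
    dx = np.mean([hsic1(X @ X.T, X @ X.T) for X in Xs])
    dy = np.mean([hsic1(Y @ Y.T, Y @ Y.T) for Y in Ys])
    return num / np.sqrt(dx * dy)

# Toy data: the second network's features are a random linear image of the first's.
rng = np.random.default_rng(0)
Xs = [rng.normal(size=(64, 128)) for _ in range(10)]
Ys = [X @ rng.normal(size=(128, 128)) for X in Xs]
print(minibatch_cka(Xs, Ys))
```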

Each experiment is done on CIFAR-10 with four random seeds. (a) Average of diagonal entries of CKA between a trained MLP-Mixer (S=C=64, 32) and an MLP with a static mask, for different sparsity p (= ratio of 1s in the mask entries); sparser MLPs were more similar. (b) CKA between the MLP-Mixer (S=C=64) and the MLP with the corresponding sparsity 1/64, and (c) CKA between the Mixer and a dense MLP. (d) Test accuracy of MLPs with sparse weights and MLP-Mixers of different widths under \Omega=2^{19}.

6. Alternative to MLP with static masks