
SciPost Submission Page

From complex to simple: hierarchical free-energy landscape renormalized in deep neural networks

by Hajime Yoshino


Submission summary

Authors (as registered SciPost users): Hajime Yoshino
Submission information
Preprint Link: https://arxiv.org/abs/1910.09918v3  (pdf)
Date accepted: 2020-04-07
Date submitted: 2020-03-20 01:00
Submitted by: Yoshino, Hajime
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • Artificial Intelligence
  • Statistical and Soft Matter Physics
Approaches: Theoretical, Computational

Abstract

We develop a statistical mechanical approach based on the replica method to study the design space of deep and wide neural networks constrained to meet a large number of training data. Specifically, we analyze the configuration space of the synaptic weights and neurons in the hidden layers in a simple feed-forward perceptron network for two scenarios: a setting with random inputs/outputs and a teacher-student setting. By increasing the strength of constraints, i.e. increasing the number of training data, successive 2nd-order glass transitions (random inputs/outputs) or 2nd-order crystalline transitions (teacher-student setting) take place layer by layer, starting next to the inputs/outputs boundaries and going deeper into the bulk, with the thickness of the solid phase growing logarithmically with the data size. This implies that the typical storage capacity of the network grows exponentially fast with the depth. In a deep enough network, the central part remains in the liquid phase. We argue that in systems of finite width $N$, a weak bias field can remain in the center and plays the role of a symmetry-breaking field that connects the opposite sides of the system. The successive glass transitions bring about a hierarchical free-energy landscape with ultrametricity, which evolves in space: it is most complex close to the boundaries but becomes renormalized into progressively simpler ones in deeper layers. These observations provide clues to understanding why deep neural networks operate efficiently. Finally, we present some numerical simulations of learning which reveal spatially heterogeneous glassy dynamics truncated by a finite-width $N$ effect.
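(A schematic way to read the capacity claim above, using a placeholder constant $c$ rather than the paper's: if the thickness of the solid phase on each side grows as $\ell(M) \simeq c \ln(M/N)$, the two solid regions meet in the center when $2\ell(M) \simeq L$, giving a typical capacity $M_c \sim N\, e^{L/(2c)}$, i.e. exponential growth with the depth $L$.)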

Author comments upon resubmission

I would like to thank the referees for their great effort and patience in carefully reading the 2nd version of the manuscript. The comments are very useful. The following are my responses. I also include the list of changes at the end.

(Note) In the following, [A] marks my responses to the comments. In my responses, I refer to the equation numbers Eq. (...), figure numbers, section numbers, and page numbers of the revised version.


Report 1

[Strengths]

  1. The theoretical computation is remarkable and, as far as I can tell, novel
  2. Potentially very interesting and useful results
  3. Well organized and written

[A] Thank you very much for the very positive response.


[Weaknesses]

  1. I believe that the results are approximate (as was actually claimed in the previous version) rather than exact/quasi-annealed (as claimed in this version). Given that, I'll repeat my previous comment: The approximation used to derive the results might prove too crude (it's hard to tell). I have some strong doubts in particular about the teacher-student case.

    [A] I realized that the theory is a truly mean-field 'approximation' for the problem: a tree-approximation is made for the inter-layer fluctuations. Now it is clear that the 'left-right symmetry' is an accidental symmetry due to the tree-approximation. Loop corrections will remove this. Although how other aspects of the results change when loop corrections are taken into account is an open question, I believe the other essential features remain.


  2. The settings of some of the numerical experiments (in the teacher-student scenario) are rather far from the theoretical analysis. Additionally, when the setting is pushed a little more in the direction of the theory (increasing $N$), the results go in the opposite direction from the one the theory suggests (the generalization capability drops).

    [A] I agree with the referee. I decided to remove the presentation of the numerical simulations of the teacher-student setting from the present paper, leaving it for future work.


[Report]

The author has amended and expanded the paper and in doing so he has fixed most of the issues that I had raised in my previous review.

The new version, however, also includes some stronger claims with respect to the previous one, as a result of a new computation (eq. 71) which shows that some terms of order higher than 2 become negligible in the bulk of the network. Based on this result, the author has removed the previous description of the computation as a "Gaussian approximation", claiming instead that the result is an "exact replica calculation" except at the boundaries, where the calculation is not quenched but "slightly annealed". Since some features of the results would be surprising (mainly, the symmetry of the solution to an asymmetric model), the author presents some qualitative arguments and some analogies with other systems (e.g. hard spheres) in support of this claim.

I'm not convinced that this claim is correct. In brief, the procedure detailed in section A.1.4, called "Plefka expansion", is only exact in the limit of weak interactions (small $\lambda$); otherwise eq. 53 does not hold. But in that section it is written: "The parameter $\lambda$, which is introduced to organize a perturbation theory, is put back to $\lambda=1$ in the end." This is a weak-interactions approximation. It seems to me that this procedure is crucial to derive the results of the paper, in particular eq. 62, which allows one to derive eqs. 63-64, which in turn allow one to perform the necessary simplifications in eq. 71, which is the basis of the new, stronger claim. Indeed, the author writes that eq. 62 corresponds to eq. 52, which is valid in the absence of interactions.

(I should say at this point that I have some additional comments about the clarity of the derivation in the appendix; it's possible that I misunderstood something, see below.)

[A] I understand the concern of the referee. As I noted above, the theory is actually a tree-approximation that neglects some loop-corrections needed to describe the one-dimensional inter-layer fluctuations. In the revised text, I explain this point in detail.
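For reference, the generic structure of the expansion at issue (schematic only, not the paper's exact expressions) is
$$-\beta F(\lambda) = -\beta F(0) + \lambda\, \partial_\lambda(-\beta F)\big|_{\lambda=0} + \frac{\lambda^2}{2}\, \partial_\lambda^2(-\beta F)\big|_{\lambda=0} + \dots,$$
truncated at $O(\lambda)$ before setting $\lambda = 1$. For some fully-connected mean-field models the terms beyond $O(\lambda)$ vanish in the thermodynamic limit, so the truncation is exact there; in the present layered problem the truncation is precisely the tree-approximation.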


From the qualitative point of view, the author argues that the surprising symmetry of the result stems from the infinite $N$ limit, but I fail to see how that would happen here. For example, one argument is that "The system is locally asymmetric (at the scale of each layer) but globally homogeneous except at the boundaries." but I don't see this supposed homogeneity; the expression is very clearly still asymmetric. I don't dispute that an asymmetry like this may become irrelevant in some limit for some models, I'm just not convinced that this is the case here, at least not exactly. The "entropic argument" given at p. 14 is not unreasonable, but it's qualitative and an a-posteriori explanation; certainly it's not sufficient by itself to justify exact symmetry. (On the other hand, I still think that the results are relevant and probably qualitatively valid, and if so the entropic argument gives a plausible mechanism.)

[A] I fully understand the concern of the referee. As I noted above, the left-right symmetry is just an accidental one at the level of the tree-approximation. Apparently, loop corrections should induce asymmetry, especially close to the boundaries, where more microscopic information is needed for an accurate description. This is explained in the new version.


The generalization result for the teacher-student scenario is still the most problematic in my opinion, as I still have a very hard time believing that near-perfect generalization is possible with a liquid phase in the middle of the network, just from "combinatorics-like" arguments (different networks should correspond to different N-bits-to-N-bits mappings, once permutation symmetry is accounted for). I had previously argued that, if one seeks an explicitly symmetric solution, imposing an overlap 1 at the initial end would obviously result in an overlap 1 at the other end. Now the author argues, from eq. 71, that the symmetry in the solution emerges spontaneously in the large $N$ limit rather than being enforced a priori by an approximation. Since, as I wrote above, I'm not convinced that that's the case, I maintain my previous opinion. One way to check if I'm right could be to test the equations at smaller values of $\alpha$: if the output overlap is always 1, implying perfect generalization for any size of the training set, which would be clearly absurd (one expects to see something like the single teacher-student perceptron with continuous weights, where the error decreases with $\alpha$), then the result is an artifact of the approximation.

[A] I fully understand this concern. In the revised version I keep the presentation of the replica analysis of training but have extensively revised the discussion. The main problem is how to replace the fictitious symmetry-breaking field used in the theory by some real one. In the revised version I present a discussion of 'unlearning' (instead of learning).


Relatedly, the results of the simulations in the teacher-student case use extremely small values of $N$ and are thoroughly unconvincing. I have reimplemented the procedure described in the paper, and with the same parameters $(N=4, L=6, \alpha=1,2,4)$, I obtain precisely the same results for the generalization overlap that are reported in fig. 17, panel c, right plot. However, increasing $N$ makes all the overlaps drop drastically. As an example of parameter settings that I could test quickly, for $L=10$ and $\alpha=1$, I obtain the following values for $N=4,8,12$: $0.65, 0.21, 0.09$. I see similar results with $L=20$. I strongly suspect that the limit is $0$ for large $N$. In fig. 19, where some analysis of the dependence of the results on $N$ is shown, the overlap for the output layer (the one which actually correlates with generalization) is not shown in panel c, as far as I can tell. I wonder why. In any case, the values of $N$ used are 3, 4, 5, which makes it really hard to observe any effect.

If I'm right, I think that the whole teacher-student part should be either heavily revisited or left out entirely for a more in-depth study.

I have some additional issues that I'll leave for the "Requested changes" section.

[A] I truly thank the referee for doing independent simulations and also for the comment of 2020-02-28. I agree that the effect becomes weaker everywhere, including at the output, as I reported in my comments of 2020-02-28 and 2020-03-01. I decided to remove this part from the present paper, leaving it for future, more extensive numerical work.
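For concreteness, here is a minimal sketch of the kind of generalization-overlap measurement discussed above. Everything in it is an illustrative assumption (sign activations, Gaussian weights, and a student that is just a perturbed copy of the teacher rather than one trained by the paper's Monte Carlo dynamics); it only shows which quantity is compared between teacher and student.

    import numpy as np

    rng = np.random.default_rng(0)

    def forward(weights, x):
        # propagate through a feed-forward net with sign activations
        for W in weights:
            x = np.sign(W @ x / np.sqrt(len(x)))
        return x

    N, L = 4, 6  # width and depth, as in the referee's test
    teacher = [rng.standard_normal((N, N)) for _ in range(L)]
    # hypothetical student: a perturbed teacher (a real test would train it
    # on M = alpha * N patterns with some learning dynamics)
    student = [W + 0.1 * rng.standard_normal((N, N)) for W in teacher]

    # generalization overlap: average output agreement on fresh test patterns
    M_test = 10000
    overlap = np.mean([np.mean(forward(teacher, x) * forward(student, x))
                       for x in np.sign(rng.standard_normal((M_test, N)))])
    print(f"output overlap ~ {overlap:.3f}")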


[Requested changes]

The big ones, if I'm right, are to:

  1. Revert to describing the results as approximate;

    [A] Yes. Done.


  2. Revisit entirely the teacher-student generalization issue and corresponding simulations (possibly, removing the whole section and leaving it for a follow-up more in-depth study, ideally accompanied by convincing numerical evidence).

    [A] I revised the theoretical part on the teacher-student setting extensively and removed the numerical part.


Apart from that, the following notes are mostly about typos or very minor issues, but there are a few more serious issues too.

  3. p.10 eq. 12: I think there has been a mix-up here, the $\eta$ integral doesn't belong here, it's introduced by the Fourier transform in eq. 38

    [A] Right. I fixed this.


  4. p.12: "Maybe one would not easily expect such asymmetry in the feed-forward network" I think the author meant "symmetry" here.

    [A] This sentence is removed.


  5. p.27 eq. 34: The update move was chosen in such a way as to ensure that on average the spherical constraint on the weights is satisfied. However, due to the random-walk nature of the update, the averages may "diffuse" over time. This can have an effect on the autocorrelations, eq. 35, especially since $N=20$ is being used, which is fairly small and might not allow one to neglect effects of order $\sqrt{N}$. It might be necessary when computing the autocorrelations to rescale by a factor $\|J(t)\|$ (I'm assuming $J(0)$ is normalized; otherwise it should be too). A quick check should be sufficient to detect whether this is an issue or not.

    [A] The normalization (rescaling) is always done. I also checked the effect of "diffusion" as suggested by the referee and confirmed that it does not affect the results.
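    A minimal sketch of this check (with an assumed random-walk move, not necessarily the exact update of the paper): the move lets $\|J\|$ drift, so $J$ is rescaled back onto the sphere $\|J\|^2 = N$ and the autocorrelation is normalized explicitly.

        import numpy as np

        rng = np.random.default_rng(1)
        N, steps, delta = 20, 10000, 0.05

        J = rng.standard_normal(N)
        J *= np.sqrt(N) / np.linalg.norm(J)  # start on the sphere ||J||^2 = N
        J0 = J.copy()

        for t in range(steps):
            J = J + delta * rng.standard_normal(N)  # random-walk move
            J *= np.sqrt(N) / np.linalg.norm(J)     # rescale onto the sphere

        # normalized autocorrelation, insensitive to residual norm drift
        C = J @ J0 / (np.linalg.norm(J) * np.linalg.norm(J0))
        print(f"C = {C:.3f}")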


  6. p.28: "Note that the behavior of the system is not symmetric to the exchange of input and output sides. We expect that this is a finite $N$ effect which disappears in $N \to \infty$ limit." I don't expect that, see comments in Report. Providing evidence with varying $N$ might be a good idea here.

    [A] I changed this. The symmetry in the theory is an accidental one due to the tree-approximation as noted above.


  7. p.29: It seems like the same $M$ was used for training and testing, but there is no reason for limiting the number of testing patterns: the larger the $M$ used for testing, the more accurate the measurement.

    [A] I agree. I will do so in the future. For the present paper I removed this simulation.


  8. p.39-43, figs. 15-19: The figures are very far from the portion of the text where the simulation results are presented and discussed, making it rather hard to follow. The caption of fig. 17 at p. 41 goes beyond the end of the page and it's partially unreadable (on the other hand reducing the figure size would make it hardly readable because of poor resolution, so the author should consider finding a way to improve the figure resolution).

    [A] I revised the figures, improving the resolution. For Fig. 15 (which is Fig. 16 in the new version), I removed the three panels on the right-hand side. I created a new figure, Fig. 17, for the plots with logarithmic time scales, where I show finite-width $N$ effects.


  9. p.44 eq.38: There should be an $a$ index in the products of the $J$ and $S$ traces in the second line. Also is $\xi_{\mu \nu}$ a Kronecker delta symbol? Otherwise I can't make sense of this expression and the sudden appearance of another pattern index $\nu$. (And even if my interpretation is correct, the purpose of doing this is very unclear anyway.)

    [A] Fixed.


  10. p.44 (below eq. 39): "free-energy" repeated twice.

    [A] Fixed.


  11. p.44: "excplitely" -> "explicitly"

    [A] Done.


  12. p.46 eq.53: [Mentioned above] This equation is only valid for small $\lambda$. If $\lambda$ is then "put back to 1", the result is an approximation.

    [A] For some systems (mean-field models) the expansion stops exactly at $O(\lambda)$ and one obtains exact results. However, in the present paper, we have to invoke the tree-approximation to stop at $O(\lambda)$. I added these remarks below Eq. (55) in the new version.


  13. p.47 eq. 55: The $\epsilon$ and $\varepsilon$ terms should multiply the second term inside the exponent too (the sums).

    [A] Fixed.


  14. p.47: "assuming $c \gg 1$" this is a leftover from the previous version, it should be $N$.

    [A] Done.


  15. p.47 eq. 59: There are some superscripts $\mu$ which don't make sense there, I guess they should be $i$ (and to follow previous formulas they should go inside parentheses, with the replica indices outside).

    [A] Fixed.


  16. p.47 eq. 61: The replica and pattern indices are swapped compared to the previous formulas.

    [A] Fixed.


  17. p.47: The expressions in eqs. 56-61 are not clear, one has to guess a few things in order to follow. Mainly, that the dots on the l.h.s. of 56,57 seem to refer to some expression that can be factorized over the indices $i$ or $\mu$ (depending on the formula) and that the corresponding dots on the right hand sides are the factorized expressions (which however still depend on the index $i/\mu$?).

    [A] Yes. I put an explanation below Eq. (59).

    Also since the $\blacksquare$ index has disappeared it seems like the assumption is that the dotted expressions do not depend on those.

    [A] I put $\blacksquare$ in Eq. (58),(59).

    Moreover the index $i$ was dropped from the expressions in eqs. 58-59, but there is a spherical constraint on the $J$s (eq. 65) so that the trace over $J^{c}$ is actually not well defined here and the reader has to guess what's actually going on. At least defining the trace operators like in ref. [20] might help a bit.

    [A] I put an integral expression of the spherical constraint in (10). Then, below (60) and (61), I explain that I included $\epsilon_{aa}$ in (60) and (61) to incorporate the integral that enforces the spherical constraint.
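    (For orientation, such an integral representation has the schematic form
    $$\delta\Big(\sum_{i=1}^{N} (J_i)^2 - N\Big) \propto \int d\hat{\epsilon}\, \exp\Big[-\frac{\hat{\epsilon}}{2}\Big(\sum_{i=1}^{N} (J_i)^2 - N\Big)\Big],$$
    with the $\hat{\epsilon}$ integral running along the imaginary axis; conventions and normalizations may differ from Eq. (10) of the paper.)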

    Additionally, the expression in the last line of 38 is not actually factorized over the $\blacksquare$, of course, so clarifying the assumptions used here (and how they relate to the $0$-th order expansion of the previous section) is quite important, I think. That expression becomes "tractable" after the expansion of eq. 71, but only because it uses the simplifications allowed by eqs. 63-64, which were derived under the interaction-free assumption...

    [A] I understand. I think it is now better.


  18. p.47 eq. 62: "where $\epsilon^{∗}_{ab}=[...]$ are determined by": again, it should be clarified here that these expressions come from the non-interacting assumption. In terms of the previous expression, it is like neglecting the dotted expressions, right? Otherwise the saddle points on the epsilons would involve those as well. This is basically said in the following paragraph, where it's pointed out that they correspond to eq. 52, but I think that making it more explicit before the expressions would clarify the derivation considerably.

    [A] I understand. I put the explanation just below Eq. (64).


  19. p.48 eq. 65: $c \to N$

    [A] Right. I removed this and added a comment on the spherical constraint below Eq. (61).


  20. p.48 eq. 67: The $S$ should be lowercase.

    [A] Fixed.


  21. p.48 in the unnumbered equation between 68 and 69: The $c$ should be a superscript of $S$ instead of a subscript.

    [A] Fixed.

    Also, $\epsilon_{aa}$ was never defined before, and it uses the wrong character ($\epsilon$ instead of $\varepsilon$); probably just writing beforehand that it is assumed to be 0 would be enough.

    [A] Fixed. $\epsilon_{aa}=0$ is put below Eq. (63).


  22. p.49: This section should probably refer to eq. 38 (the replicated one) rather than eq. 12

    [A] Done.


  23. p.50 eq.72: There's a factor 2 missing in front of the cosh resulting in a missing constant term in the following two lines; it's irrelevant, but still...

    [A] Right. Anyway, this small section A.3.2 is not needed anymore, so I removed it. Within the tree-approximation the boundary conditions just amount to specifying $q_{ab}(0)$ and $q_{ab}(L)$.


  24. p.50: [This one is more important than the others] The difficulty arising when the $J$ are considered quenched is mentioned. Isn't this precisely the teacher-student situation (at least for what concerns the teacher)?

    [A] Right. Anyway, within the tree-approximation, averaging over $J_{ij}$ is not a problem. Section A.3.2 is removed.

  25. p.51 and following: In all expressions with $S_{\rm ent}$, the $S$ should be lowercase for consistency with the previous sections.

    [A] Done.

  26. p.57 eq.115: a closing parenthesis is missing

    [A] Fixed.

  27. p.57 eq.123: the prime sign after the parenthesis in the numerator is a peculiar notation that should be defined.

    [A] I changed it to the $d/dh$ operator.


Referee 2

[Report]

The author has made an effort to answer my comments and those of the other referee.

[A] I thank the referee for the positive evaluation of the revisions.


I still disagree with the author about the scaling of the limit of capacity of the network, and though at this point I do not consider this an obstacle to publication, I invite the author to give it a second thought and to look at the recent and old literature on this point.

[A] I understand that there might be other possible scalings. One difficulty is that I cannot find relevant literature on the capacity of DNNs. In the revised version I put a remark in the text (below Eq. (5)) that there are some other possible scalings, which involve $N$ differently in some two-layer problems. On the other hand, the numerical simulation of the learning dynamics suggests that taking the $N \to \infty$ limit with fixed $\alpha=M/N$ is indeed reasonable (see Fig. 17, which is added in the new version).


I personally find the notation in which factor nodes are denoted by a black square (instead of a letter) very heavy, but I leave to the author the choice to change or keep it.

[A] I understand the notation is a bit unusual. But there are simply too many indices in this problem, so I decided to also use the graphical symbol, which looks like a label for a factor node.

List of changes


1. Introduction

In the 2nd paragraph, the 2nd sentence is introduced to mention that we consider the global coupling between layers. In the same paragraph, I mention that the present problem is a $1+\infty$-dimensional one.

In the last sentence of the 7th paragraph, I mention and briefly explain the tree-approximation.

2. Model

In the paragraph below Eq. (11), I mention that in a DNN homogeneity is expected in the bulk but inhomogeneity is expected close to the boundaries.

Fig. 2 is added to explain the loop correction. In the paragraph below it, I explain that within the tree-approximation the inhomogeneity close to the boundaries cannot be taken into account accurately.

3.5 Teacher-student setting

The discussion is revised extensively. The last 3 paragraphs are new.

4. Simulation of learning

The presentation of the teacher-student setting is removed.

The text of the presentation on the random input/output setting is revised, adding discussions of finite-width $N$ effects and of the crossover from glassy dynamics to rapid decay.

Fig. 10 is revised, removing some panels and increasing the resolution.

A new figure, Fig. 17, is added to explain finite-width $N$ effects.

Fig. 18 is also revised, adding data with various widths $N$.

I noticed that the quantities presented here are the overlap between two (real) replicas instead of time auto-correlation functions (the behavior is very similar). I corrected the text concerning this.
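Schematically (with normalizations possibly differing from the paper's), the two quantities are
$$q(t) = \frac{1}{N}\sum_i J_i^{(1)}(t)\, J_i^{(2)}(t) \qquad \text{vs.} \qquad C(t,t_w) = \frac{1}{N}\sum_i J_i(t)\, J_i(t_w),$$
i.e. the overlap compares two independently evolving (real) replicas at equal times, while the auto-correlation compares a single trajectory with itself at two different times.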

5. Conclusion and outlook

In the 3rd paragraph, it is mentioned again that the tree-approximation is a weak point of our theory.

In the 4th paragraph, it is emphasized that the $\alpha=M/N$ scaling is supported by the numerical simulation.

Appendix

Sec. A.1.4: The last paragraph is added to mention some exact results and the limitation of the present work due to the tree-approximation.

Sec. A.3: It is explained where the tree-approximation is needed.

Published as SciPost Phys. Core 2, 005 (2020)
