SciPost Submission Page
The edge of chaos: quantum field theory and deep neural networks
by Kevin T. Grosvenor and Ro Jefferson
Authors (as Contributors): Ro Jefferson
Date submitted: 2022-01-03 21:22
Submitted by: Jefferson, Ro
Submitted to: SciPost Physics
We explicitly construct the quantum field theory corresponding to a general class of deep neural networks encompassing both recurrent and feedforward architectures. We first consider the mean-field theory (MFT) obtained as the leading saddlepoint in the action, and derive the condition for criticality via the largest Lyapunov exponent. We then compute the loop corrections to the correlation function in a perturbative expansion in the ratio of depth T to width N, and find a precise analogy with the well-studied O(N) vector model, in which the variance of the weight initializations plays the role of the 't Hooft coupling. In particular, we compute both the O(1) corrections quantifying fluctuations from typicality in the ensemble of networks, and the subleading O(T/N) corrections due to finite-width effects. These provide corrections to the correlation length that controls the depth to which information can propagate through the network, and thereby sets the scale at which such networks are trainable by gradient descent. Our analysis provides a first-principles approach to the rapidly emerging NN-QFT correspondence, and opens several interesting avenues to the study of criticality in deep neural networks.
Published as SciPost Phys. 12, 081 (2022)
Author comments upon resubmission
We thank both referees for their careful reviews, and for raising several clarifying points which we have addressed below.
Response to Report 1 by Dr. Bryan Zaldivar:
1A. We see how these two statements could appear in tension, but they are not actually in contradiction: while there may exist networks (that is, a particular state or set of states for a given infinite-width network) that can approximate any given function (per the referenced theorem), there are many more states that will not. The statement about non-evolving representations is essentially that if you don’t happen to start with the correct set of weights and biases that approximates the given function, you’ll never evolve to it. In other words, the referenced theorem is an abstract statement that infinite-width networks are universal approximators, but says nothing about the learning dynamics in practice; the statement of  is that these dynamics are in fact trivial (specifically, that the neural tangent kernel does not evolve). This issue is first mentioned on page 8 of , and discussed in more detail in sections 6.3.3, 10.1.2, and chapter 11 therein, as well as in the newly added ref. . We have elaborated on this in the introduction in the hopes of resolving any confusion.
1B. In the context of the previous point, we are merely making the observation that the distributions will be Gaussian at infinite width, which follows directly from the central limit theorem (CLT) regardless of the initial (potentially non-Gaussian) choice of priors from which the parameters are drawn. We have added an explicit citation to the original thesis by Neal (previously included only implicitly via ref. ), as requested.
2. The exercise suggested by the reviewer is indeed sensible, but has in fact already been done in [17,18]. We have alluded to and cited this discrepancy between the infinite-width prediction and the empirical results at finite N in the introduction (around page 3), as well as at the beginning of section 4 (page 30) and the end of section 4.3.3 (page 50). As the reviewer points out, however, this is a central point of our work, so we have added a new paragraph in the Discussion (page 61) where we discuss this in more detail and explicitly acknowledge this limitation of the perturbative approach. We also agree with the reviewer that the remark at the end of section 4.3 about a possible shift in the location of the critical point appearing at higher orders is not intuitive. We have removed this sentence, and instead mention this possibility in the aforementioned new paragraph.
3. While the reviewer is certainly correct that complicated dependencies may arise under training, here we are concerned with networks at initialization; we have edited the sentence above (2.15) to more clearly reflect this. In principle, one could consider a multivariate Gaussian with non-vanishing covariances between different parameters, but this would lead to an intractable increase in the number of couplings and a corresponding explosion of possible Feynman diagrams; e.g., if one coupled only two parameters, one would have a bivariate measure as in (2.46), resulting in a new coupling. With 5 parameters, there would be 26 possible couplings.
4. We thank the reviewer for this excellent suggestion to improve the manuscript, and have added a new appendix (now appendix A) enumerating the various elements of the NN-QFT dictionary. We added sentences referring the reader to this dictionary in the Introduction as well as the beginning of the Discussion.
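As a purely illustrative aside (not part of the manuscript), two of the quantitative statements in points 1B and 3 above can be checked with a short Python sketch. The first part draws preactivations z = (1/sqrt(N)) sum_i w_i from non-Gaussian (uniform) weights and tracks the sample excess kurtosis, which should approach the Gaussian value 0 as the width N grows. The second part counts couplings under the assumption that one new coupling arises for every subset of at least two parameters; this counting rule is our reading of how the number 26 arises, not a statement taken from the paper. All function names, the unit inputs, and the choice of a uniform prior are hypothetical choices made for the example.

```python
import math
import random


def excess_kurtosis(samples):
    """Sample excess kurtosis; it vanishes for an exact Gaussian."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((s - mean) ** 2 for s in samples) / n
    m4 = sum((s - mean) ** 4 for s in samples) / n
    return m4 / m2**2 - 3.0


def preactivation_samples(width, n_samples, rng):
    """Draw z = (1/sqrt(N)) * sum_i w_i with uniform (non-Gaussian) weights.

    Assumption: unit inputs and unit-variance weights, for illustration only.
    """
    half_range = math.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has variance 1
    scale = 1.0 / math.sqrt(width)
    return [
        scale * sum(rng.uniform(-half_range, half_range) for _ in range(width))
        for _ in range(n_samples)
    ]


def count_couplings(n_params):
    """Count subsets of at least two parameters (assumed counting rule)."""
    return sum(math.comb(n_params, k) for k in range(2, n_params + 1))


if __name__ == "__main__":
    rng = random.Random(0)
    for width in (1, 4, 64):
        k = excess_kurtosis(preactivation_samples(width, 5000, rng))
        print(f"N = {width:3d}: excess kurtosis = {k:+.3f}")
    print("couplings among 2 parameters:", count_couplings(2))
    print("couplings among 5 parameters:", count_couplings(5))
```

For a sum of N independent uniform variables the exact excess kurtosis is -1.2/N, so the printed values should shrink toward zero with increasing width, up to sampling noise. Under the assumed counting rule, two parameters give a single coupling and five parameters give 10 + 10 + 5 + 1 = 26.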
Response to Report 2 by Dr. Harold Erbin:
Here let us first respond to the general points under "weaknesses":
As the referee points out under "strengths", we have gone to great lengths to present each step of the construction as clearly as possible. While the result is a relatively long and detailed paper, we believe this is well worth it for the pedagogical clarity and explicitness this achieves (e.g., for the purposes of future work). Nonetheless, in the context of the next point, we agree that it is not easy for the reader to keep track of various technical assumptions in relation to the big picture, and have addressed this in our detailed response to requested changes in the attachment (see in particular points 1 and 12). While the notion of balance is inherently subjective, we believe the result is improved.
We aspired to be as general as possible for as long as possible, which is why simplifying assumptions are introduced en route, rather than restricting ourselves to some less general class of models at the outset. We agree with the referee that some discussion of these assumptions in relation to the generality of the work would be an improvement, and have added a new section (appendix A.1) in which we treat each of these in detail. See also our response to this point in the aforementioned attachment.
We certainly agree that this is an important next step, as we have discussed in section 5. As the referee points out, however, this topic is in its infancy, and the aim of this work is explicitly theoretical, namely the construction of a direct correspondence between deep neural networks and quantum field theory. We hope to see (if not perform ourselves) thorough empirical explorations of this topic in the future, but such an investigation is beyond the scope of this initial work. We would also like to point out that the complementary approach  also presented no numerical tests in support of their derivations, being similarly focused on theoretical explorations. In the preface of , the authors claim that they have performed these tests privately, but have chosen not to show them for reasons explained therein; here, we have simply been explicit about the need for empirical tests in future work.
See previous point. However, we have mentioned the main practical benefit, namely predicting the location of the critical point, in several places, including the introduction, section 4, and section 5 (where some directions for future work are also discussed). At a more general level, however, our main goal is to further the fundamental theory of deep neural networks by leveraging powerful tools from theoretical physics, especially QFT, in the spirit of the previous works on the NN-QFT correspondence that we have cited.
We hope that you will kindly consider the resubmitted manuscript for publication in SciPost.
Sincerely yours, K. Grosvenor and R. Jefferson
List of changes
Please see the pdf attached with our reply for a detailed response to the feedback and suggestions provided in Dr. Erbin's attachment. For convenience, we have copy-pasted the original text of the latter to make the document self-contained.