SciPost Submission Page
Learning a Restricted Boltzmann Machine using biased Monte Carlo sampling
by Nicolas Béreux, Aurélien Decelle, Cyril Furtlehner, Beatriz Seoane
This Submission thread is now published as
Submission summary
Authors (as registered SciPost users):  Aurélien Decelle 
Submission information  

Preprint Link:  https://arxiv.org/abs/2206.01310v2 (pdf) 
Code repository:  https://github.com/nbereux/TMCRBM 
Date accepted:  20221027 
Date submitted:  20221006 11:00 
Submitted by:  Decelle, Aurélien 
Submitted to:  SciPost Physics 
Ontological classification  

Academic field:  Physics 
Specialties: 

Approach:  Computational 
Abstract
Restricted Boltzmann Machines are simple and powerful generative models that can encode any complex dataset. Despite all their advantages, in practice the trainings are often unstable and it is difficult to assess their quality because the dynamics are affected by extremely slow time dependencies. This situation becomes critical when dealing with lowdimensional clustered datasets, where the time required to sample ergodically the trained models becomes computationally prohibitive. In this work, we show that this divergence of Monte Carlo mixing times is related to a phenomenon of phase coexistence, similar to that which occurs in physics near a firstorder phase transition. We show that sampling the equilibrium distribution using the Markov chain Monte Carlo method can be dramatically accelerated when using biased sampling techniques, in particular the Tethered Monte Carlo (TMC) method. This sampling technique efficiently solves the problem of evaluating the quality of a given trained model and generating new samples in a reasonable amount of time. Moreover, we show that this sampling technique can also be used to improve the computation of the loglikelihood gradient during training, leading to dramatic improvements in training RBMs with artificial clustered datasets. On real lowdimensional datasets, this new training method fits RBM models with significantly faster relaxation dynamics than those obtained with standard PCD recipes. We also show that TMC sampling can be used to recover the freeenergy profile of the RBM. This proves to be extremely useful to compute the probability distribution of a given model and to improve the generation of new decorrelated samples in slow PCDtrained models. The main limitations of this method are, first, the restriction to effective lowdimensional datasets and, second, the fact that the Tethered MC method breaks the possibility of performing parallel alternative Monte Carlo updates, which limits the size of the systems we can consider in practice.
Author comments upon resubmission
Dear Editor,
Please find below the revision of our manuscript "Learning a Restricted Boltzmann Machine using biased Monte Carlo sampling", which we hereby submit for a second review. We would like to thank the reviewers for their careful and constructive reading of the previous version of the manuscript, and for their positive feedback and numerous comments and suggestions. In the following, we address all their comments and suggestions point by point.
Comments of the first referee
 It is mentioned that MNIST images are qualitatively better  could the author back this claim by including such a figure?
Answer: This claim is supported by Figure 9B, which compares the histograms of the projections of the data generated by the two machines with the histogram of the original data set. The histograms of the samples generated by the TMCRBMs are more similar to the data set than those generated by the SMCRBMs. As for the actual images, when examined visually, it is hard to say. Both machines produce good looking digits, only the SMC RBM tend to produce a greater imbalance between 0s and 1s and less variability. We have included several images of the equilibrium configurations obtained with each machine in Fig.9A.
 Do they authors understand why the TMC trained RBMs seem to have shorter relaxation times for SMC algorithms?
Answer: We have at least an empirical explanation. In a previous work [1], we observed that RBMs trained with nonconvergent MC chains to estimate the negative term of the gradient operate in the socalled nonequilibrium regime. These machines learn to encode the dynamical process they were trained with, rather than reproducing the statistics of the data set in their equilibrium measure. These machines with memory tend to have extremely slow dynamics because they are optimized to produce "good samples\" at some stage before reaching equilibrium. We further observed that RBMs trained with Rdm and some few steps ($k \sim 10$), produced RBMs with extremely slow dynamics, such as never ending aging effects. However, RBMs trained with PCD are more difficult to analyze because, first, the chain initialization statistics for the generation samples cannot be easily retrieved, and second, the effective number of Gibbs steps increases during training as the parameters change more slowly as training progresses. PCD RBM tends to favor equilibrium regimes but eventually falls out of equilibrium if the mixing times diverge, as is the case with the very clustered datasets. This means that outofequilibrium phenomenology is expected at times greater than the effective number of sampling steps of the persistent chain during training, which can be very long if the RBM has been trained for a very long time. Recent work has shown that upon training, RBMs eventually enter a degradation regime in which the mixing times of the model begin to diverge [2]. This degradation regime appears to be analogous to what we observed when we forced nonequilibrium sampling for too long during training. This degradation regime is then expected to arrive later in TMCRBMs because one can guarantee equilibrium for longer epochs. We have included this discussion in the new version of the manuscript.
 In particular for MNIST, do the author understand why the TMC trained RBMs is well sampled by an SMC? Does this suggest the MNIST 01 modes are actually not that separated? Would that also be the reason why it is usually considered that MNIST can be learned with CD or PCD by an RBM (as the authors demonstrate that the multimodality a priori prohibits any learning with traditional MCMCs for clustered datasets)?
Answer: Again, we rely mainly on intuition and empirical facts. Our intuition is that MNIST can be decently trained because, at least in the PCA space, all the images are somehow connected (as compared with the previous artificial datasets were clusters were completely separated already at the level of the PCA). The MNIST 01 is somewhat different, where only the first direction clearly separates the two types of digits. Nevertheless, this MNIST 01 data cannot be reduced to just one or two dimensions. In the first stage of learning, when the RBM has learned only the first direction, the classical SMC cannot correctly sample the model (as we show in Fig.8B). As more directions are learned, it is possible that the SMC will use other routes with lower barriers to switch between the different digits to explore the entire phase space.
Minor comments:
 There seems to be a notation change at the bottom of page (8) where the hidden variables are denoted by $\tau$.
Answer: Thank you for noticing it. Now it's corrected.
 Also, what does “permanent chain” refer to? Is it the “persistent chain”of PCD?
Answer: We agree with the referee, we changed permanent to persistent in the new version to avoid confusion.
 Maybe a terminology slightly confusing is the employed “lowdimensional datasets”, “6.1.1. onedimensional dataset”. Actually it is not entirely clear whether the considered datasets have actually lowdimensional subspace support embedded in high dimension, or whether the only assumption is that a projection onto a lowdimensional subspace is enough to differentiate the clusters. Can the authors clarify?
Answer: The two artificial datasets are created along a line and a plane, respectively. In this sense, we mean that they are lowdimensional. However, due to their inherent noise, they exhibit fluctuations in the other dimension as well. We have added a sentence in the manuscript to make this clearer.
 In figure 5B, why are there two red lines?
Answer: There were other eigenvalues on the figure that do not grow during learning and therefore appear as a straight line. We have now pointed out in the caption that the first 10 eigenvalues are shown.
 In figure 5C, what does the black points represent? The dataset?
Answer: They correspond to the dataset projected in the $m_0m_1$ space. We have now added a sentence in the caption of the figure clarifying this.
 Taking only the 01 in MNIST should yield about 10,000 images rather than 10ˆ5 (page 16).
Answer: Indeed, that was a mistake on our part. We have corrected the amount in the new version.
 Unless I missed something, it would be useful to describe more the human genome dataset somewhere in the paper. Are the data binary? In which dimension? Is there one variable per gene?
Answer: We agree with the referee that the description of the dataset was too brief. A new paragraph about this dataset has been amended at the beginning of the section to be more precise.
Comments of the second referee
 Please provide runtime comparisons for TMC against SMC.
Answer: We have added a new section "Runtime Comparison" to discuss the runtime of different methods on different architectures.
 Please state limitations of the proposed approach in the abstract
Answer: We have added a sentence discussing the limitations in the new abstract.
 Figure 6D: it seems that the training has not converged after 2K updates. Please show more training points to validate convergence or discuss potential issues.
Answer: The fullconvergence of RBMs is a complicated matter and eigenvalues do not generally converge. Empirically one finds that eigenvalues grow (slowly) forever unless certain regularization is introduced (or just one eigenvalue is necessary as in Fig.5A). In fact, it has been shown in [3] that the gradient with respect to the rotation of the eigenvectors $u^\alpha$ and $v^\alpha$ are quite unstable when two eigenmodes $w^\alpha$ and $w^\beta$ are getting closer. In the experiment corresponding to fig. 6D, we can first observe that the evolution is quite slow (the xaxis is in logscale) and the slowincreasing trend that we see is probability due to a mix of instability due to various reasons: the form of the gradient but also the noise due to the approximation when computing the negative term.
 Please define in the paper the integrated autocorrelation time $\tau_{\rm int}$ rather than referring the reader to a 500 pages text book.
Answer: We have added to the definition in the manuscript and included a new reference to lecture notes on Monte Carlo sampling.
 Typo p.16: “just as we discussed in the contest of the artificial 1D dataset.”
Answer: Corrected.
 Labels of Figure 9A seem swapped. Please check.
Answer: Thank you for your comment, this error has gone unnoticed. The colors and caption in the figure are now corrected.
 Spelling p19: “ minima, and that in addition, even if one can “cherry peak" the exact number of updates where”
Answer: Corrected.
One suggestion:
Did the authors experiment on tethering one or few hidden units rather than the principal components ? Sampling from the conditional $P(v,h  \hat{m} )$ would be substantially easier as conditional independence would be preserved. Moreover, hidden units may play a similar role as principal components, as they learn the main collective modes of variation of data and thus have slow dynamics. It should be possible to dynamically pick the tethered hidden units using heuristics such as weight norm / hidden unit importance or based on the autocorrelation function of their marginal. If two or three hidden units are chosen, they should be uncorrelated to maximize diversity.
Answer: We thank the referee for the suggestion. It would be computationally convenient to tether just one hidden node, but it is not clear a priori which particular variable to choose. In addition, the RBM "suffers" from permutation symmetry of the hidden nodes, which makes it particularly difficult to make a guess about the index of the hidden node that might encode the relevant information without knowing in advance the relevant feature describing the data. Although the referee provides good suggestions about how to select the hidden nodes, we believe that this would be a significant change to the method, since it would mean adapting the code for such a task (which implies tuning the metaparameters) and applying it to the paper's datasets. In any case, we add a sentence in the perspectives for future work.
References:
[1] Aurélien Decelle, Cyril Furtlehner, and Beatriz Seoane. Equilibrium and nonequilibrium regimes in the learning of restricted boltzmann machines. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
[2] Lennart Dabelow and Masahito Ueda. Three learning stages and accuracy–efficiency tradeoff of restricted boltzmann machines. Nature communications, 13(1):1–11, 2022.
[3] Aurélien Decelle, Giancarlo Fissore, and Cyril Furtlehner. Thermodynamics of restricted boltzmann machines and related learning dynamics. Journal of Statistical Physics, 172(6):1576–1608, 2018
List of changes
### **List of changes:**
All major and minor comments of the referees have been taken into account.
In particular, we added
 a new section to compare the runtime of the different methods, including a new figure.
 a panel to fig. 9 to compare qualitatively the results.
Finally, the introduction and abstract have been polished.
Published as SciPost Phys. 14, 032 (2023)