SciPost Submission Page
Learning a Restricted Boltzmann Machine using biased Monte Carlo sampling
by Nicolas Béreux, Aurélien Decelle, Cyril Furtlehner, Beatriz Seoane
This is not the latest submitted version.
This Submission thread is now published as
Submission summary
Authors (as registered SciPost users):  Aurélien Decelle 
Submission information  

Preprint Link:  https://arxiv.org/abs/2206.01310v1 (pdf) 
Date submitted:  2022-06-06 13:08 
Submitted by:  Decelle, Aurélien 
Submitted to:  SciPost Physics 
Ontological classification  

Academic field:  Physics 
Approach:  Computational 
Abstract
Restricted Boltzmann Machines are simple and powerful generative models capable of encoding any complex dataset. Despite all their advantages, in practice, trainings are often unstable, and it is hard to assess their quality because dynamics are hampered by extremely slow time dependencies. This situation becomes critical when dealing with low-dimensional clustered datasets, where the time needed to sample ergodically the trained models becomes computationally prohibitive. In this work, we show that this divergence of Monte Carlo mixing times is related to a phase coexistence phenomenon, similar to that encountered in Physics in the vicinity of a first-order phase transition. We show that sampling the equilibrium distribution via Markov Chain Monte Carlo can be dramatically accelerated using biased sampling techniques, in particular the Tethered Monte Carlo method (TMC). This sampling technique solves efficiently the problem of evaluating the quality of a given trained model and the generation of new samples in reasonable times. In addition, we show that this sampling technique can also be exploited to improve the computation of the log-likelihood gradient during the training, which produces dramatic improvements when training RBMs with artificial clustered datasets. When dealing with real low-dimensional datasets, this new training procedure fits RBM models with significantly faster relaxational dynamics than those obtained with standard PCD recipes. We also show that TMC sampling can be used to recover the free-energy profile of the RBM, which turns out to be extremely useful to compute the probability distribution of a given model and to improve the generation of new decorrelated samples on slow PCD-trained models.
Reports on this Submission
Report #2 by Anonymous (Referee 2) on 2022-07-10 (Invited Report)
Cite as: Anonymous, Report on arXiv:2206.01310v1, delivered 2022-07-10, doi: 10.21468/SciPost.Report.5374
Strengths
(1) The paper recalls the context well and exemplifies the cornerstone issue with classical methods in simple, pedagogical examples.
(2) The experiments are presented with significant detail allowing to better understand the inner workings and advantages of the algorithm.
(3) The proposed method enables both the learning of RBMs in multimodal cases (a.k.a. clustered datasets) but also offers new opportunities in evaluating the quality of training of these generative models.
Weaknesses
(1) The proposed method only applies when one can identify a few dimensions that separate the modes.
Minors:
(2) Some minor details are not entirely clear in the numerical sections (see report below), and the text of this results section could be somewhat condensed to ease the reading.
Report
The paper proposes a detailed demonstration of the application of a biased MCMC to the training of RBMs. Namely, the authors suggest using the principal components of the training set to guide a tethered Monte Carlo when computing the gradients of the log-likelihood. The paper is clearly written and presents convincing experiments demonstrating that the method can be applied in practice, within its domain of application. Indeed, as pointed out by the authors, the limitations of the proposed method lie in (1) the type of structure required in the data and (2) the ability to identify this structure from the training set. Although one can expect that these requirements are stringent and will not be met in most use cases, I would argue that the present paper raises an important question and offers a relevant discussion: few works have discussed the equilibration issue of MCMC in RBM training.
Requested changes
Questions for the authors:
- It is mentioned that the MNIST images are qualitatively better; could the authors back this claim by including such a figure?
- Do the authors understand why the TMC-trained RBMs seem to have shorter relaxation times for SMC algorithms?
- In particular for MNIST, do the authors understand why the TMC-trained RBM is well sampled by an SMC? Does this suggest that the MNIST 0/1 modes are actually not that separated? Would that also be the reason why it is usually considered that MNIST can be learned with CD or PCD by an RBM (as the authors demonstrate that multimodality a priori prohibits any learning with traditional MCMCs for clustered datasets)?
Minor:
- There seems to be a notation change at the bottom of page 8, where the hidden variables are denoted by $\tau$.
- Also, what does “permanent chain” refer to? Is it the “persistent chain” of PCD?
- The terminology “low-dimensional datasets” (“6.1.1. one-dimensional dataset”) is perhaps slightly confusing. It is not entirely clear whether the considered datasets actually have low-dimensional subspace support embedded in high dimension, or whether the only assumption is that a projection onto a low-dimensional subspace is enough to differentiate the clusters. Can the authors clarify?
- In Figure 5B, why are there two red lines?
- In Figure 5C, what do the black points represent? The dataset?
- Taking only the 0s and 1s in MNIST should yield about 10,000 images rather than 10^5 (page 16).
- Unless I missed something, it would be useful to describe the human genome dataset in more detail somewhere in the paper. Are the data binary? Of which dimension? Is there one variable per gene?
Report #1 by Anonymous (Referee 1) on 2022-07-05 (Invited Report)
Cite as: Anonymous, Report on arXiv:2206.01310v1, delivered 2022-07-05, doi: 10.21468/SciPost.Report.5336
Strengths
1. Novel approach for sampling from and training Restricted Boltzmann Machines, based on the Tethered Monte Carlo biased sampling algorithm.
2. Tackles an important limitation of RBMs: their inability to reliably learn from clustered datasets.
3. Well-grounded and innovative approach.
4. Thoroughly benchmarked on artificial and real datasets.
5. Limitations are clearly stated and investigated in the main text.
Weaknesses
1. The proposed approach is slow, as there is no efficient way to sample from the tethered distribution $P(v,h \mid \hat{m})$.
2. No runtime comparison between the proposed TMC sampling algorithm and the current state-of-the-art sampler (parallel Gibbs).
3. Impossible to tether more than 2-3 order parameters.
4. The MCMC equilibration time may still diverge due to energy barriers in orthogonal directions.
Report
Restricted Boltzmann Machines (RBMs) are simple two-layer neural networks that jointly learn a probability distribution and a representation of the data. RBMs can faithfully model complex probability distributions despite their simple architecture and are highly interpretable. However, training RBMs requires evaluating the moments of the model distribution, which is usually done by MCMC sampling. MCMC sampling is especially challenging for multimodal datasets, as in this case Markov Chains have very slow equilibration times.
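For readers less familiar with the sampler discussed here, the standard block-Gibbs (alternating) sampling of a binary RBM can be sketched as follows; the toy dimensions and randomly initialized weights are illustrative assumptions, not the paper's trained models:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b, c):
    """One block-Gibbs sweep of a binary RBM: sample h | v, then v | h.
    Within each block the units are conditionally independent, so both
    updates are done in parallel."""
    p_h = sigmoid(c + v @ W)                         # P(h_j = 1 | v)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(b + h @ W.T)                       # P(v_i = 1 | h)
    v = (rng.random(p_v.shape) < p_v).astype(float)
    return v, h

# Toy model: 6 visible and 3 hidden units with small random weights.
n_v, n_h = 6, 3
W = 0.1 * rng.standard_normal((n_v, n_h))
b, c = np.zeros(n_v), np.zeros(n_h)

v = (rng.random(n_v) < 0.5).astype(float)
for _ in range(100):
    v, h = gibbs_step(v, W, b, c)
```

With multimodal data, a chain like this can stay trapped near one mode for an exponentially long time, which is precisely the equilibration problem the paper addresses.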
Here, Béreux et al. propose to use a biased Monte Carlo sampling scheme originally introduced for physical systems with multiple coexisting phases. The Tethered Monte Carlo method (TMC) consists in:
1. Introducing one or a few auxiliary variables $\hat{m}$ coupled to the original system, such that $\langle \hat{m} \mid v,h \rangle = m(v)$, with $m(v)$ an order parameter computed from the data (here, a projection onto a principal component), and $\mathrm{Var}(\hat{m} \mid v,h) = 1/\alpha$, where $\alpha$ is a (large) tunable parameter.
2. Repeatedly sampling from the conditional distribution $P(v,h \mid \hat{m})$ for values of $\hat{m}$ covering its whole range.
3. Determining the marginal $P(\hat{m})$ by numerical integration of its log-gradient, estimated within each tethered simulation.
4. Estimating moments of the original distribution using the law of total probability.
The rationale is that the conditional distribution is often easier to sample from than the original one. The authors test their approach in sampling-only and full-learning settings on i) 1D and 2D artificial datasets, ii) MNIST 0/1, and iii) a human genome dataset.
They find that for simple multimodal data, samples generated by the TMC sampler faithfully represent the true, multimodal distribution (unlike naive Monte Carlo), leading to a successful training. For realistic data, the TMC sampler significantly improves upon the naive sampler, but the authors note that equilibration times can still diverge due to slow mixing along directions orthogonal to the tethered ones. The practical limitations of the approach (as stated by the authors) are: i) the lack of an efficient, factorized sampling scheme for the conditional distribution, and ii) the inability to tether more than 2-3 order parameters.
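The four steps above can be illustrated on a toy 1D bimodal target, with $m(x) = x$ as the tethered order parameter and Metropolis in place of Gibbs; all parameter values below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy target: bimodal p(x) ∝ exp(-beta (x^2 - 1)^2), modes near x = ±1
# separated by a barrier, mimicking a clustered dataset's order parameter.
beta = 8.0
def log_p(x):
    return -beta * (x**2 - 1.0)**2

alpha = 50.0                          # tether stiffness: Var(m_hat | x) = 1/alpha
m_grid = np.linspace(-1.5, 1.5, 31)   # grid of tethered values

def sample_tethered(m_hat, n_steps=4000, step=0.3):
    """Metropolis sampling of P(x | m_hat) ∝ p(x) exp(-alpha (m_hat - x)^2 / 2)."""
    x, xs = m_hat, []
    for t in range(n_steps):
        x_new = x + step * rng.standard_normal()
        dlog = (log_p(x_new) - log_p(x)
                - 0.5 * alpha * ((m_hat - x_new)**2 - (m_hat - x)**2))
        if np.log(rng.random()) < dlog:
            x = x_new
        if t >= n_steps // 2:          # keep the second half as "equilibrium"
            xs.append(x)
    return np.array(xs)

def trapz(y, x):
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

# Step 2: conditional averages at each grid value of m_hat.
mean_x = np.array([sample_tethered(m).mean() for m in m_grid])

# Step 3: the "tethered force" alpha (<x>_m - m) equals d log P(m)/dm;
# integrate it (trapezoid rule) to reconstruct the marginal P(m_hat).
force = alpha * (mean_x - m_grid)
log_P = np.concatenate(([0.0],
                        np.cumsum(0.5 * (force[1:] + force[:-1]) * np.diff(m_grid))))
P = np.exp(log_P - log_P.max())
P /= trapz(P, m_grid)

# Step 4: moments by the law of total probability; by symmetry <x> ≈ 0 here.
mean = trapz(P * mean_x, m_grid)
```

Each tethered chain only explores a narrow slice of configuration space, so it mixes quickly even though the unconstrained chain would be trapped in one mode.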
The proposed approach is innovative, well-implemented and carefully benchmarked. In principle, it efficiently solves the obnoxious problem of sampling for “vanilla” clustered datasets. The limitations of the approach for real-life datasets are properly investigated in the main text (but not in the abstract, see below). My main concern is that, as stated by the authors, the approach is too slow for large-scale data where the number of variables is ~1k-10k. A CPU runtime comparison between SMC and TMC should be added to the paper for clarity. Also, it is important that the speed/applicability limitations are stated in the abstract. The following claim in the abstract is not sufficiently substantiated:
"This sampling technique solves efficiently the problem of evaluating the quality of a given trained model and the generation of new samples in reasonable times. "
Altogether, I recommend publication with minor revisions.
One suggestion: did the authors experiment with tethering one or a few hidden units rather than the principal components? Sampling from the conditional $P(v,h \mid \hat{m})$ would be substantially easier, as conditional independence would be preserved. Moreover, hidden units may play a role similar to that of the principal components, as they learn the main collective modes of variation of the data and thus have slow dynamics. It should be possible to dynamically pick the tethered hidden units using heuristics such as weight norm / hidden-unit importance, or based on the autocorrelation function of their marginal. If two or three hidden units are chosen, they should be uncorrelated to maximize diversity.
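A minimal sketch of the selection heuristic suggested here (weight-norm ranking plus a greedy decorrelation filter); the 0.5 correlation threshold and the toy sizes are arbitrary illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

def pick_tethered_hidden_units(W, H, k=2):
    """Rank hidden units by incoming weight norm, then greedily keep units
    whose activations on data are weakly correlated with those already kept.

    W : (n_visible, n_hidden) weight matrix
    H : (n_samples, n_hidden) hidden activations, e.g. P(h=1|v) on the data
    """
    order = np.argsort(-np.linalg.norm(W, axis=0))   # strongest units first
    chosen = [int(order[0])]
    C = np.corrcoef(H, rowvar=False)                 # unit-unit correlations
    for j in order[1:]:
        if all(abs(C[j, i]) < 0.5 for i in chosen):  # 0.5: arbitrary threshold
            chosen.append(int(j))
        if len(chosen) == k:
            break
    return chosen

# Toy check with random weights and activations.
W = rng.standard_normal((20, 8))
H = rng.random((500, 8))
picked = pick_tethered_hidden_units(W, H, k=2)
```

Whether weight norm is the right importance proxy is an open question; the autocorrelation-based criterion mentioned above could be substituted in the same skeleton.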
Requested changes
1. Please provide runtime comparisons for TMC against SMC.
2. Please state the limitations of the proposed approach in the abstract.
3. Figure 6D: it seems that the training has not converged after 2k updates. Please show more training points to validate convergence, or discuss potential issues.
4. Please define the integrated autocorrelation time $\tau_{\mathrm{int}}$ in the paper, rather than referring the reader to a 500+ page textbook.
5. Typo p16: “just as we discussed in the contest of the artificial 1D dataset.”
6. Labels of Figure 9A seem swapped. Please check.
7. Spelling p19: “minima, and that in addition, even if one can “cherry peak" the exact number of updates where”
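Regarding point 4, a standard estimator of the integrated autocorrelation time, $\tau_{\mathrm{int}} = \frac{1}{2} + \sum_{t \ge 1} \rho(t)$ with $\rho(t)$ the normalized autocorrelation, can be sketched as follows; the cutoff at the first negative autocorrelation estimate is a simplistic illustrative windowing choice:

```python
import numpy as np

def tau_int(x):
    """Integrated autocorrelation time, tau_int = 1/2 + sum_t rho(t),
    truncating the sum at the first negative autocorrelation estimate."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    f = np.fft.rfft(x, 2 * n)                          # zero-padded FFT
    acf = np.fft.irfft(f * np.conj(f))[:n] / (x @ x)   # rho(0), rho(1), ...
    neg = np.where(acf < 0)[0]
    window = int(neg[0]) if len(neg) else n - 1
    return 0.5 + acf[1:window].sum()

# Sanity check on an AR(1) chain x_t = phi x_{t-1} + noise, for which
# rho(t) = phi^t and tau_int = 1/2 + phi / (1 - phi), i.e. 9.5 for phi = 0.9.
rng = np.random.default_rng(3)
phi, n = 0.9, 200_000
x = np.empty(n)
x[0] = 0.0
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.standard_normal()
```

Roughly $\tau_{\mathrm{int}}$ consecutive samples carry the information of one independent sample, which is what makes it a natural figure of merit for comparing the SMC- and TMC-trained models.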