SciPost Submission Page

Loop Amplitudes from Precision Networks

by Simon Badger, Anja Butter, Michel Luchmann, Sebastian Pitz, Tilman Plehn

Submission summary

Authors (as registered SciPost users): Michel Luchmann · Tilman Plehn
Submission information
Preprint Link: scipost_202301_00016v1  (pdf)
Date accepted: 2023-03-15
Date submitted: 2023-01-11 11:43
Submitted by: Luchmann, Michel
Submitted to: SciPost Physics Core
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Phenomenology
Approach: Phenomenological

Abstract

Evaluating loop amplitudes is a time-consuming part of LHC event generation. For di-photon production with jets we show that simple, Bayesian networks can learn such amplitudes and model their uncertainties reliably. A boosted training of the Bayesian network further improves the uncertainty estimate and the network precision in critical phase space regions. In general, boosted network training of Bayesian networks allows us to move between fit-like and interpolation-like regimes of network training.

List of changes

Report 1:

1- In section 2, the authors describe the dataset they use, which is
taken from [8]. A set of cuts is then applied, but no justification for them
is presented. I understand that one must make a choice in the application of
cuts to avoid singularities. However, since later on the paper shows that
the performance deteriorates for large amplitudes, i.e. presumably close to
the cut boundaries, one might be led to believe that the cuts are selected
to enhance performance. It would be useful if the BNN could for instance be
trained on looser or tighter cuts, to find out if the qualitative
conclusions would be different.

-> Originally, the cuts were meant to mimic the detector acceptance,
but at least for the 2-jet case they also cut off singularities. We
have added more information to the text.

As a side-question, why are the amplitudes real rather than complex numbers?

-> Indeed, only the squared amplitudes we learn are real. We corrected
this in the text.

2- Why do the authors choose to make use of a 20-dimensional
representation of the phase space, while its actual dimension is only
7-dimensional (2 initial-state momentum fractions, and 3×3−4=5
final-state components)? Would a lower-dimensional representation not
improve performance?

-> The goal of this paper is to show that the amplitudes can be learned
precisely and at face value. For an actual application we will exploit
this additional handle, and we have early studies showing that this
does help. We added this discussion to the text.
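
For orientation, the counting behind this question, spelled out (standard phase-space counting; the 20 network inputs are presumably just the components of the five external four-momenta):

\begin{align}
  d_\text{phase space} &= 2 + (3\,n_\text{final} - 4) = 2 + (3 \cdot 3 - 4) = 7 , \\
  d_\text{network input} &= 4\,n_\text{external} = 4 \cdot 5 = 20 .
\end{align}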

3- At the beginning of section 3, it is claimed that the MSE loss limits
the performance of implementations without Bayesian architectures. What is
the justification for this claim? In particular, the BNN loss also has a MSE
term that is responsible for incentivizing the BNN to match predicted and
real amplitudes.

-> We have clarified that statement in this paragraph.
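
To make the distinction concrete, a minimal sketch of the two loss structures referred to here, omitting the KL regularization of the Bayesian network (the exact loss used in the paper is the one given there):

\begin{align}
  \mathcal{L}_\text{MSE} &= \sum_i \big( \bar{A}(x_i) - A_i \big)^2 , \\
  \mathcal{L}_\text{heteroscedastic} &= \sum_i \left[ \frac{\big( \bar{A}(x_i) - A_i \big)^2}{2\,\sigma_\text{model}(x_i)^2} + \log \sigma_\text{model}(x_i) \right] ,
\end{align}

where the second form contains the same squared deviation, but weighted by a learned, point-dependent uncertainty.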

4- Under \textbf{Bayesian networks and uncertainties}, there is a
sentence 'By definition, the Bayesian network includes a generalized dropout
and an explicit regularization term in the loss, which stabilize the
training.' I believe the authors mean, respectively, the sampling over the
network weights and the simultaneous output of σmodel, but
this is certainly not clear to the reader at this point in the text, and may
still be unclear after reading the full paper if one does not have the
prerequisite domain knowledge.

-> We clarified this sentence and added a reference to a long and
pedagogical introduction to our Bayesian network setup. We agree
that this paper requires some domain knowledge and believe that the
issue can be solved this way.

5- Further down, there is a sentence 'We implement the variational
approximation as a KL divergence.' This sentence has similar issues of
requiring the reader to be familiar with the lingo. I think the explanation
should state that p(ω|T) is the real distribution that is unknown, so
you introduce a trainable variational approximation q(ω), and define a
loss function that minimizes the difference.

-> We adopted the referee's suggestion for this explanation.
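
In formulas, the explanation suggested by the referee amounts to the standard variational-inference statement: the weight posterior p(ω|T) is intractable, so one introduces a trainable approximation q_θ(ω) and minimizes

\begin{align}
  \text{KL}\big[ q_\theta(\omega) \,\big\|\, p(\omega|T) \big]
  = \int d\omega \; q_\theta(\omega) \, \log \frac{q_\theta(\omega)}{p(\omega|T)}
\end{align}

with respect to the parameters θ of q_θ(ω).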

6- Before eq. 7, it is stated that the evidence can be dropped 'if the
normalization condition is enforced another way'. I do not believe that this
'other way' is clarified elsewhere. I believe that it should just say that
the evidence term does not depend on any trainable parameters, and can thus
be dropped from the loss.

-> We changed the wording to make clear that we enforce the
normalization by construction.
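
The underlying identity is the generic variational-inference rearrangement (written here for orientation, not as a quote of Eq.(6)):

\begin{align}
  \text{KL}\big[ q_\theta(\omega) \,\big\|\, p(\omega|T) \big]
  = \text{KL}\big[ q_\theta(\omega) \,\big\|\, p(\omega) \big]
  - \int d\omega \; q_\theta(\omega) \, \log p(T|\omega)
  + \log p(T) ,
\end{align}

where the evidence log p(T) does not depend on the trainable parameters θ and therefore only shifts the loss by a constant.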

7- While the authors already attempt to do so, I think the distinction
between σmodel and σpred could be
clarified further. My understanding is that...

-> We are not sure what exactly the referee means, but we made an
attempt to explain this distinction more clearly after Eq.(9).
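
For readers with the same question: one common way to organize the two uncertainties in such Bayesian-network setups is the law-of-total-variance split

\begin{align}
  \sigma_\text{pred}^2(x)
  = \int d\omega \; q(\omega) \, \sigma_\text{model}(x,\omega)^2
  + \int d\omega \; q(\omega) \, \big[ \bar{A}(x,\omega) - \langle A \rangle(x) \big]^2 ,
  \qquad
  \langle A \rangle(x) = \int d\omega \; q(\omega) \, \bar{A}(x,\omega) ,
\end{align}

with a learned per-point piece and a piece from sampling the network weights; the precise assignment of the labels σpred and σmodel to these pieces is what the text after Eq.(9) now spells out.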

8- The next paragraph then makes a link with the actual implementation of
the BNN. However, it says 'This sampling uses the network-encoded amplitude
and uncertainty values...', while the reader does not even know yet that the
model is meant to predict those. I would reorder some elements of this
section, by first saying that the BNN is essentially meant to model
p(A|x,ω), that Ā and σmodel are its mean and variance, and that you can
thus implement it as a feed-forward network that predicts that mean and
variance as a function of x,ω.

-> We have followed this suggestion and adapted the paragraph before
Eq.(12).
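
Written out, the suggested ordering starts from the per-point likelihood the network parametrizes, here assumed to be Gaussian:

\begin{align}
  p(A|x,\omega) = \mathcal{N}\big( A \,;\, \bar{A}(x,\omega), \, \sigma_\text{model}^2(x,\omega) \big) ,
\end{align}

so that a feed-forward network with weights ω maps the phase-space point x to the two outputs Ā(x,ω) and σmodel(x,ω).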

9- Under \textbf{Network architecture}, is the uncertainty enforced to be
positive after the final 40→2 linear layer through some activation?
Does the mean have an activation?

-> We added additional information to the paragraph.
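
As an illustration of the kind of construction the question is about, a minimal PyTorch-style sketch of a two-output head in which the mean is left without an activation and σmodel is made positive through a softplus; layer names and sizes are illustrative, not the paper's exact architecture:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AmplitudeHead(nn.Module):
    """Two-output head: mean (no activation) and positive sigma_model."""
    def __init__(self, hidden_dim: int = 40):
        super().__init__()
        self.out = nn.Linear(hidden_dim, 2)   # the final 40 -> 2 linear layer

    def forward(self, h: torch.Tensor):
        mean, raw_sigma = self.out(h).unbind(dim=-1)
        sigma = F.softplus(raw_sigma)         # enforce sigma_model > 0
        return mean, sigma

# usage: features from the preceding layers, shape (batch, 40)
# mean, sigma = AmplitudeHead()(features)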

10- Figures 2 and 3, it took me a while to figure out what the differences
were between the plots. Maybe in figure 2 add some text like '(train)' and
'(test)' underneath the gg→γγg labels. In figure 3 it is
especially confusing that the weight-dependent pull is shown above the
weight-independent pull, while the reverse is true in the text. I would
suggest splitting it into two figures, and adding some text to indicate
which pull is being plotted (even though it is also shown on the x-axis).

-> We have split Fig.3 and adjusted all figure captions.

11- The results before boosting show a bias towards positive values. It is
unclear to me if this is also captured by the model uncertainty. Please
elaborate.

-> We added a brief discussion of this effect, now at the end of the
section, together with a motivation for the boosting.

12- I think the reference to [14] should be supplemented by other works
that use this technique, like 2009.03796, 2106.00792. The comma after [14]
should also be a period.

-> We added the two references for re-weighting.

13- Above section 4.2, 'over-training though loss-boosting' →
'over-training through loss-boosting'

-> Changed.

Report 2:

1 - Section 2, what quarks run in the loop? Are top/bottom quarks included and, if so, are the masses fixed or left
as free parameters of the network?

-> We use the amplitudes from Ref.[9], generated with all quark
flavors, but do not attempt to infer the model parameters. Thinking
about it, this would be an interesting question, but it is beyond the
simple precision regression discussed here.

2 - Throughout the draft the authors appear to use the word "amplitudes" to mean "samples of the
amplitude"/"amplitude samples". In typical usage, one might say that the gg -> gam gam g process has a single
amplitude (or perhaps several helicity or color-ordered amplitudes). However, the authors refer to "90k training
amplitudes". Here I suspect they mean a single (or few) amplitudes sampled at 90k phase-space points.

-> We agree that our wording is not well-defined, so we now define it
on p.3. The referee is right, this is what we mean, and we now say so.

3 - Section 2, the authors write that each data point consists of a real amplitude. Usually, an amplitude is a
complex number. I suspect the authors are referring to the squared amplitude. If this is correct, this should be
stated more clearly.

-> We clarified this aspect.

4 - The authors use a highly-redundant 20-dimensional parametrization. Why do the authors not use, for example,
just the independent Mandelstam invariants? Can the authors demonstrate that a lower dimensional parametrization is
not better for their training (as one might naively expect)?

-> We now mention in the paper that the goal was to learn the squared
transition amplitude as precisely as possible, and with minimal
pre-processing. An optimized pre-processing will most likely be the
topic of a follow-up paper, and we do have initial results
indicating that a lower dimensionality helps quantitatively, but
does not change the qualitative results shown in this paper.
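
For concreteness, a small Python sketch of the kind of lower-dimensional pre-processing the referee has in mind, mapping the external four-momenta to pairwise invariants s_ij = (p_i + p_j)^2; which subset of these is kept as independent inputs is a convention choice, and this is not the parametrization used in the paper:

import numpy as np

def minkowski_square(p: np.ndarray) -> np.ndarray:
    """p[..., 0:4] = (E, px, py, pz); returns p^2 with the (+,-,-,-) metric."""
    return p[..., 0]**2 - np.sum(p[..., 1:]**2, axis=-1)

def pairwise_invariants(momenta: np.ndarray) -> np.ndarray:
    """momenta: (n_particles, 4) external four-momenta of one phase-space point;
    returns all s_ij = (p_i + p_j)^2 with i < j."""
    n = momenta.shape[0]
    return np.array([
        minkowski_square(momenta[i] + momenta[j])
        for i in range(n) for j in range(i + 1, n)
    ])

# for a 2 -> 3 point there are 5 external momenta and 10 pairwise invariants,
# of which only a subset is independent after momentum conservation and
# on-shell conditions are imposed.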

5 - Below Eq(6) the authors write "if we enforce the normalization condition another way" in order to justify
dropping the final term in Eq(6). The exact procedure is not clear to me, where is the normalization condition
enforced? Perhaps this sentence can be reworded for clarity.

-> As also requested by Referee 1, we re-worded this paragraph to make
it clear.

6 - In the first line of Eq(8) the authors refer to p(A|w,T). Perhaps I am misunderstanding or missing a step, is
this a typo or does this indeed follow from Eq(4)?

-> Thank you for pointing this out, we corrected the typo.

7 - In the second paragraph of "Network architecture" the authors write "The 2->3 part of the reference process in
Eq(25)", do they mean to reference instead Eq(1)?

-> Again, thank you for pointing this out, it should be Eq.(1).

8 - Comparing the \delta^(test) panel of Figure 2 with that of Figure 6 it naively appears that significantly more
test points fall into the overflow bin (for the largest 0.1% and 1% of amplitude values) after loss-based boosting.
Could the authors please comment further on this and to what extent do they consider it a problem?

-> We consider this a problem and one of the motivations to move
towards performance boosting, now clarified in the text.

9 - In Figure 8, although the \delta^(test) performance does seem to be broadly improved, again, significantly more
test points fall into the overflow bin than in Figure 2 or Figure 6. Could the authors comment further on this?

-> Our main point is that performance boosting provides comparable
precision for all amplitudes, not just the majority of
amplitudes. We clarify this point in the discussion of (now)
Fig.9.

10 - Sec 5, final paragraph, the authors write that "The uncertainties for the training data still cover the
deviations from the truth, but unlike the central values this uncertainty estimate does not generalize correctly to
the test data". If I have understood Figure 10 correctly, this fact is visible in the bottom (model/test) plot of
the lower panel, where the green band no longer reflects the true uncertainty (which is presumably ~the grey band).
One of the strengths of the authors' approach is that it provides not only an estimate of the amplitude but also of
the uncertainty of the estimate. The authors write "This structural issue with process boosting could be
ameliorated by alternating between loss-boosting and performance-boosting", can the authors demonstrate that an
additional loss-boosting step would improve the quality of the uncertainty estimate without reversing the
improvement in the performance? This would be a very strong and convincing argument for using their proposed
procedure.

-> We agree with the referee, and we have evidence that for individual
trainings the alternating approach works, but unfortunately we do not
have a reliable algorithm or method which we could present in this
paper. We are at it, though...

11 - Sec 6, there is a typo "boosteing"

-> Corrected, thank you!

Published as SciPost Phys. Core 6, 034 (2023)


Reports on this Submission

Anonymous Report 2 on 2023-1-26 (Invited Report)

Report

I would like to thank the authors for addressing all of the points made in my initial report. Their response has clarified my open questions.

This work addresses an important question in a significant way and is well-presented. I would recommend publication in SciPost Core.

Anonymous Report 1 on 2023-1-11 (Invited Report)

Report

I am happy with the changes the authors have implemented, and would recommend the paper for publication.
