SciPost Submission Page
Extrapolating Jet Radiation with Autoregressive Transformers
by Anja Butter, François Charton, Javier Mariño Villadamigo, Ayodele Ore, Tilman Plehn, Jonas Spinner
This is not the latest submitted version.
Submission summary
| Authors (as registered SciPost users): | Javier Mariño Villadamigo · Ayodele Ore · Tilman Plehn · Jonas Spinner |
| Submission information | |
|---|---|
| Preprint Link: | https://arxiv.org/abs/2412.12074v1 (pdf) |
| Code repository: | https://github.com/heidelberg-hepml/jetgpt-splittings |
| Date submitted: | Dec. 31, 2024, 11:47 a.m. |
| Submitted by: | Jonas Spinner |
| Submitted to: | SciPost Physics |
| Ontological classification | |
|---|---|
| Academic field: | Physics |
| Specialties: | |
| Approaches: | Computational, Phenomenological |
Abstract
Generative networks are an exciting tool for fast LHC event generation. Usually, they are used to generate configurations with a fixed number of particles. Autoregressive transformers allow us to generate events with variable numbers of particles, very much in line with the physics of QCD jet radiation. We show how they can learn a factorized likelihood for jet radiation and extrapolate in terms of the number of generated jets. For this extrapolation, bootstrapping training data and training with modifications of the likelihood loss can be used.
Author indications on fulfilling journal expectations
- Provide a novel and synergetic link between different research areas.
- Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
- Detail a groundbreaking theoretical/experimental/computational discovery
- Present a breakthrough on a previously-identified and long-standing research stumbling block
Current status:
Reports on this Submission
Strengths
- Detailed explanation of the physics motivation for using autoregressive transformers to model QCD events with an approximately iterative pattern.
- Comprehensive test of multiple extrapolation methods (naive, bootstrapping, truncated loss, and override loss), with a description of the failure modes of each method.
- Novel use of an override loss computed from the transverse momentum imbalance.
Weaknesses
- Difficult to follow how the bootstrapping approach allows the network to learn the correct multiplicity for 7-jet events. I understand that the fraction of generated events in the training dataset is a hyperparameter, and it is clear from Figs. 5 and 8 that bootstrapping increases the number of generated 7-jet events compared to the naive extrapolation.
- Lack of explanation for why the override loss yields different kinematic distributions than the truncated loss.
Report
I recommend publication after some minor revision.
Requested changes
- Regarding point 1 in "weaknesses": A figure visualizing the bootstrapping workflow, combined with a more explicit step-by-step explanation in the main text, would significantly improve the clarity of this method.
- If possible, please provide some possible explanations for the differences in the generated kinematic distributions when using the two different losses: truncated vs. override.
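To illustrate the level of step-by-step detail I have in mind for the bootstrapping description, a loop along the following lines would already help. This is only a sketch with hypothetical names and interfaces, not necessarily the authors' exact procedure:

```python
import random

def bootstrap_rounds(model, train_events, n_rounds, gen_fraction, max_train_jets=6):
    """Sketch of iterative bootstrapping: after each training round, events
    that the current model happens to generate in the extrapolation region
    (more than `max_train_jets` jets) are mixed back into the training set.
    `gen_fraction` is the hyperparameter controlling how many generated
    events are added per round."""
    data = list(train_events)
    for _ in range(n_rounds):
        model.fit(data)                          # (secondary) training round
        generated = model.sample(len(data))      # sample from the current model
        extrapolated = [e for e in generated if e["n_jets"] > max_train_jets]
        n_add = int(gen_fraction * len(data))
        data += random.sample(extrapolated, min(n_add, len(extrapolated)))
    return model
```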
Recommendation
Ask for minor revision
Report
Summary:
The main contribution of this paper is a study of how well a probabilistic autoregressive transformer can extrapolate to generate events with more jets (N = 7 or 8) than seen during training (up to 6). Since the extrapolation is poor “out of the box”, the authors identify three innovative ways to improve it. The first involves a series of secondary trainings that introduce examples generated by the pretrained network that happen to fall inside the extrapolation region. The other two techniques modify the loss function: one avoids learning the sharp cutoff in the N-jets distribution of the training data, and the other imposes constraints based on transverse momentum conservation. Each approach improves extrapolation, with the latter two performing best, and each exhibits some difficulty in modelling compared to in-distribution tests. The paper is a good example of documenting intermediate steps in research, including some steps that were less effective than others, making it very instructive and thought-provoking for readers.
General comments:
- It would enhance the paper’s appeal to a broader audience to state how this generative model, in particular, could benefit future collider experiments. If the model can accurately generate Z+jets events, could it do so with better alignment to real data? You mention new analysis techniques – could you expand on that? Traditional event generators are already quite fast, so speed is probably not the main opportunity.
- The linear binning in Figures 6, 7, 9, and 11 makes it difficult to judge the agreement with truth beyond the bulk of the distribution, which is critical for momentum spectra. This is especially true for the sum-of-pT distributions, where the x-axis range seems prematurely cut off. Could you please try logarithmic binning, or simply increase the bin width? Similarly, the ratio panels would probably become more informative with smaller statistical uncertainties on the truth and perhaps a larger y-axis range.
- For each of the extrapolation techniques, after pointing out the areas of clear mismodelling, the authors argue that these results demonstrate that “autoregressive transformers can learn the universal nature of jet radiation.” This raises the question: why is there mismodelling if the model in fact learned universal jet behavior? Either the variables considered do not actually test these universal properties, or, if they do, the model failed to learn them (at least beyond the training data).
- This paper includes an appendix of nearly 6 pages discussing a novel approach that is otherwise only mentioned in the last sentence of the outlook section. How could this side-investigation be more naturally included in the paper’s main discussion? If this proves too difficult, have you considered publishing the DiscFormer work in a separate paper?
Ideas for future work:
- From what I can tell, you don’t generate jet constituents, but rather the jets themselves. How hard would it be to couple your jet-level transformer with a constituent-level transformer? The latter could be conditioned on the jet features and evaluated independently for each jet. This limitation/opportunity might be worth discussing in the Outlook section, but it’s entirely up to you.
- Transverse momentum conservation is a condition you check to assess model performance, but what if you could incorporate it directly into the model design? This way, the model would not have to learn it implicitly and might converge better. It appears that you do something similar in your “override” technique, where you use momentum conservation to guide the model’s extrapolation to high jet multiplicity.
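To make the second idea concrete: one simple way to build transverse momentum conservation into the generator, rather than have it learned, would be to sample all but one object freely and close the event by fixing the final jet's transverse momentum from the recoil. This is only an illustrative sketch of the idea, not something the paper does:

```python
import math

def close_transverse_momentum(px_jets, py_jets, lepton_px, lepton_py):
    """Sketch: fix the final jet's transverse momentum so that the event
    balances exactly, px/py summing to zero by construction.  Inputs are
    the (px, py) of the freely generated jets and the combined lepton
    transverse momentum."""
    px_last = -(sum(px_jets) + lepton_px)
    py_last = -(sum(py_jets) + lepton_py)
    pt_last = math.hypot(px_last, py_last)
    phi_last = math.atan2(py_last, px_last)
    return pt_last, phi_last
```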
Inline comments:
Abstract: “Autoregressive transformers allow us to generate events with variable numbers of particles, very much in line with the physics of QCD jet radiation.” ⇒ This makes it sound like autoregressive transformers are unique in this respect, but there are other ways to handle variable-length outputs. I would suggest: “Autoregressive transformers are an effective approach to generate events…”
P2: “The goal of this paper is to show, for the first time, that a generative transformer can extrapolate in the number of jets and generate approximately universal jet radiation for higher jet numbers than seen during the training.” ⇒ Can the models in the above references also extrapolate, or is this unique to the autoregressive architecture?
P3: “The iterative structure of Eq.(1) allows us to simulate parton splittings as Markov processes” ⇒ If the splittings are strictly Markovian, doesn’t that imply that the probability for each to occur is agnostic to the details of the preceding splitting history? If this assumption is valid, why do we need a transformer to model the sequence of splittings? There would be nothing to learn about the correlations between one splitting and the next, which are exactly what attention is designed to capture. In this case, it wouldn’t be incorrect to use a transformer, but it should be equally effective to use a simpler model that ignores correlations between elements in the set. It would be instructive to clarify this point.
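To spell out the distinction: under a strictly Markovian chain, the sequence likelihood reduces to nearest-neighbour factors, p(x1) ∏ p(x_i | x_{i−1}), which a far simpler model than an attention-based transformer could fit. A toy numerical illustration, with hypothetical transition probabilities:

```python
import math

def markov_log_likelihood(sequence, p_init, p_trans):
    """Toy illustration: if splittings are strictly Markovian, the joint
    likelihood factorizes into nearest-neighbour transitions only,
    p(x1) * prod_i p(x_i | x_{i-1}).  No conditioning on the full
    splitting history (what attention provides) is required."""
    logp = math.log(p_init[sequence[0]])
    for prev, cur in zip(sequence, sequence[1:]):
        logp += math.log(p_trans[(prev, cur)])
    return logp
```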
P4: “2.2 Z + jets dataset” ⇒ It appears that MadGraph, Pythia, anti-kT, and FastJet should be cited. Similarly, it would be better to either cite, unpack, or drop the reference to “CKKW”.
P5: “500M events” ⇒ Please state the CoM energy and whether these are proton-proton collisions
P5: “The jets are defined with FASTJETv3.3.4 using the anti-kT algorithm” ⇒ R = 0.4 jets? Any rapidity cuts?
P5: “muons” ⇒ are there also Z->ee events or only muons? Just a clarification, since only “leptonically” was specified above.
P5: “ordered in transverse momentum” ⇒ ordered in descending transverse momentum (?)
P5: “For 10 jets the phase space is 45-dimensional.” ⇒ Shouldn’t it be 4×10 + 3×2 = 46? Is the leading muon phi degree of freedom squashed?
P6: “the sequence of particles x1, . . . , xi−1” ⇒ Are you referring here to the jet constituents produced by the parton shower (many) or to the progenitor partons (few) that correspond roughly to individual jets?
P6: “The Kronecker delta” ⇒ Add “, $\delta_{in}$,”
P7: “For Zµµ+jets events, we also treat the muons autoregressively and enforce a splitting probability of one for them.” ⇒ Why would you have the model learn this when it is obvious from physics considerations that the muons must be present? You could simply focus on the hadronic content of the event.
P7: “we factorize the likelihood of individual particles pkin(xi+1|x1:i) in terms of their components. The ordering of components can affect the network performance” ⇒ This makes it sound like each particle’s pt, eta, phi, mass get generated one after the other as a sequence (just like the particles themselves). I think that all you mean to say is that you split the joint distribution into a product of 4 separate 1-D distributions. However, looking closely at Eq. 17, I see that each subsequent factor also uses as input the sampled kinematic value from the previous factor. This was not immediately clear to me. Can you please try to clarify the text, and also motivate why these “i+1” dependencies are useful?
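To show the reading I eventually arrived at for Eq. 17: each particle's four components are drawn from a chain of 1-D conditionals, where every later component is conditioned on the values already sampled for the same particle, in addition to the event history. A sketch with hypothetical names, just to pin down my interpretation:

```python
def sample_particle(cond_samplers, history):
    """Sketch of the Eq.(17)-style factorization as I understand it: the
    joint density over a particle's components is split into a chain of
    1-D conditionals, and each later component sees the components
    already sampled for this particle (the "i+1" dependencies), on top
    of the event history.  `cond_samplers` maps a component name to a
    hypothetical sampler f(history, partial_particle) -> value."""
    particle = {}
    for comp in ("pt", "eta", "phi", "mass"):
        particle[comp] = cond_samplers[comp](history, dict(particle))
    return particle
```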
P8: “Autoregressive transformer” ⇒ There are a few details missing about inputs and outputs. Please update the text to address the following questions:
a) Input:
- What is the start token?
- Which event-level features are used to condition the generation, and how are they embedded?
b) Output:
- Are generated particles ordered in decreasing pT, i.e. is the last generated jet always the lowest-pT one?
- Do the muons have to be generated from scratch, or do you copy them from the generator?
- Is particle ID generated, or is there some other way to distinguish the muons?
P13: “However, there are deviations in the kinematic features from the truth that are not covered by the Bayesian uncertainty.” ⇒ This makes sense. It seems that the bootstrapping procedure would bring the 7-jet events into the training distribution in terms of their cardinality, but not quite in terms of their (potentially different) kinematics, since this wouldn’t have been learned properly by the network.
Formatting:
P9: “The jet multiplicity distribution is shown in Fig. 4” ⇒ Fig. 4 mentioned before Fig. 3
P14: “3.4 Extrapolation with override” ⇒ Should reference Fig. 8 again here (right-hand side)
Typos:
- P3: “might be not be sufficient”
- P16: “this override approach significantly increasing the fraction”
- P17: “We find that all both approaches” ⇒ I assume you meant to say that “truncated” and “override” both perform similarly well, improving over naive extrapolation and bootstrapping.
Recommendation
Ask for minor revision
