
SciPost Submission Page

How to Deep-Learn the Theory behind Quark-Gluon Tagging

by Sophia Vent, Ramon Winterhalder, Tilman Plehn

Submission summary

Authors (as registered SciPost users): Tilman Plehn · Sophia Vent · Ramon Winterhalder
Submission information
Preprint Link: https://arxiv.org/abs/2507.21214v1  (pdf)
Date submitted: Aug. 15, 2025, 10:44 a.m.
Submitted by: Ramon Winterhalder
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Phenomenology
Approaches: Theoretical, Computational

Abstract

Jet taggers provide an ideal testbed for applying explainability techniques to powerful ML tools. For theoretically and experimentally challenging quark-gluon tagging, we first identify the leading latent features that correlate strongly with physics observables, both in a linear and a non-linear approach. Next, we show how Shapley values can assess feature importance, although the standard implementation assumes independent inputs and can lead to distorted attributions in the presence of correlations. Finally, we use symbolic regression to derive compact formulas to approximate the tagger output.

Author indications on fulfilling journal expectations

  • Provide a novel and synergetic link between different research areas.
  • Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
  • Detail a groundbreaking theoretical/experimental/computational discovery
  • Present a breakthrough on a previously-identified and long-standing research stumbling block
Current status:
Awaiting resubmission

Reports on this Submission

Report #3 by Ezequiel Alvarez (Referee 3) on 2025-10-27 (Invited Report)

Strengths

1- Addresses the question of simulation validity in QCD
2- Obtains analytic formulas for an instantiation of a simulation

Weaknesses

1- Paper objectives are not clear

Report

The authors study the interpretability of a ParticleNet-Lite quark–gluon tagger trained on low-level jet constituents. They:
  • analyze the multidimensional pooled latent representation using PCA and a disentangled latent classifier (autoencoder + classifier);
  • assess feature importance via SHAP and discuss pitfalls from correlated inputs;
  • approximate the classifier with compact analytic formulas via symbolic regression (PySR).
They identify three leading latent directions aligned with familiar physics: (i) multiplicity/particle-type diversity, (ii) radial energy flow, and (iii) fragmentation/energy dispersion. They propose decorrelated observables to mitigate correlation issues for SHAP and present analytic formulas that mimic a tagger trained on selected observables. Robustness is partially explored by testing PCA directions across Pythia and Herwig.
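To make the first step concrete, here is a minimal sketch (my own illustration with synthetic arrays, not the authors' pipeline) of PCA on pooled latent features followed by correlating the leading components with a physics observable:

```python
# Minimal sketch (synthetic data, not the authors' pipeline): PCA on a
# pooled latent representation, then correlating the leading principal
# components with a physics observable such as constituent multiplicity.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latents = rng.normal(size=(10000, 32))       # pooled latent features per jet
latents[:, 0] *= 3.0                         # give one direction dominant variance
n_pf = 20 + 5 * latents[:, 0] + rng.normal(size=10000)  # toy observable

pca = PCA(n_components=3).fit(latents)
Z = pca.transform(latents)                   # leading latent directions
for i in range(3):
    r = np.corrcoef(Z[:, i], n_pf)[0, 1]
    print(f"corr(PC{i + 1}, n_pf) = {r:+.2f}")
```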

After reading the paper and going through some of its sections many times, I was still left with the feeling that something is missing. In my understanding, the reason is that the objective of the paper is not fully clear.

From my perspective, I see two important points:

1) Lowering the dimensionality to extract the right information. This is a common and useful (if not crucial) pattern in ML. Given that non-perturbative QCD cannot be modeled perfectly, it is reasonable to seek low-dimensional summaries, within a given simulator instantiation, that capture what the tagger needs.

2) Proposing analytic formulas that surrogate the network within one simulation setup.

The purpose of (2) is not sufficiently clear in the manuscript. This is not necessarily a problem if the authors argue convincingly for future utility. For example, such analytic formulas could serve as physics‑informed priors or mean functions in a Bayesian framework (e.g., estimated via Hamiltonian MC). One could promote the formula coefficients to parameters with weakly informative priors and let the Bayesian inference update them on data, or place a Gaussian‑process perturbation around the formula and infer hyperparameters from data.
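As an optional illustration of what such a recalibration could look like (a minimal sketch under assumptions: the surrogate form sigmoid(a*n + b), the weakly informative priors, and the synthetic jets are all hypothetical stand-ins for the paper's formulas and data):

```python
# Minimal sketch of Bayesian recalibration of a compact analytic
# surrogate. The surrogate f(n; a, b) = sigmoid(a*n + b) and the data
# are hypothetical stand-ins, not the paper's formulas or jets.
import numpy as np

rng = np.random.default_rng(6)

# Synthetic "new tune" data: multiplicity n, labels y (quark = 1);
# gluon jets tend to higher multiplicity.
n = np.concatenate([rng.poisson(15, 5000), rng.poisson(25, 5000)]).astype(float)
y = np.concatenate([np.ones(5000), np.zeros(5000)])

def log_post(theta):
    """Bernoulli likelihood of the surrogate plus weakly informative N(0, 10) priors."""
    a, b = theta
    p = np.clip(1.0 / (1.0 + np.exp(-(a * n + b))), 1e-9, 1 - 1e-9)
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    log_prior = -0.5 * (a**2 + b**2) / 10.0**2
    return log_lik + log_prior

# Random-walk Metropolis (a simple stand-in for Hamiltonian MC): update
# the formula coefficients on the new data, keeping the analytic form.
theta = np.array([-0.1, 2.0])                 # e.g. tune-A fit values
lp = log_post(theta)
chain = []
for _ in range(20000):
    prop = theta + rng.normal(0, 0.02, size=2)
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    chain.append(theta.copy())
post = np.array(chain[5000:])                 # drop burn-in
print("posterior mean (a, b):", post.mean(axis=0))
```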

However, to pursue these or related goals, a more robust assessment of simulation independence is needed, especially for the analytic formulas. I encourage the authors to focus on the robustness of the formulas themselves.

Concrete suggestions for strengthening robustness:
  • Go beyond a binary Pythia vs Herwig comparison by including multiple Pythia tunes. It is valuable not only to show cases where tune dependence is mild, but also to map out the limits of robustness.
  • Cross-tune transfer: train the symbolic formulas (and the small ML surrogates they approximate) on tune A and evaluate them on tune B; then vary the tunes until AUC/rejection/calibration degrades noticeably. This will quantify the range over which the formulas are reliable and where they break down. A minimal sketch of such a test follows below.
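The cross-tune transfer test could look as follows (a sketch under assumptions: the tune-A/tune-B arrays are synthetic stand-ins, and a logistic regression plays the role of the small surrogate):

```python
# Sketch of the suggested cross-tune transfer test. The "tunes" here are
# synthetic stand-ins: in a real study X_tuneA/X_tuneB would hold
# high-level observables from two Pythia tunes, with quark/gluon labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def make_tune(shift, size=4000):
    """Toy tune: one observable whose quark/gluon separation shifts."""
    x_q = rng.normal(0.0 + shift, 1.0, size)   # quark jets
    x_g = rng.normal(1.0, 1.0, size)           # gluon jets
    X = np.concatenate([x_q, x_g])[:, None]
    y = np.concatenate([np.ones(size), np.zeros(size)])
    return X, y

X_tuneA, y_tuneA = make_tune(shift=0.0)
X_tuneB, y_tuneB = make_tune(shift=0.3)

clf = LogisticRegression().fit(X_tuneA, y_tuneA)   # surrogate trained on tune A
auc_AA = roc_auc_score(y_tuneA, clf.predict_proba(X_tuneA)[:, 1])
auc_AB = roc_auc_score(y_tuneB, clf.predict_proba(X_tuneB)[:, 1])
print(f"AUC on tune A: {auc_AA:.3f}; transferred to tune B: {auc_AB:.3f}")
```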

Future‑work idea (optional, but illustrative of utility). If formulas trained on tune A degrade on tune B, compare “use‑as‑is on B” vs “Bayesian recalibration on B” in which the coefficients are inferred on B after seeing the data, while simultaneously using the formula for tagging. One can expect the latter to perform substantially better; this would showcase the practical value of having a compact analytic surrogate.

Recommendation: I recommend publication after major revision.

Requested changes

1- Clarify and sharpen the paper's objectives early in the introduction.
2- Tone down the title and several statements that currently oversell the results; align claims with demonstrated evidence.
3- Substantially expand the robustness study of the analytic formulas across simulations: include multiple Pythia tunes and explicitly identify the regime where performance meaningfully degrades (in AUC, or also in rejection or calibration), so readers can gauge the limits of simulation independence.

In my understanding, these changes would make the paper’s aims explicit, better calibrate the claims, and provide the robustness evidence needed for the proposed use of analytic surrogates.

Recommendation

Ask for major revision

  • validity: good
  • significance: good
  • originality: high
  • clarity: good
  • formatting: excellent
  • grammar: perfect

Report #2 by Anonymous (Referee 2) on 2025-10-17 (Invited Report)

Report

The manuscript titled "How to Deep-Learn the Theory behind Quark-Gluon Tagging" presents a thoughtful and timely XAI study of quark–gluon tagging, combining PCA/DLC analyses, SHAP attributions, and symbolic regression to connect ParticleNet-Lite to physics-motivated observables. The results are promising, e.g. clear latent directions (multiplicity, radial profile, fragmentation) and a compact multi-observable surrogate. Before publication, however, I would like to see a few points addressed regarding clarity/notation, reproducibility, and robustness. If the authors address the specific points listed below, I would be happy to recommend publication. My overall recommendation at this stage is major revision.

Requested changes

  1. Please report the exact ParticleNet-Lite baseline metrics directly in Fig. 3 or in the accompanying text: the AUC and the background rejection at 30% signal efficiency. Indicating both values will make comparisons to the PCA/DLC/symbolic-regression variants immediately clear (a sketch of this computation is given after this list).

  2. Please fix Eq. (7) for consistency with the zero-centering step. It should read Z = (X − μ_X) V (since Eq. (5) uses mean-centered features), unless you explicitly redefine X to already denote mean-centered data. In the latter case, please state this redefinition clearly before Eqs. (5)–(7).

  3. Please state explicitly how the class labels map to the network output. For clarity to non-experts, add a sentence early in Sec. 2 (or where the classifier is introduced) that the model outputs the quark probability (quark = 1, gluon = 0), i.e., scores closer to 1 indicate quark-like jets and scores closer to 0 indicate gluon-like jets.

  4. Please improve the visual distinguishability in Figs. 3 and 10. Several curves are rendered with similar shades of blue, which makes them hard to tell apart.

  5. Page 6: typo "correllation" (should be "correlation").

  6. Page 8: please fix the notation in Eq. (18): replace S with S_{frag} to avoid confusion with other entropies.

  7. Please add the label “reconstruction” next to x' in Fig. 6 to make the reconstruction head explicit.

  8. Please remove the extra (') in Table 1.

  9. Please add a brief stability analysis of the DLC latent vectors across random seeds. In particular, report whether the identities (and ordering) of z1, z2, z3 are consistent under retraining, and how stable their correlations with key observables (and the resulting ranking) are. This will clarify whether the physical interpretation assigned to z1, z2, z3 is robust.

  10. Please add the necessary implementation details to ensure reproducibility of both the DLC and the symbolic-regression pipelines. Alternatively, a public repository containing training scripts and configs would fully address this point.

  11. If I understand correctly, Fig. 8 is a scatter plot? If so, please state explicitly in the Fig. 8 caption that it is a scatter plot of SHAP values and the features are ordered by the mean absolute SHAP value across events. If I have misunderstood the plot or the ranking criterion, please clarify what a dot represents and the exact ordering rule.

  12. As a suggestion, please consider adding, in the Symbolic Regression section, score‐distribution plots on the test set for each optimal formula (1D/2D/7D). Overlay the quark and gluon histograms and compare them on the same axes with the corresponding MLP and ParticleNet-Lite outputs. These plots would reveal where the analytic formulas diverge from the learned models (e.g., in tails).

  13. As a suggestion aligned with your Outlook, maybe you could include a lightweight, reproducible benchmark where the optimal symbolic-regression formulas (1D/2D/7D) are evaluated as fast surrogates on an independent quark/gluon dataset and compared to ParticleNet-Lite in inference latency, memory usage and AUC.

  14. The title feels overstated. Unless Eq. (31) can be endowed with a clear, physics-grounded interpretation, I recommend softening the claim in the title.

  15. As EFP (Energy Flow Polynomials) provide a mathematically complete and IRC-safe basis for jet observables and have been used in prior work for interpretable or surrogate models, it would be valuable if the authors could comment on the choice of not using an EFP basis for symbolic regression.
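Regarding point 1, a minimal sketch of how the two baseline metrics could be computed (my own illustration; the scores and labels below are synthetic placeholders, not the paper's tagger outputs):

```python
# Sketch for point 1: AUC and background rejection at 30% signal
# efficiency. `scores` (quark probability) and `labels` (quark = 1)
# are synthetic placeholders for tagger outputs and truth labels.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(2)
labels = rng.integers(0, 2, 20000)
scores = np.clip(labels * 0.3 + rng.normal(0.4, 0.25, 20000), 0, 1)

auc = roc_auc_score(labels, scores)
fpr, tpr, _ = roc_curve(labels, scores)
# Background rejection = 1 / (gluon efficiency) at 30% quark efficiency.
eps_b = np.interp(0.30, tpr, fpr)
print(f"AUC = {auc:.3f}, rejection at eps_S = 0.3: {1.0 / eps_b:.1f}")
```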

Recommendation

Ask for minor revision

  • validity: good
  • significance: good
  • originality: good
  • clarity: high
  • formatting: excellent
  • grammar: excellent

Report #1 by Anonymous (Referee 1) on 2025-09-30 (Invited Report)

Report

The manuscript "How to Deep-Learn the Theory behind Quark-Gluon Tagging" studies the problem of discrimination of quark- from gluon-initiated jets in the context of machine learning. The goal of this study is to develop high-level, interpretable observables that have (almost) the same performance as state-of-the-art architectures. The authors approach this problem through various methods for reducing the dimensionality of the relevant observable space, as well as comparing to machine-learned functions of various observables known to perform well at this problem. While this work is very topical and in its motivation and goals is rather novel, due to serious shortcomings in the analysis, I cannot recommend it for publication.

I have many particular points that I will address below, but I will first describe the overall problem I see with this endeavor as laid forth in this paper. Unfortunately, these issues are so pervasive that I do not think any form of this paper could be publishable without an entirely new analysis.

The goal of this paper, as the authors state several times, is to produce performant and interpretable observables through various machine learning techniques. The performance aspect of this goal is rather well-defined, as a quantifiable metric for discrimination of quark versus gluon jets. The authors mostly focus on the area under the ROC curve (AUC), which may have some problems, but is nevertheless concrete and quantifiable. By contrast, the "interpretability" aspect of the goal is exactly the opposite: very heuristic, hand-wavy, and imprecise. In particular, the arguments that the authors make for interpretability are extremely dated, which is very unfortunate.

In particular, I feel that the authors have relied too much on heuristics of QCD, and not on the concrete calculations, numerical predictions, systematically-improvable approximations, etc., that make QCD the robust theory it is. The interpretations of quark and gluon jets and of observables that are useful for discrimination read like that part of the paper was written in 2010. Specifically, in jet physics there have been significant advances in calculational ability, level of precision, and sophistication of techniques over the past 15 years. As a concrete example, in the context of the symbolic regression study in Section 5, the authors train a network to output an appropriately optimal form of observables, given particular input high-level observables, mathematical functions and operations. Again, this approach is rather novel and could be very informative and intriguing.

However, the set of observables and the mathematical functions that the machine can manipulate frankly don't make much sense. First, the high-level observables are a rather ad-hoc set of things that work (more on this later), but with no guiding principle for their inclusion. Second, a particular function that apparently does much of the heavy lifting of the regression is the hyperbolic tangent. I must admit, I can't think of many (or perhaps any?) results from calculations in QCD in which a hyperbolic tangent appears. As such, I have no interpretation for how such a function would arise in QCD, and correspondingly don't understand what the result is supposed to mean.

The authors could have included knowledge from QCD in parallel with the machine learning studies to find a useful and interpretable observable through symbolic regression. For example, for many of the high-level observables considered by the authors, the functional form of the distribution of those observables is known. Further, in many cases, these functional forms are known at a higher accuracy in perturbation theory than the parton shower event simulation employed in this work. The authors could have incorporated this QCD domain knowledge into the regression task, which would be significantly more enlightening as to where these functions could come from. As just three examples:

1) The distribution of particle multiplicity in a jet is well-known to closely follow KNO scaling [1,2]. Further, it is also well-known that the negative binomial distribution describes the multiplicity distribution well, see, e.g., [3].

2) The observable S_PID, which measures the entropy of particle identification, is closely related to the jet charge [4]. Pions are dominantly produced in a jet, and the three pions have distinct electric charges. The distribution of the jet charge is known to be a Gaussian conditioned on multiplicity, by the central limit theorem [5].

3) At leading-logarithmic accuracy, the distribution of IRC safe observables like the energy correlation functions or the jet width take the Sudakov form of an exponential function with a double logarithmic argument.

In each of these examples, the particular function that appears in the distribution has a physical origin, meaning and interpretation. The hyperbolic tangent used in this paper lacks that sort of richness and connection to the physics of QCD. Is there a reason that none of this information is used?
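For reference, the functional forms invoked in examples 1) and 3) take, schematically (standard results; overall normalizations and O(1) constants are omitted):

```latex
% KNO scaling and the negative binomial form of the multiplicity
% distribution (mean <n>, parameter k), and the leading-log Sudakov
% form for an IRC-safe observable v, with c an O(1) constant:
\begin{align}
  P_n \simeq \frac{1}{\langle n\rangle}\,
      \psi\!\left(\frac{n}{\langle n\rangle}\right),
  \qquad
  P_n = \frac{\Gamma(n+k)}{\Gamma(n+1)\,\Gamma(k)}\,
      \frac{(\langle n\rangle/k)^n}{(1+\langle n\rangle/k)^{n+k}}, \\
  \Sigma(v) \sim \exp\!\left[-\,c\,\frac{\alpha_s C_i}{2\pi}\,
      \ln^2\frac{1}{v}\right],
  \qquad C_i = C_F~\text{(quark)},\; C_A~\text{(gluon)}.
\end{align}
```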

More specific issues follow. I want to emphasize that this list is not exhaustive; sometimes I only identify the first time something is used or stated.

1) In the introduction, the authors mention some subtleties of quark versus gluon tagging, such as ambiguity beyond leading order. However, in Sec. 2, without any discussion, they apparently use "Monte Carlo" definitions of quark and gluon jets, as the output of a parton shower event simulator, according to the selected initiating process. This is not satisfactory, especially given significant work in defining quark and gluon jets, or on jet flavor definitions, etc., over the past decade or more. The authors are welcome to define quarks and gluons in this Monte Carlo way, but their results are then not robust, nor do they have any physical meaning.

The approach should instead be the opposite. The authors need to provide a robust definition of quark and gluon jets in some physical way, e.g., in a semi-supervised or even unlabeled event context, like different populations of different jet types at different pseudorapidities. Then, they could establish the features that a machine learns and exploits in the different populations to classify them as "quark" and "gluon". As it stands, this approach only teaches readers about what Pythia does, which is not Nature.

2) Why the observables in Eq. 8? There is a strange disconnect with the structure of ParticleNet (or its Lite variant), a general, extremely expressive architecture that can capture subtle and non-trivial correlations between particles. The authors then pick 4 high-level observables that, while each is known to perform decently on the problem of quark versus gluon jets, give no sense that their combination is "better". Instead, one should approach the comparison in an analogous way as ParticleNet, with some sufficiently expressive set of high-level observables that are, possibly, IRC safe for robustness and calculability. Such a set of observables would be Energy Flow Polynomials, for example, but many such things could be used. With such a set of high-level observables, their combination and the improvement in performance as the number of observables is increased has some meaning, because more observables enable probing correlations on a smaller scale. Specifically, Energy Flow Polynomials are linear combinations of orthogonal harmonics on n-body phase space. (A sketch of the simplest such polynomials follows below.)
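To make the suggestion concrete, the two simplest Energy Flow Polynomials can be hand-coded in a few lines (my own illustration on a toy jet; a real study would use a systematic basis via a dedicated package):

```python
# Hand-coded examples of the simplest Energy Flow Polynomials (EFPs):
# multiparticle correlators of momentum fractions z_i and pairwise
# angular distances theta_ij. The "jet" below is a synthetic toy.
import numpy as np

def efp_line(z, theta):
    """Single-edge EFP: sum_{i,j} z_i z_j theta_ij (jet-width family)."""
    return np.einsum("i,j,ij->", z, z, theta)

def efp_triangle(z, theta):
    """Triangle EFP: sum_{i,j,k} z_i z_j z_k theta_ij theta_jk theta_ki."""
    return np.einsum("i,j,k,ij,jk,ki->", z, z, z, theta, theta, theta)

rng = np.random.default_rng(3)
pts = rng.exponential(1.0, 30)               # toy constituent transverse momenta
z = pts / pts.sum()                          # momentum fractions
eta, phi = rng.normal(0, 0.2, 30), rng.normal(0, 0.2, 30)
theta = np.sqrt((eta[:, None] - eta) ** 2 + (phi[:, None] - phi) ** 2)

print(efp_line(z, theta), efp_triangle(z, theta))
```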

3) If I understand correctly, the correlations presented in Fig 4 are linear (Pearson) correlations. The authors actually do not define what they mean by "correlation" here, unless I missed it. This correlation is woefully insufficient for drawing conclusions. Non-linear correlations between these observables will be significant, so a much better measure of correlation must be employed. Many measures of non-linear dependence exist (mutual information, etc.). Actually, mutual information is mentioned later, in the study of Shapley values, but still isn't used there. (A toy illustration follows below.)
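A toy illustration of the gap between the two measures (my own, not from the manuscript):

```python
# Toy illustration of why Pearson correlation alone is insufficient:
# y = x^2 on a symmetric domain has essentially zero linear correlation
# with x, yet the two are perfectly dependent.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 10000)
y = x**2                                     # non-linear, deterministic

print("Pearson r:", np.corrcoef(x, y)[0, 1])                            # ~ 0
print("mutual information:", mutual_info_regression(x[:, None], y)[0])  # >> 0
```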

4) The linear relationships expressed in Eq. 10 or Eq. 13 have no physical meaning. Or, rather, if they do, the authors need to mention that clearly and describe to the reader what that meaning is.

5) For all their statements about learning from the machine, the authors try to do a lot of physical interpretation themselves, which is heuristic at best and very outdated in its approach. Heuristics like those discussed below Eq. 15 for motivating this particular observable were common in the field 15 years ago, but with significant theoretical advances in calculations for problems like this, they are not really employed anymore. In any case, I thought the point of this paper was to go beyond such heuristics.

6) Fig 5: I don't really understand what the caption is saying. Both panels show the correlation coefficient with PC1, PC2 and PC3.

7) Again, what is the reader supposed to learn from Eq. 18? The authors need to spell it out clearly.

8) I don't follow Eq. 20. You want to identify non-linear correlations, so you add a term in the loss function with the covariance. The covariance only encodes linear correlations. So how does this solve the problem? Can't this term vanish for observables that are perfectly correlated, but in a non-linear way?
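Indeed, a one-line check of this concern: for any z symmetric about zero, z and z² are perfectly dependent yet exactly uncorrelated,

```latex
\mathrm{Cov}\!\left(z, z^2\right)
  = \mathbb{E}\!\left[z^3\right] - \mathbb{E}[z]\,\mathbb{E}\!\left[z^2\right]
  = 0 .
```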

9) I don't understand the blue-red gradient in Figure 8. What does "Feature Value" mean, and are "High" and "Low" quantifiable?

10) The authors make many, many statements without any quantifiable evidence. For example, the middle paragraph on page 12, in which they interpret what is happening with wpf, is imprecise. Also, in the last full paragraph on page 12, the authors state that these observables are correlated, but provide no evidence of non-linear correlation, like mutual information. How are these observables correlated, quantifiably?

11) With all of the additional interpretation and decorrelation that the authors seem to need to do, I fail to see the utility of Shapley values to this problem. The authors are only applying this analysis to collections of 6 observables, and they find odd results. So, they then work to reduce correlations between observables by hand. How could this ever generalize to much higher dimensionality?

12) Page 14 in "Setup and method": the authors state that they select observables based on "performance and interpretability". What is "interpretability"? Is it quantified?

13) I guess I don't understand what Delta C in Eq. 21 means. Can the authors provide more discussion about this? It is a bit troubling that it seems to explicitly depend on binning. Is this true? I am especially confused by Fig 10 and Table 3. All of the expressions listed in Table 3 are monotonically related, so the discrimination performance is unchanged from simply measuring npf alone. (I also don't understand the need for Ref 84 for this point; this is a consequence of the original Neyman-Pearson proof.) How can the "calibration" Delta C change? In all of these cases, you just measure npf and then put that value into a formula. If anything, the expression at complexity 9, which has low Delta C, is impossibly uninterpretable. (A quick numerical check of the monotonicity point follows below.)
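A quick numerical check of the monotonicity point (my own; the multiplicity distributions below are synthetic):

```python
# Numerical check that monotone maps of a score leave AUC unchanged:
# the AUC for n_pf and for any monotone formula of n_pf coincide.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)
n_pf = np.concatenate([rng.poisson(15, 5000), rng.poisson(25, 5000)])
y = np.concatenate([np.ones(5000), np.zeros(5000)])   # quark = 1

s = -n_pf.astype(float)                        # quarks have lower multiplicity
print(roc_auc_score(y, s))                     # bare n_pf
print(roc_auc_score(y, np.tanh(0.1 * s) + 3))  # monotone remap: identical AUC
```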

14) The expressions in Table 4 are not interpretable in any colloquial sense. I learn no physics by staring at these equations, and the particular forms are highly specialized to the mathematical functions that are allowed. Table 5 and Equation 31 are even less interpretable. The authors should provide a quantifiable "interpretation" of this equation. On page 19 they say that Eq. 31 is interpretable, but provide no interpretation. Why not? How can it be interpreted? The authors present no theoretical calculations of observables in perturbation theory (or even in simplified models of QCD), which I would think would be the minimal baseline for true interpretability.

15) In conclusions the authors state: "Beyond confirming the established observables, our analysis suggests new, refined combinations of features that are not immediately obvious from theory." What does this mean? The authors provide no theory analysis, so the reader has nothing to compare to. QCD theory is not heuristics.

16) I must admit, I don't know what the point of the appendices is. What is the reader supposed to learn from them?

References:

[1] A. M. Polyakov, A Similarity hypothesis in the strong interactions. 1. Multiple hadron production in e+ e- annihilation, Zh. Eksp. Teor. Fiz. 59 (1970) 542–552.

[2] Z. Koba, H. B. Nielsen, and P. Olesen, Scaling of multiplicity distributions in high-energy hadron collisions, Nucl. Phys. B 40 (1972) 317–334.

[3] P. Carruthers and C. C. Shih, Correlations and Fluctuations in Hadronic Multiplicity Distributions: The Meaning of KNO Scaling, Phys. Lett. B 127 (1983) 242–250.

[4] R. D. Field and R. P. Feynman, A Parametrization of the Properties of Quark Jets, Nucl. Phys. B 136 (1978) 1.

[5] Z.-B. Kang, A. J. Larkoski, and J. Yang, Towards a Nonperturbative Formulation of the Jet Charge, Phys. Rev. Lett. 130 (2023), no. 15, 151901, [arXiv:2301.09649].

Recommendation

Reject

  • validity: low
  • significance: poor
  • originality: low
  • clarity: poor
  • formatting: reasonable
  • grammar: good
