SciPost Submission Page
Inferring flavor mixtures in multijet events
by Ezequiel Alvarez, Yuling Yao
Submission summary
Authors (as registered SciPost users):  Ezequiel Alvarez 
Submission information  

Preprint Link:  https://arxiv.org/abs/2404.01387v2 (pdf) 
Code repository:  https://github.com/sequi76/Inferringflavormixturesinmultijetevents 
Date submitted:  20240429 03:15 
Submitted by:  Alvarez, Ezequiel 
Submitted to:  SciPost Physics 
Ontological classification  

Academic field:  Physics 
Specialties: 

Approach:  Phenomenological 
Abstract
Multijet events with heavyflavors are of central importance at the LHC since many relevant processes  such as $t\bar t$, $hh$, $t\bar t h$ and others  have a preferred branching ratio for this final state. Current techniques for tackling these processes use hardassignment selections through $b$tagging working points, and suffer from systematic uncertainties because of the difficulties in Monte Carlo simulations. We develop a flexible Bayesian mixture model approach to simultaneously infer $b$tagging score distributions and the flavor mixture composition in the dataset. We model multidimensional jet events, and to enhance estimation efficiency, we design structured priors that leverages the continuity and unimodality of the $b$tagging score distributions. Remarkably, our method eliminates the need for a parametric assumption and is robust against model misspecification  It works for arbitrarily flexible continuous curves and is better if they are unimodal. We have run a toy inferential process with signal $bbbb$ and backgrounds $bbcc$ and $cccc$, and we find that with a few hundred events we can recover the true mixture fractions of the signal and backgrounds, as well as the true $b$tagging score distribution curves, despite their arbitrariness and nonparametric shapes. We discuss prospects for taking these findings into a realistic scenario in a physics analysis. The presented results could be a starting point for a different and novel kind of analysis in multijet events, with a scope competitive with current stateoftheart analyses. We also discuss the possibility of using these results in general cases of signals and backgrounds with approximately known continuous distributions and/or expected unimodality.
Author indications on fulfilling journal expectations
 Provide a novel and synergetic link between different research areas.
 Open a new pathway in an existing or a new research direction, with clear potential for multipronged followup work
 Detail a groundbreaking theoretical/experimental/computational discovery
 Present a breakthrough on a previouslyidentified and longstanding research stumbling block
Current status:
Reports on this Submission
Strengths
1. Well written and presented
2. Toy application of a less used Bayesian statistical method to a HEP problem
Weaknesses
1. A bit too much assumption of familiarity with statistical issues and distributions for a physics audience
2. Lack of clarity on generation and inference methodology, and on which physics information makes the problem tractable
3. Does not go beyond toy studies based on sampling from 1D curves which could have come from any topic area: the coupling to HEP is rather shallow (other than in limiting the multivariate mixtures to assume pair production of heavy flavour)
Report
A nice demonstration of Bayesian mixture methods to toy problems of taggingscore distribution inference, using four different statistical models. Convergence is shown to require additional physics information such as correlations between multiple jets in an event (this is in need of some clarification) or knowledge of e.g. smoothness and unimodality of the posterior distributions. The amount of physics in the paper is rather small, being more a context for statistical application than central to the conclusions as currently written. No physics simulation is explicitly used, instead simply sampling from provided 1D distributions, and inferring them as a closure test. On the basis that the connection between the physics content and the statistics content is rather small, I think this does not fulfil the SciPost Physics criteria of novelty and synergy between the two fields, though it is a useful demonstration of statistical methods to a physics audience: Physics Core is more appropriate on the declared conditions. It is well written and presented, and a solid contribution to the literature, but does need clearer explanation and perhaps deeper study of the additional information that makes inference possible in the 4jet case.
 In abstract and introduction, it would be helpful to clarify that "heavy flavour" in this work refers to bjets and cjets but not top quarks  which are heavy flavour but kinematically distinct and cannot usually be associated to a single jet.
 "Statically" > "Statistically" at the end of p2
 Subsections 2.1.x: it would be useful to have more explanation of the distributions' features and motivations, for accessibility to physicists, and more motivation/discussion of the 4 options' pros/cons from a physics point of view
 Sec 3 start: "In this Section" > "In this section" (not used as a proper noun here)
 Sec 3.1: it would be good to note that the identification of a hadronic jet with a specific parton flavour is not a complete picture  as that is equating a coloursinglet object with a nonsinglet and the QCD radiation structure of jets emerges from colour dipoles between partons  but that it has served well as an approximation, particularly for heavyquark tagging and to a much lesser extent for quark/gluon discrimination
 Sec 3.1: LHC collaborations are also increasingly using continuous or pseudocontinuous btagging, in which scores are used directly or in differential binnings rather than below/above thresholds for fixing working points. This introduces challenges of calibration, but of course provides more nuanced information for analyses.
 p6: "label switching along the inference process" needs some clarification: this whole sentence seems a bit cryptic to a typical physicist
 Sec 3.2: an explicit statement of the computing setup used for these modelling studies would be useful. Stan has been implied but is not explicitly stated, with references to e.g. the ordered vector use appearing in "computing font" but without enough context to understand it. Equally, samples sizes, etc. would be useful to have stated. (I see the computing is specified in Appendix A. Either bring it forward into the main text  I think think would be best  or explicitly instruct the reader to see App A for details and avoid having the main text refer to features that require the computing context.
 Sec 3.2: why are these choices of GP hyperparams the appropriate/natural ones?
 Sec 3.2, p8: how is it _known_ that the reason for nonimprovement in the GP case is "nonidentifiability of the problem"? And what exactly does this jargon mean... that the absence of other variables with which to correlate means there is not enough information to unmix the distributions?
 p8: the b distribution seems to have semiconverged, but in the opposite direction of movement from the prior than what was required. (The Dirichlet case also has a bit of this  less obviously converged, but with similar features.) What's the reason for that? Can this be sort of mismatch be predicted from the shapes of mixture functions that are being closuretested?
 p8: unimodality does not seem fully guaranteed: both the b and c distributions in Fig 1 have inflections on their tails that come close to, or marginally drift into, multimodality. In general is there a necessity that tagging score distributions will be unimodal when transported to a process other than that where they were trained/calibrated? For example, calibrations from ttbar events might well generate additional structures when applied to jets in events with more tops.
 p8: it would be good to float Fig 2 to the [b]ottom of the page, to avoid the bit of text on model 3 getting "trapped" there, and easily missed by the reader
 p8/9: it's claimed that the unimodal mixture model posterior "approaches the true values" by comparison to the priors (which it does not use). It does seem closer than the priorguided models, but still shows an excess near the c peak and a deficit on its RHS, similar to the priors  which is presumably a coincidence. As the priors aren't shown on the central plot, it's hard to exactly compare these, but I think a bit more discussion of the "approach" is needed to soften the conclusion, as it may be doing better but seems to be converging to something significantly far from the truth in several bins near the peak. The blue bjet distribution is much more poorly converged and described and this should be mentioned. The pointestimate model has _very_ similar features on the c distribution, which deserves some discussion as it's too close to be coincidence, and now the bjet distribution is better converged (on to one of the two implied fits from the unimodal posteriors) but is not accurate other than on the position of the peak. Basically, there's a lot more going on here, with interesting grouping of models, and it's too glib just to say that the last two models have "a fair approach to the two curves"  I think that's an overstatement of the level of improvement, particularly for the btag distribution.
 Sec 3.3: the generation of the 4jet data is not made clear. I can understand that physics process types could induce interesting correlations containing extra information, but here I _think_ the toy model just involves sampling again from the two GN1 curves: is this correct? So what structure (other than normalisation) is there to be exploited by the statistical model that doesn't just factorise into 4 x 1D problems? This may seem obvious to the authors, but I suspect not for many readers. Is there any coherent ordering of the jets, i.e. which scores go in which tuple entries, e.g. corresponding to pT ordering of the jets? Or is that random, in which correlations would presumably be eliminated?
 It's also not clear to me what are the relative normalisations of the two models (b and c) in either of the toy studies: depending on the signal channel of interest, and its main backgrounds in the signal region, the b and c rates could be enormously different... and this would presumably inject some physics into the mixture inference?
 Sec 3.3: Figs 36 _need_ to be brought forward to p10 onwards  it is extremely difficult to read and understand when the text discusses figures located 68 pages away.
 p10: The faraway Figs 710 also need to be moved closer. The text here discusses them in turn, so maybe they would work best inline, between each paragraph of discussion text.
 Discussion: As earlier, it's unclear where the extra information is coming from in the 4jet case. Either there are correlations in the toydata sampling which need more explanation, or is the extra information the restriction to even numbers of bjets? This would be a very good thing to be clear about, as it's key to understanding what information can be leveraged to make the mixture inference tractable. This is also key in the final paragraph of the discussion, which seems to be proposing a mechanism for inferring true distributions of observables  which of course is the whole aim of differential measurements and unfolding  but where the "multidimensionality" discussed is, I think, of a different character from that used here. (I am not convinced by this paragraph, since of course extraction of distributions this way has a lot of unconsidered prior art, but it is possible that Bayesian mixtures could add something in concert with established methods.)
 I expected there to be a nontoy study, e.g. using MC generation for a 4t model and seeing how the methods perform on that. I suppose obtaining some approximation to a btag score for real MC was too difficult, but it would have been interesting. Without physics input, the paper seems to "just" be a demonstration of Bayesian mixtureinference closure on two distributions that happen to have come from btagging, with a little physics content in the evenNb restriction of the mixtures in the 4jet case. I think this is fine, but it's good to be clear about the extent to which there is physics content in this application of inference.
 There are quite a few formatting issues in the references, e.g.:
 [1], [2], [4], [7], [18], [23]... capitalisation issues in ATLAS, LHC, QCD, whole title, ABCD, etc.
 [9] Different authorname formats from the rest
 [23] T. A. Collaboration, [24] S. D. Team, ... should be proper collaboration names
 [20], [21], [22]: missing proper experimental collaboration names
 [21] s > \sqrt{s} etc. and generally fix title formatting
Requested changes
See report
Recommendation
Accept in alternative Journal (see Report)
Strengths
1. It is always interesting to see a new idea to solve longstanding challenges in LHC physics;
2. Combining distributions and fractions for several jets at a time might be useful for wellseparated jets;
3. The papers comes with code, so these methods can be tested.
Weaknesses
1. I am missing a welldefined case for a combined analysis of many jets, this is assumed, but the reality is much more complicated than even numbers of heavy flavors;
2.
Report
I generally like new ideas for longstanding problems, and the paper presents a new, global approach to flavor tagging. Learning fractions and distributions at the same time is also not unrealistic, and the paper nicely illustrates how the problem can be tackled using a set of standard numerical approaches. In that sense, the paper is interesting and could eventually be published.
However, I see a series of shortcomings:
1. It would be nice to have an actual use case, the reality of jet tagging in events is much more complicated than an even number of heavy flavors;
2. Following a proper use case, the results could also be presented in a physics context. For a journal like SciPost I do not consider pure toy applications without physics interpretation appropriate;
3. In Sec.2 the four different approaches are defined, but it is not clear how they are selected and why these ansatzes have been chosen, please clarify and maybe also specify expected strengths and weaknesses motivating this selection;
3. The rest of the paper is very textheavy and hard to read. It would help to maybe split up the figures and align them with the text, so one does not always have to flip to the back of the paper.
4. I am not sure if I am missing something, but it would be nice to discuss in more detail the impact of the prior and how this Bayesian approach works in Sec.3;
5. Along the same lines, is the Bayesian approach guaranteed to converge or lead to a stable result? It is not clear to me if I should deduce this from the shown results;
6. So what is the bottom line? Does the method work? Better than what? Or does it need additional work? The paper leaves the reader a little in the dark.
Altogether, I think the paper needs a more realistic use case/example/bottom line to be really helpful as a physics paper. The idea is very exciting, though, and I am looking forward to some more quantitative results.
Recommendation
Ask for major revision
Strengths
1. The paper presents a novel statistical approach for going beyond fixed workingpoint jet flavour taggers when analyzing multicomponent and multijet event datasets.
2. The ideas presented in this work have multiple potential generalizations and applications to other similar problems in high energy particle phenomenology.
Weaknesses
1. The methods are tested on toy examples and without direct comparison to existing established approaches. Consequently, their performance and viability in realistic scenarios is difficult to evaluate.
Report
In the submitted manuscript, the authors explore a bayesian statistical method to infer mixtures of processes producing events with multiple quarkflavoured hadronic jets without relying on fixed working point jettaggers. The method aims to extract the weights of the process components as well as the jet discrimant performances insitu. It leverages the cooccurrances of different jet flavour combinations, the knowledge of possible truthlevel flavour multiplocities predicted by individual component processes, as well as general assumptions on the smoothness and unmodality of physical jet flavour discriminants. The later are implemented through a choice of different Bayesian priors leading to several distinct statistical models with varying prediction and computational performances.
The presented research is topical and potentially highly relevant for experimental collaborations at the LHC. The approach also has potential to be developed further and applied to other similar datasets using continuous discriminants as part of object/event selection.
However in its present form, the paper fails to convincingly demonstrate the full potential of the method as well as pinpoint all of its potential weaknesses. Thus I would ask the authors to fully address the issues listed below before making my recommendation.
Requested changes
1. The authors should quantify the inference power of their method by comparing it to a more traditional cutbased approach.
2. In the toy example the "truthlevel" fractions of the two process admixtures are both reasonably large. In practice however, one often deals with signal to background ratios which are orders of magnitude below unity (before applying jet flavour discrimination). It is not clear how the proposed method would perform in this kind of scenarios.
3. From the existing discussion it is also not clear how the different models' computational and discriminative performance scales with the dimensionality of the problem, both in terms of the number of jets, number of events and number of admixtures. Thus it is difficult to evaluate its applicability to other, real world scenarios.
4. More specifically on the last point: From figures 3 & 4 it seems that for low number of events (N=100), the posterior distributions of admixture fractions are significantly inconsistent with their true values, thus exhibiting bias and potentially putting the method's robustness in question.
5. In their current approach, the authors choose to bin the otherwise continuous discriminant distribution to build tractable statistical models. It is not clear how essential this simplification is and if there are ways to model the full continuous discriminant response (perhaps using techniques similar to PDF extraction by the NNPDF collaboration).
6. The labeling of true and prior cfractions in Figure 2 are confusing. Why is the later drawn as a horizontal line? Shouldn't it be a (flat?) distribution? Then I would suggest a different visualization, to make this more obvious.
7. The text contains several typographical and grammatical errors as well some possibly incorrect statements. Some examples to be revised a listed below:
a) Second sentence on page 2 is over the top. To be removed or revised.
b) Page 2, line 9: expression "mutual information" is not clear. It may or may not refer to the formal, technical term. To be revised.
c) Page 2, paragraph 3, first line: Acronym HEP is never used. To be removed.
d) Page 9, 2nd paragraph of Sec. 3.3, line 3: the term "hack" is inappropriate in this context. To be revised.
d) Page 9, 3rd paragraph of Sec. 3.3: the last sentence is grammatically incorrect and should be revised.
e) Page 10, last paragraph above Sec. 4: the expression "light statistical analysis" is vague. It should be revised and made more precise.
f) Page 10, last paragraph above Sec. 4. The meaning of the next to last sentence starting with "We find that the Unimodal..." is unclear. It should be revised.
Recommendation
Ask for major revision