SciPost Submission Page
Inferring correlated distributions: boosted top jets
by Ezequiel Alvarez, Manuel Szewc, Alejandro Szynkman, Santiago Tanco, Tatiana Tarutina
This is not the latest submitted version.
Submission summary
| Submission information | |
|---|---|
| Authors (as registered SciPost users): | Manuel Szewc · Santiago Tanco · Tatiana Tarutina |
| Preprint Link: | scipost_202505_00054v1 (pdf) |
| Code repository: | https://github.com/ttarutina/BoostedTopJets |
| Date submitted: | May 26, 2025, 4:48 p.m. |
| Submitted by: | Tatiana Tarutina |
| Submitted to: | SciPost Physics Core |

| Ontological classification | |
|---|---|
| Academic field: | Physics |
| Specialties: | |
| Approach: | Phenomenological |
Abstract
Improving the understanding of signal and background distributions in the signal region is key to enhancing any analysis in collider physics. This is usually a difficult task because, among other reasons, signal and background are hard to discriminate in the signal region, simulations may reach the limit of their reliability when they need to model non-perturbative QCD, and distributions are multi-dimensional and often correlated within each class. Bayesian density estimation is a technique that leverages prior knowledge and data correlations to effectively extract information from data in the signal region. In this work we extend previous work on data-driven mixture models for meaningful unsupervised signal extraction in collider physics to incorporate correlations between features. Using a standard dataset of top and QCD jets, we show how simulators, despite their expected bias, can be used to inject sufficient inductive nuance into an inference model through the priors, which are then corrected by data to estimate the true correlated distributions of the features within each class. We compare the model with and without correlations to show how the signal extraction is sensitive to their inclusion, and we quantify the improvement due to the inclusion of correlations using both supervised and unsupervised metrics.
Reports on this Submission
Report #3 by Anonymous (Referee 2) on 2025-7-30 (Invited Report)
- Cite as: Anonymous, Report on arXiv:scipost_202505_00054v1, delivered 2025-07-30, doi: 10.21468/SciPost.Report.11669
Strengths
2 - The focus on boosted tops is phenomenologically relevant.
3 - The authors show clear gains in classification when correlations are properly modeled.
4 - The approach is relatively interpretable compared to black-box ML methods.
Weaknesses
2 - The method relies on the extraction of the transfer correlation matrices from MC simulations. This means that it is strongly sensitive to improvements in MC modelling.
3 - The paper is accompanied by a GitHub repository with the code. The code is not well documented, and the requirements are not declared anywhere.
Report
The method is valuable, and therefore it is suited for publication in this journal.
However, I have some questions and requests for the authors before publication. These are listed in the next section of the report.
Requested changes
1 - Update the repository, providing a requirement list (just do pip freeze > requirements.txt) to run the code.
2 - Define explicitly what you mean by the $N_{clust}$ distribution of a jet (is this the particle multiplicity, or something different?).
3 - Elaborate more on the prior assumption $\pi_1 = 0.3$. What is the minimum value of $\pi_1$ that does not affect the inference performance too much? Are these values relevant for phenomenology? Is this affected by the value of $\Sigma$?
4 - Explain how the model changes in case the true values of the parameters (in this case $\pi_1$) are not available.
Recommendation
Ask for minor revision
Report #2 by Anonymous (Referee 2) on 2025-7-30 (Invited Report)
- Cite as: Anonymous, Report on arXiv:scipost_202505_00054v1, delivered 2025-07-30, doi: 10.21468/SciPost.Report.11667
Strengths
1 - The presentation of the correlated mixture model is mathematically sound and clearly motivated. The use of EM and the correction for simulation mismatch are well explained.
2 - The focus on boosted tops is timely and relevant.
3 - Interpretability: the approach is relatively interpretable compared to black-box ML methods.
Weaknesses
1 - The paper refers to a GitHub repo, but the code is not well documented. I tried to run the code myself in a fresh environment, but the instructions are incomplete; essentially, no dependencies are indicated.
2 - The assumption that the transfer correlation matrices extracted from MC simulations are exact is phenomenologically relevant. The phenomenological impact of such analyses relies on improvements in MC event generator modelling.
Report
The method is valid and the case study is relevant. Thus, the paper is suitable for publication in this journal, but only after some changes are applied.
Requested changes
1 - Explicitly specify in the text what you mean by the $N_{clust}$ distribution of the jet (is that particle multiplicity?).
2 - Provide requirements in the GitHub repository to correctly run the code (e.g. just do "pip freeze > requirements.txt"). It would be nice to provide a Docker image with a working installation of the code, but this is optional.
3 - It would be interesting to elaborate more on the minimum value of the prior $\pi_1$ that does not compromise the inference performance, for example for the loose-priors scenario. Elaborate more on the potential phenomenological impact of the assumption made in the paper.
Recommendation
Ask for minor revision
We thank the reviewer for the very thorough report. We address the comments below and in the main text.
-"explicitly specify in the text what do you mean by Nclust distribution of the jet (is that particle multiplicity?) ."
We mean the number of clusters found by re-clustering the constituents of the jet with a smaller radius. We have included this definition in the introduction.
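For concreteness, the following is a minimal sketch of such a re-clustering definition, assuming the pyjet package and jet constituents stored as a structured NumPy array; the radius, clustering algorithm, and $p_T$ threshold are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np
from pyjet import cluster, DTYPE_PTEPM

def n_clust(constituents, small_R=0.2, ptmin=1.0):
    """Count the clusters obtained by re-clustering the constituents of a
    large-R jet with a smaller radius (anti-kt, p=-1); values are illustrative."""
    sequence = cluster(constituents, R=small_R, p=-1)
    return len(sequence.inclusive_jets(ptmin=ptmin))

# Toy example: three constituents spread in (eta, phi).
consts = np.zeros(3, dtype=DTYPE_PTEPM)
consts['pT'] = [50.0, 30.0, 20.0]
consts['eta'] = [0.0, 0.3, -0.6]
consts['phi'] = [0.0, 0.5, -0.7]
print(n_clust(consts))  # number of small-R clusters inside the jet
```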
-"provide requirements in the GitHub repository to correctly run the code (e.g. just do "pip freeze > requirements.txt"). It would be nice to provide a docker image with a working installation for the code, but this is optional. "
We updated the repository, providing a requirement list as requested by the referee.
-"It would be interesting to argument more on the minimum value of prior π1 that doesn't compromise the inference performances, for example for the loose priors scenario. Elaborate more on the potential phenomenological impact of the assumption done in the paper."
The referee is correct that, for benchmarking purposes, we have settled on a somewhat arbitrary $\pi_1$, together with a uniform prior on $\pi_1$ that is independent of $\Sigma$, which is only applied to the feature priors. This is a somewhat pessimistic choice: in practice we expect to possess fairly large and enriched samples of top jets, since we can use the same $t\bar t$ samples used to calibrate current state-of-the-art top taggers. We have deliberately chosen an unbalanced, more background-enriched sample in order not to make the problem too easy in that sense. Because of this, we have not explored the smallest $\pi_1$ for which the inference still performs well. We have expanded on this before the start of Section 3.1.1.
Report #1 by Anonymous (Referee 1) on 2025-7-26 (Invited Report)
- Cite as: Anonymous, Report on arXiv:scipost_202505_00054v1, delivered 2025-07-26, doi: 10.21468/SciPost.Report.11643
Strengths
1- The explanation of the Bayesian framework is clearly written and includes a detailed discussion of the logical steps needed to include correlations in a multinomial mixture model.
2- The inference process with accounted correlations outperforms the assumed conditional independence counterpart, demonstrating the capabilities of the method.
Weaknesses
1- The proposed method assumes that the transfer correlation matrices extracted from simulations are exact, which might not hold, or might not be a negligible approximation, in precision measurements.
2- Dependence on the prior distribution used for $\alpha^k$ and $\beta^k$ is significant, therefore posing the question of how to choose the correct prior in a real inference scenario.
Report
The method is applied to the inference of the fraction of boosted top jets in a mixed sample using two categorical variables: the number of clusters $N_\text{clus}$ and the binned mass of the jet.
The method is valuable and is well-suited for publication in this journal. I have only questions and minor concerns about the practical choices required to perform the inference process. Please see the requested changes section.
Requested changes
1- Can the authors provide insights on how to choose the prior scale $\Sigma$ when true values of the parameters are not available?
2- The validation of the inferred posteriors is limited to the absolute distance between inferred and true parameters and statistical difference as measured by the Kullback-Leibler divergence and the mutual information. Would it be possible to provide a more rigorous statistical analysis of the agreement between the inferred and the true posterior?
3- Fig.8 shows that, even with a loose prior and correlations, the correct probability of top jets is recovered only for a large number of events. Do the authors understand the source of the residual discrepancy at convergence?
Minor comments:
4- On pg. 6: have to labeled -> have to label/labeled
5- Fig. 7, right-most column: if I understand correctly, the class probability is constrained to $[0,0.5]$. The authors could consider changing the scale on the x-axis to better visualize the distribution in the proximity of the true value.
Recommendation
Ask for minor revision
We thank the reviewer for the very thorough report. We address the comments below and in the main text.
-"Can the authors provide insights on how to choose the prior scale $\Sigma$ when true values of the parameters are not available?"
The referee is correct that such a discussion was missing and is necessary for data-driven applications. We have included an additional discussion on the choice of $\Sigma$ below Eq. 30, where we detail how a data-driven, even if subjective, choice can be made through prior predictive checks, or how a marginalization over possible values of $\Sigma$ could be performed, introducing an additional source of uncertainty.
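To illustrate the kind of prior predictive check referred to here, the sketch below is a simplified stand-alone example, not the exact construction of the paper: the Dirichlet parameterization, the way $\Sigma$ sets the concentration around a simulation-based template, and the value $\pi_1 = 0.3$ are assumptions made for the example. It draws feature probabilities from priors of different widths and compares the predicted marginal distribution of a binned feature with an observed one.

```python
import numpy as np

rng = np.random.default_rng(0)

def prior_predictive_marginal(template, sigma, pi1, n_events=10_000, n_draws=200):
    """Draw binned-feature probabilities from Dirichlet priors centred on a
    simulation template, with concentration ~ 1/sigma**2 (illustrative choice),
    and return the average predicted marginal over the two-class mixture."""
    concentration = np.maximum(template / max(sigma, 1e-6) ** 2, 1e-3)
    marginals = []
    for _ in range(n_draws):
        p_sig = rng.dirichlet(concentration[0])          # signal class
        p_bkg = rng.dirichlet(concentration[1])          # background class
        p_mix = pi1 * p_sig + (1 - pi1) * p_bkg
        p_mix = p_mix / p_mix.sum()                      # guard against round-off
        marginals.append(rng.multinomial(n_events, p_mix) / n_events)
    return np.mean(marginals, axis=0)

# Illustrative simulation-based templates for a feature with 5 bins.
template = np.array([[0.05, 0.15, 0.40, 0.30, 0.10],    # "signal"
                     [0.40, 0.30, 0.15, 0.10, 0.05]])   # "background"
observed = np.array([0.32, 0.27, 0.20, 0.14, 0.07])     # toy data marginal

for sigma in (0.05, 0.2, 0.5):
    pred = prior_predictive_marginal(template, sigma, pi1=0.3)
    print(f"sigma={sigma:4.2f}  max |pred - obs| = {np.abs(pred - observed).max():.3f}")
```

Scanning such a summary over candidate values of the scale is one (subjective but data-driven) way to judge which priors are broad enough to cover the observed marginals without becoming uninformative.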
-"The validation of the inferred posteriors is limited to the absolute distance between inferred and true parameters and statistical difference as measured by the Kullback-Leibler divergence and the mutual information. Would it be possible to provide a more rigorous statistical analysis of the agreement between the inferred and the true posterior?"
We agree with the referee that the validation metrics considered in this work are not rigorous in the sense that a confidence interval is. However, we believe that a more involved analysis, which would quantify the agreement between the inferred posterior and the data distribution via credible intervals or posterior predictive checks, is not necessary to assess the usefulness of including correlations and would add unnecessary computational cost. Future implementations may nevertheless require such metrics. We have expanded Section 2.3 to include this discussion.
-"Fig.8 shows that, even with a loose prior and correlations, the correct probability of top jets is recovered only for a large number of events. Do the authors understand the source of the residual discrepancy at convergence?"
The source of the discrepancy is the residual bias from a still too strong prior. We have expanded on this point in Section 3.1.3.
-"Minor comments: 4- on pg.6: have to labeled -> have to label/labeled"
We thank the referee for pointing out these typos and have corrected them.
-"Fig.7, right-most column: if I understand correctly, the class probability is constrained in [0,0.5]. The authors could consider changing the scale on the x-axis to visualize better the distribution in the proximity of the true value"
The reviewer is correct: $\pi_{1}$ is constrained to $[0,0.5]$. We have changed the scale of Figs. 5 and 7 accordingly, and added a clarifying comment in Section 3.1.1.

Author: Tatiana Tarutina on 2025-09-08 [id 5795]
(in reply to Report 3 on 2025-07-30)
We thank the reviewer for the very thorough report. We address the comments below and in the main text.
-"Update the repository, providing a requirement list (just do pip freeze $\&$gt; requirements.txt) to run the code."
We updated the repository, providing a requirement list as requested by the referee.
-"Define explicitly what do you mean by the $N_{\mathrm{clust}}$ distribution of a jet (is this the particle multiplicity, or something different?)."
We mean the number of clusters found by re-clustering the constituents of the jet with a smaller radius. We have included this definition in the introduction.
-"Elaborate more on the prior assumption $\pi_1=0.3$. What's the minimum value for $\pi_1$ to do not affect too much the inference performance? Are these values relevant for phenomenology? Is this affected by the value of $\Sigma$?"
The referee is correct that, for benchmarking purposes, we have settled on a somewhat arbitrary $\pi_1$, together with a uniform prior on $\pi_1$ that is independent of $\Sigma$, which is only applied to the feature priors. This is a somewhat pessimistic choice: in practice we expect to possess fairly large and enriched samples of top jets, since we can use the same $t\bar t$ samples used to calibrate current state-of-the-art top taggers. We have deliberately chosen an unbalanced, more background-enriched sample in order not to make the problem too easy in that sense. Because of this, we have not explored the smallest $\pi_1$ for which the inference still performs well. We have expanded on this before the start of Section 3.1.1.
-"Explain how the model change in case true values of the parameters (in this case $\pi_1$) are not available."
Since we perform a posterior inference with unlabelled data, the true values of the parameters are never used in the model. Thus the method and its performance would not change if they were not available. What would change is the evaluation, since only the purely data-driven metrics could be used to evaluate the model, such as the KL divergence between the full posterior and data distributions.
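As an illustration of such a data-driven metric, here is a minimal sketch that compares the binned data distribution with the mixture distribution implied by a set of inferred parameters via the KL divergence; the binning, the toy numbers, and the way the mixture is assembled are assumptions for the example rather than the paper's exact procedure.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions defined over the same bins."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def model_vs_data_kl(counts, pi1, p_sig, p_bkg):
    """Compare the observed (binned) data distribution with the mixture
    distribution implied by the inferred parameters."""
    data = counts / counts.sum()
    model = pi1 * p_sig + (1 - pi1) * p_bkg
    return kl_divergence(data, model)

# Toy example with a 5-bin feature and illustrative inferred parameters.
counts = np.array([310, 280, 200, 140, 70])
p_sig  = np.array([0.05, 0.15, 0.40, 0.30, 0.10])
p_bkg  = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
print(model_vs_data_kl(counts, pi1=0.3, p_sig=p_sig, p_bkg=p_bkg))
```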