
SciPost Submission Page

Inferring correlated distributions: boosted top jets

by Ezequiel Alvarez, Manuel Szewc, Alejandro Szynkman, Santiago Tanco, Tatiana Tarutina

This Submission thread is now published as SciPost Phys. Core 8, 087 (2025)

Submission summary

Authors (as registered SciPost users): Manuel Szewc · Santiago Tanco · Tatiana Tarutina
Submission information
Preprint Link: https://arxiv.org/abs/2505.11438v2
Code repository: https://github.com/ttarutina/BoostedTopJets
Date accepted: Oct. 23, 2025
Date submitted: Sept. 8, 2025, 8:01 p.m.
Submitted by: Tatiana Tarutina
Submitted to: SciPost Physics Core
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Phenomenology
Approach: Phenomenological

Abstract

Improving the understanding of signal and background distributions in signal-region is a valuable key to enhance any analysis in collider physics. This is usually a difficult task because -- among others -- signal and backgrounds are hard to discriminate in signal-region, simulations may reach a limit of reliability if they need to model non-perturbative QCD, and distributions are multi-dimensional and many times may be correlated within each class. Bayesian density estimation is a technique that leverages prior knowledge and data correlations to effectively extract information from data in signal-region. In this work we extend previous works on data-driven mixture models for meaningful unsupervised signal extraction in collider physics to incorporate correlations between features. Using a standard dataset of top and QCD jets, we show how simulators, despite having an expected bias, can be used to inject sufficient inductive nuance into an inference model in terms of priors to then be corrected by data and estimate the true correlated distributions between features within each class. We compare the model with and without correlations to show how the signal extraction is sensitive to their inclusion and we quantify the improvement due to the inclusion of correlations using both supervised and unsupervised metrics.

Author comments upon resubmission

Dear Editor,

We hereby resubmit our paper
"Inferring correlated distributions: boosted top jets"
by
Ezequiel Alvarez, Manuel Szewc, Alejandro Szynkman, Santiago Tanco, Tatiana Tarutina.

Sincerely yours,
the authors

List of changes

  1. Introduction, page 2, line 2 from the bottom added text "...obtained from re-clustering the constituent particles of the jet with a smaller R..."

  2. Section 2.3, page 8, 2nd and 3rd paragraph from above added text "Additionally, we compute the Kullback-Leibler (KL) divergence between the posterior and the data distributions both for each class and for the complete case where no labels are used. The latter case in particular allows us to get a broader picture which does not rely on knowledge of “true” parameters. In all cases, a smaller divergence signals a better agreement. Although the KL divergences provide useful metrics, they are not rigorous statistical comparisons between models. Depending on the access to the true parameters, one could use an array of more involved but powerful techniques. If the true parameters are known, posterior credible intervals could be computed and a coverage study could be performed by bootstrapping the dataset. If one does not wish to rely on access to the true parameters, a posterior predictive check could be performed, which also involves additional samplings under the learned posterior model. Thus, more rigorous tests will in general involve additional, expensive simulations. In this work, we find that the cheaper metrics provide enough information to justify the need for correlations and validate the specific modeling proposed, without the necessity of more expensive computations. However, future implementations may and probably will need to rely on these well-established suite of metrics to ensure that the learned probabilistic model is appropriate"

  3. Section 3.1, page 10, last paragraph added text "Although the choice of Σ is subjective, it is not arbitrary: the prior needs to encode the trust in the Monte Carlo simulations while providing an adequate range of allowed discrepancies. Because in this work we have access to the true parameter values, we can validate the choice of Σ simply by observing that the resulting posterior can be likelihood-driven and centered around these true values for large enough number of events. When applying this method to data where the true values are unknown, a data-driven prior validation should be implemented. One possibility is to perform prior predictive checks [51] to ensure that the chosen Σ produces realistic datasets. Another possibility is to build a hierarchical model where Σ is promoted to a random variable and marginalized over (see Ref. [52], Chapter 5), reducing the prior dependence at the expense of increased model uncertainty. Since the goal of this work is to validate the introduction of correlations, we leave a more detailed prior exploration for future work."

  4. Section 3.1, page 11, 1st paragraph added two new references [51] and [52]

  5. Section 3.1, page 11, last paragraph of this section added text "... , although we have not explicitly explored the lowest sensitivity of the tagger since we expect fairly enriched samples to be available from t¯t production."

  6. Section 3.1.1, page 13, line 4 from the bottom added text "...with the final π1 ∈ [0, 0.5]."

  7. Section 3.1.3, page 16, line 5 of the 1st paragraph of this section added text "For all cases, we observe how if the number of events is too small the prior dominates and biases the inferred probability. The number of events needed to overcome the prior bias depends on the choice of Σ."

  8. Section 3.1.3, page 16, line 3 of the 2nd paragraph of this section added text "... one should note that the tight prior still biases all distributions even for 10^5 events while the loose prior..."

  9. Section 3.1.1, page 13, Figure 5: changed the scale of the x-axis of the right panels

  10. Section 3.1.2, page 15, Figure 7: changed the scale of the x-axis of the right panels

  11. We updated the repository to include a requirements list.
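
The points raised in changes 2, 7 and 8 (KL divergence as a cheap agreement metric, and a tight prior that keeps biasing the result until enough events are available) can be illustrated with a small toy. This is only a sketch under stated assumptions and not the model of the paper: it uses a single binned feature with a conjugate Dirichlet prior centred on a hypothetical "simulation", and the scalar `prior_strength` is a stand-in for the role played by Σ. All names and numbers (`true_probs`, `sim_probs`, the event counts) are invented for illustration, and the direction of the KL divergence is a choice made here.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions defined on the same bins."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0  # bins with p = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))


def posterior_mean(prior_probs, prior_strength, counts):
    """Posterior mean for a Dirichlet(prior_strength * prior_probs) prior
    updated with multinomial bin counts (conjugate update)."""
    alpha = prior_strength * np.asarray(prior_probs, dtype=float)
    posterior_alpha = alpha + np.asarray(counts, dtype=float)
    return posterior_alpha / posterior_alpha.sum()


rng = np.random.default_rng(0)
true_probs = np.array([0.10, 0.20, 0.40, 0.30])  # hypothetical "true" distribution
sim_probs = np.array([0.25, 0.25, 0.25, 0.25])   # hypothetical biased simulation

results = {}
for prior_strength in (10, 10_000):      # loose prior vs tight prior (assumed values)
    for n_events in (100, 100_000):
        counts = rng.multinomial(n_events, true_probs)
        estimate = posterior_mean(sim_probs, prior_strength, counts)
        results[(prior_strength, n_events)] = {
            # "supervised" metric: requires knowing the true distribution
            "kl_true": kl_divergence(true_probs, estimate),
            # "unsupervised" metric: compares against the observed data only
            "kl_data": kl_divergence(counts, estimate),
        }

for (prior_strength, n_events), metrics in results.items():
    print(
        f"prior_strength={prior_strength:>6d}  n_events={n_events:>7d}  "
        f"KL(true||post)={metrics['kl_true']:.2e}  "
        f"KL(data||post)={metrics['kl_data']:.2e}"
    )
```

In this toy, the tight prior keeps the estimate close to the simulation when few events are available, and a residual bias remains even at 10^5 events, whereas the loose prior is quickly overridden by the data. As in the quoted text, a smaller divergence signals better agreement, but these numbers are not a rigorous statistical comparison between models.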

Published as SciPost Phys. Core 8, 087 (2025)
