SciPost Submission Page
What's Anomalous in LHC Jets?
by Thorsten Buss, Barry M. Dillon, Thorben Finke, Michael Krämer, Alessandro Morandini, Alexander Mück, Ivan Oleksiyuk, Tilman Plehn
This is not the latest submitted version.
This Submission thread is now published as
Submission summary
Authors (as registered SciPost users):  Thorsten Buss · Barry Dillon · Thorben Finke · Tilman Plehn 
Submission information  

Preprint Link:  https://arxiv.org/abs/2202.00686v2 (pdf) 
Code repository:  https://github.com/IvanOleksiyuk/jetkmeans 
Date submitted:  2022-02-16 11:05 
Submitted by:  Buss, Thorsten 
Submitted to:  SciPost Physics 
Ontological classification  

Academic field:  Physics 
Abstract
Searches for anomalies are the main motivation for the LHC and define key analysis steps, including triggers. We discuss how LHC anomalies can be defined through probability density estimates, evaluated in a physics space or in an appropriate neural network latent space. We illustrate this for classical k-means clustering, a Dirichlet variational autoencoder, and invertible neural networks. For two especially challenging scenarios of jets from a dark sector, we evaluate the strengths and limitations of each method.
Reports on this Submission
Anonymous Report 2 on 2022-5-31 (Invited Report)
Cite as: Anonymous, Report on arXiv:2202.00686v2, delivered 2022-05-31, doi: 10.21468/SciPost.Report.5154
Report
This paper compares a number of unsupervised machine learning methods for anomaly detection when applied to high energy jets with modified showers. This is a serious study that should be published. The technical details (Sec. 2-5) are solid and I would only have minor comments. However, I am concerned about the content of Sec. 1 and 6, and before I can recommend publication, I think those sections need a significant rewrite. Instead of going line by line, below is a list of some of the main issues with these sections. I would be happy to review a new version of the manuscript for publication in SciPost Physics.
- The LHC is for more than just looking for new physics.
- The authors seem to conflate anomaly detection and unsupervised methods. Said another way, the authors seem to equate anomalies with low density p(x). Not all anomaly detection methods look for events with low p(x). In fact (as is even pointed out), most proposed BSM models result in events that are not the lowest p(x) (which is instead the tail of the SM).
- Related: p(x) is not coordinate invariant. This means that looking for events with low p(x) depends on what coordinates are used to represent x. For a given coordinate system, there is a right answer for which events have low p(x). One anomaly detection method could appear better than another because (a) it is better at estimating p(x) or (b) because it happens to assign (implicitly or explicitly) low p(x) to the target signal(s). In the current paper, only ROC curves are used, so we can't know if one method is better on purpose or "by accident".
As I said, I think the study itself (Sec. 2-5) is a useful and important addition to the literature, but I can't yet recommend publication until Sec. 1 and 6 are significantly modified.
Anonymous Report 1 on 2022-4-3 (Invited Report)
Cite as: Anonymous, Report on arXiv:2202.00686v2, delivered 2022-04-02, doi: 10.21468/SciPost.Report.4847
Strengths
1. Thorough exploration of various anomaly detection strategies at the LHC
2. Clearly written
Weaknesses
1. Anomaly detection methods have okay but not great performance and don't seem to generalize well
2. Proposed method for anomaly detection is just implicit hypothesis testing in a convoluted language
Report
[This submission does not meet the criteria of SciPost Physics, but does meet those of SciPost Physics Core, where it could be published.]
This paper explores various strategies for anomaly detection at the Large Hadron Collider (LHC). Using a case study based on jets produced from "hidden valley" models, the authors perform a test of three different anomaly detection paradigms: k-means clustering, Dirichlet variational autoencoders, and invertible neural networks. The authors conclude that no one anomaly detection scheme is optimal for identifying both benchmark scenarios, and that the choice of architecture and inputs has a large impact on the anomaly detection efficiency.
This manuscript is thorough and clearly written, and with minor changes, it would definitely meet the standards for SciPost Core. In particular, it will help the reader understand the many possibilities and pitfalls in anomaly detection methods, which is a growing research area at the LHC and elsewhere.
I do not, however, believe that the manuscript meets the standards of SciPost. SciPost aims to publish papers with groundbreaking results, whereas this work is an exploration of methods that work okay but not great for anomaly detection and don't seem to generalize particularly well to different signal models. Moreover, the fundamental premise of this research is based on an inexact statement that, when turned into exact equations, reveals that this class of anomaly detection strategies is actually implicit hypothesis testing, as I now explain.
The key point that the authors make in this paper is that what defines an anomaly search at the LHC is finding events which lie in a low-density region of phase space. This statement seems sensible at the level of words, but at the level of equations it needs clarification. Let p(x) be the empirical probability density, estimated from the observed data. Presumably, the authors call an event x0 anomalous if
p(x0) < t
where t is some threshold. This equation is not invariant under coordinate transformations, though, since the left hand side is a density and the right hand side is a scalar. But we can make it invariant by letting u(x) = 1/Vol be the uniform probability density in the x coordinate system, where Vol is the volume of the x space, in which case we have:
p(x0) < t * u(x0) * Vol
This is now a proper equation that we can perform coordinate transformations on and the Jacobian factors will cancel. Rewriting this expression as
1/(t*Vol) < u(x0) / p(x0)
we see that the right hand side is a familiar likelihood ratio, assuming u(x) can be normalized. Via the Neyman-Pearson lemma (and a little algebra to deal with the fact that p(x) is a mixture of signal and background), this defines the optimal observable to identify a signal whose distribution is u(x)! Therefore, low-density-based anomaly detection is asymptotically equivalent to hypothesis testing, where the proposed signal has distribution u(x).
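The "little algebra" for the mixture can be sketched as follows. This is only an illustration under an assumption not spelled out in the report: the data density is taken to be a two-component mixture with signal fraction f, background density p_b(x), and signal density u(x). The symbols f, p_b, and r are introduced here for illustration.

```latex
% Assumption: the data consist of background (density p_b) plus a fraction f of signal (density u).
\begin{align}
  p(x) &= (1-f)\, p_b(x) + f\, u(x), \qquad 0 \le f < 1, \\
  \frac{u(x)}{p(x)} &= \frac{r(x)}{(1-f) + f\, r(x)}, \qquad r(x) \equiv \frac{u(x)}{p_b(x)}, \\
  \frac{d}{dr}\left[\frac{r}{(1-f) + f\, r}\right] &= \frac{1-f}{\left[(1-f) + f\, r\right]^2} > 0 .
\end{align}
```

Since u/p is a monotonically increasing function of the signal-to-background ratio r, a cut on u(x)/p(x) selects the same events as a cut on u(x)/p_b(x) at a correspondingly mapped threshold.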
Using the above logic, it is now clear why such different performance was obtained for the different methods, since depending on the inputs and architectures, the implicit u(x) could be quite different. Perhaps the authors think that an implicit u(x) is superior to hypothesis testing with an explicit signal model, but this reviewer is skeptical of the utility of such methods. For these reasons, I do not think this manuscript meets the high threshold for publication in SciPost.
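The dependence on the choice of coordinates can be illustrated with a toy example. The sketch below assumes a one-dimensional exponential density, which is not one of the datasets in the manuscript, and all function names and the threshold value are hypothetical. It applies the same threshold to the same events in two coordinate systems related by y = 1 - exp(-x).

```python
import math

def density_x(x):
    """Density of an Exp(1) variable in the original coordinate x >= 0 (toy assumption)."""
    return math.exp(-x)

def to_y(x):
    """Change of coordinates y = 1 - exp(-x), i.e. the CDF of Exp(1)."""
    return 1.0 - math.exp(-x)

def density_y(y):
    """Density of the same events in y: p_y(y) = p_x(x) / |dy/dx|, which is 1 on (0, 1)."""
    x = -math.log(1.0 - y)
    jacobian = math.exp(-x)  # dy/dx evaluated at x
    return density_x(x) / jacobian

def is_anomalous(density, threshold):
    """Low-density criterion p(x0) < t from the text."""
    return density < threshold

events = [0.1, 1.0, 5.0]   # the same three events, given in the x coordinate
threshold = 0.05           # arbitrary illustrative value of t

flags_x = [is_anomalous(density_x(x), threshold) for x in events]
flags_y = [is_anomalous(density_y(to_y(x)), threshold) for x in events]
```

In the x coordinate the event in the tail falls below the threshold, while in the y coordinate the density is flat and no event is flagged, so the implicit reference density u differs between the two parameterisations.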
For SciPost Core, I recommend that the authors make the enumerated clarifications below prior to publication.
In summary, this paper is a valuable addition to the literature and suitable (with minor revisions) for SciPost Core. But the authors have not really delivered on the promise of their title in a satisfying way. The answer to the question "What's Anomalous in LHC Jets?" is apparently "Jets that are uniformly distributed in some reference coordinate system, as determined implicitly by a machine learning algorithm." This is perhaps a true statement, but not one that rises to the SciPost standard of a groundbreaking result.
Requested changes
0. Transfer to SciPost Core
1. The authors state on page 3 that "some of our colleagues even seem to lack trust in these models altogether". Is this a reference to machine learning models or to hidden valley models? I assume the latter, in which case this reads like a strange statement, since why should someone "trust" a model that has thus far not passed any experimental tests (other than not being observed)?
2. The datasets used for modeling the background should be described in Section 2.1.
3. For the Heidelberg dataset, it is unclear why this production mode would lead to a boosted topology. Does the pT cut in eq. (3) mean that there is copious initial state radiation that pushes the dark quarks to be in the same jet cone, or does it make the dark quarks have a large pT individually, such that their decay products yield a fat jet?
4. At the beginning of section 3, the authors say that the computation cost of k-means scales linearly with the number of data points and dimensions. This is not true. Finding the optimal k-means solution is an NP-hard problem. Lloyd's algorithm scales in the way the authors say (if one ignores worst case complexity). This should be clarified.
5. In eq. (11), the authors should clarify what the sum over j includes. I assume it does not include all particles, just those in the cluster.
6. On page 16, the authors should define what SeLU is.
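To make the distinction in item 4 concrete, a minimal pure-Python sketch of Lloyd's iteration follows. It is not the implementation from the authors' repository; the function name, the fixed initial centroids, and the default `max_iter` are assumptions chosen for illustration. Each pass computes one distance per point, centroid, and dimension, so the cost per iteration is linear in the number of points and dimensions, while nothing guarantees that the converged solution is the global optimum.

```python
def lloyd_kmeans(points, centroids, max_iter=100):
    """Lloyd's iteration (illustrative sketch): each pass costs O(n * k * d)."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids no longer move
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for i in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Toy usage with hand-picked starting centroids (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, labels = lloyd_kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

The result depends on the starting centroids, which is one way to see that the linear per-iteration cost says nothing about finding the optimal clustering.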
Author: Thorsten Buss on 2022-07-11 [id 2650]
(in reply to Report 1 on 2022-04-03)
We thank the referee for their comments, which have contributed to a sizeable improvement in the manuscript. However, we disagree that the results presented here do not meet the standards of SciPost Physics. The referee states that the techniques presented "work okay but not great", but this is not true. Some of the results in tagging anomalous top jets show state-of-the-art performance and show the ability to identify low complexity anomalies such as dark jets. The efficiencies and background rejections achieved for the dark jets do not look as impressive when viewed alongside the results for QCD and top jets, but this is only because the dark jets signal is a much more difficult signal to identify. Another key result is the study of the dependence on reparameterisations of the data, which change the density and thus the definition of what an anomaly is. We set ourselves the challenge of identifying these anomalous dark jets both from the low-level collider data and from high-level EFP observables, for a selection of very different anomaly detection tools, and we were successful in doing so. We were also able to draw comparisons between the different anomaly detection methods on different signal types.
We would like to thank the referee for the in-depth discussion on the relation between the density-based definition of anomalies and the optimal observable for the same anomaly in terms of a likelihood ratio. We completely agree with this. In fact one could express our anomaly score as a likelihood ratio where we assume a flat likelihood for the signal, i.e. L=p_s(x)/p(x), with p(x) being the density of the data itself or even the background, and p_s(x)=1 being the assumed flat likelihood for the anomalous signal. Under a coordinate transformation the hypothesis of a flat likelihood changes and we will have a likelihood ratio which performs better or worse in the anomaly detection task. There are a number of ways to elaborate on this, and in the draft we explained this solely in terms of the background density p(x).
Requested change: "The authors state on page 3 that "some of our colleagues even seem to lack trust in these models altogether"."
Reply:
The comment is not about Hidden Valley models in general, but rather it's about the modelling of the dark sector showering implemented in the Hidden Valley module in Pythia. This sentiment is widely reflected in the community, as it is difficult to tune showering and hadronisation models in a dark sector that we have not detected. We agree that this isn't very clear from how the comment has been written, so we have rephrased this as "there are also reasons to doubt that the dark sector showering modelled in Pythia is accurate due to differences between the strong sector in the SM and in the dark sector.".
Requested change: "The datasets used for modeling the background should be described in Section 2.1."
Reply:
We agree, and the following sentences have been added to Sec 2.1:
"The light QCD background jets are simulated using MadGraph5 to obtain dijet events and Pythia8.2 for showering and hadronization"
"A background which we do not consider here arises from detector malfunctions such as dead cells, however this is not modelled by Delphes so we are unable to implement it here. Nevertheless this does not alter the core results of the analysis."
Requested change: "For the Heidelberg dataset, it is unclear why this production mode would lead to a boosted topology."
Reply:
It's not guaranteed that the production mode leads to a boosted topology, and this makes the search more difficult. However we do check that in many of the cases the decay products of the dark quarks end up in the same jet. To clarify this we have added the following sentence to Sec 2.1:
"Although these parameters and cuts do not guarantee that all decay products of the Heidelberg dark quarks end up in the same jet, in many cases they will."
Requested change: "At the beginning of section 3, the authors say that the computation cost of k-means scales linearly with the number of data points and dimensions. This is not true."
Reply:
We have added a sentence on the specific k-means algorithm we use, along with the appropriate citations:
"Lloyd’s k-means algorithm [85] scales linearly with the number of data points and dimensions for each iteration. Since it usually converges quickly [86] (in our application ∼ 300 iterations are sufficient for convergence), we can apply it to large datasets with high-dimensional data."
Requested change: "In eq. (11), the authors should clarify what the sum over j includes. I assume it does not include all particles, just those in the cluster."
Reply:
We have added a sentence below Eq. (11) to clarify this: "(with j iterating over the vectors assigned to cluster i)".
Requested change: "On page 16, the authors should define what SeLU is."
Reply:
We have now added a citation to the original paper explaining SeLU.
Author: Thorsten Buss on 2022-07-11 [id 2651]
(in reply to Report 2 on 2022-05-31)
Requested change: "The LHC is for more than just looking for new physics."
Reply:
We agree with the referee that our comments here may have been a bit strong, so we have edited them as suggested. In the abstract we have replaced the sentence "Searches for anomalies are the main motivation for the LHC and define key analysis steps, including triggers" with "Searches for anomalies are a significant motivation for the LHC and help define key analysis steps, including triggers". And in Sec 6 we have replaced the phrase "hints for physics beyond the Standard Model are at the heart of the LHC program" with "... are a large part of the LHC program".
Requested change: "Not all anomaly detection methods look for events with low p(x)"
Reply:
We agree with the referee here, as we have neglected to describe anomaly detection methods such as CWoLa bump hunting, which look for anomalies that produce overdensities in some distribution. We also agree that not all BSM signals are anomalies, and we have not stated this anywhere in the paper. We have added the paragraph below to Sec 1:
"This definition could also be extended to weakly supervised techniques where the signal is localised in some global observable like the invariant mass of the events, where now the background phase space distribution would need to be inferred through sideband methods. Classification Without Labels (CWoLa) [...] methods go even further and learn a likelihood-ratio classifier for an anomalous signal assumed to be localised in a specific phase space region."
We do also state in Sec 6 that density estimation is the basis of ML-based anomaly searches at the LHC, and this is true both for out-of-distribution anomalies and for overdensities, where in the latter case the ANODE and CATHODE methods use density estimation heavily in their approaches.
Requested change: "p(x) is not coordinate invariant."
Reply:
We completely agree, and this was indeed one of the main conclusions and takeaways from this work. With the dark jets examples we explicitly intended to check how model-agnostic we could be for such a tricky signal type. This may not have been clear, so we have added this sentence to the abstract ", and discuss the model-dependence in choosing an appropriate data parameterisation". This has already been pointed out at the end of Sec 1 with the sentence "a sizeable dependence on the preprocessing of the respective datasets, specifically the reweighting of the inputs.", and in the conclusions we elaborate by saying "Preprocessing is important for all of them, and has a very significant impact on the performances. This is because we define the anomalies as those jets that lie in low density regions of physics space, and the preprocessing alters this density, therefore it changes how the anomalies are defined".
We thank the referee for their comments and hope that the changes we have made meet their requirements.