
SciPost Submission Page

Choose Your Diffusion: Efficient and flexible ways to accelerate the diffusion model in fast high energy physics simulation

by Cheng Jiang, Sitian Qian, Huilin Qu

Submission summary

Authors (as registered SciPost users): Sitian Qian
Submission information
Preprint Link: https://arxiv.org/abs/2401.13162v1  (pdf)
Date submitted: 2024-01-26 01:39
Submitted by: Qian, Sitian
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • High-Energy Physics - Experiment
  • High-Energy Physics - Phenomenology
Approaches: Experimental, Computational, Phenomenological

Abstract

The diffusion model has demonstrated promising results in image generation, recently becoming mainstream and representing a notable advancement for many generative modeling tasks. Prior applications of the diffusion model to both fast event and detector simulation in high energy physics have shown exceptional performance, providing a viable solution for generating sufficient statistics within a constrained computational budget in preparation for the High Luminosity LHC. However, many of these applications suffer from slow generation due to large numbers of sampling steps and face challenges in finding the optimal balance between sample quality and speed. This study focuses on the latest benchmark developments in efficient ODE/SDE-based samplers, schedulers, and fast-convergence training techniques. We test these designs, implemented on an existing architecture, on the public CaloChallenge and JetNet datasets; the generated samples surpass those of previous models on several evaluation metrics while achieving a significant speedup.

Current status:
In refereeing

Reports on this Submission

Anonymous Report 1 on 2024-5-1 (Invited Report)

Strengths

1: Consideration of multiple solvers for diffusion generative models

Weaknesses

1: The text would greatly benefit from a revision. The discussions of the results are hard to follow.
2: The authors aim to provide a comprehensive comparison of samplers, but fall short of being thorough in their studies. How do different distributions change with different sampling steps? How do their results compare against public results on the same dataset?
3: Even though the JetNet dataset is mentioned, the studies performed using the dataset are considerably limited compared to the previous sections, lacking any conclusive results.

Report

The authors aim to provide a comprehensive comparison between samplers for diffusion generative models. They use two public datasets to evaluate the differences between samplers: the CaloChallenge dataset 2 and the JetNet dataset. While the choice of samplers covers a comprehensive set of modern and traditionally used samplers, the results are hard to follow. In particular, almost no plot shows all samplers simultaneously; only an arbitrary subset is chosen for each distribution. This issue is more acute for the JetNet dataset, where almost no effort is made to quantify the differences between solvers. Additionally, the authors do not compare their results with public benchmarks, undermining their goal of establishing the better choice of solvers for specific tasks. The goal of comparing multiple solvers is indeed interesting and deserves to be published, but in its current form additional studies and textual improvements are necessary. More detailed feedback is given below.

P2: “As a consequence, the detector simulations for the modern high granularity detectors occupy the most computation resources. ” Citations to support this statement would be great.

P3: “The performance surpasses the previous model by several evaluation metrics.” Where are these evaluation metrics presented?

P5: “ODE solvers often have smaller discretization errors than SDE solvers in smaller sampling steps.” Is there a reference to support this statement? ODE solvers often require fewer steps than SDE solvers, which seems to contradict the argument about larger time steps given here.

P5: VP scheduler: the variance-preserving property is determined by the relationship alpha^2 + sigma^2 = 1, not by the time evolution of sigma. I would point this out when introducing the VP schedule, or change the name to avoid confusion (for example, the cosine schedule is also variance preserving).
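For reference, the variance-preserving condition referred to here can be stated as follows (a standard formulation; the alpha_t/sigma_t notation is assumed, not taken from the manuscript):

```latex
% Variance-preserving (VP) condition: for the forward process
% x_t = \alpha_t x_0 + \sigma_t \epsilon, the marginal variance stays constant:
\alpha_t^2 + \sigma_t^2 = 1 .
% The cosine schedule also satisfies it for every t:
\alpha_t = \cos\!\left(\tfrac{\pi t}{2}\right), \qquad
\sigma_t = \sin\!\left(\tfrac{\pi t}{2}\right), \qquad
\alpha_t^2 + \sigma_t^2 = 1 .
```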

P5:
EDM schedule:

- What does “churn” mean in this context? The meaning of S_{churn} is given, but it would be clearer to relate the symbol’s name to its meaning.
- “This SDE sampler, unlike other SDEs that only give satisfied result after a very large sampling steps, can converge quickly because of this”: “satisfying” instead of “satisfied”. What does “this” refer to in the explanation?

Eq8:

- How is the score function of the data x dependent on time? This result is only true for sigma = t, which is only one of the possible schedulers discussed. How is that implemented for the other schedulers?
- What is D?
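For context, in the EDM formulation of Karras et al., D usually denotes the denoiser, from which the score follows directly; whether the authors use this convention is an assumption:

```latex
% Standard EDM relation between the denoiser D and the score of the
% noise-perturbed data distribution p(x;\sigma):
\nabla_{x} \log p(x;\sigma) \;=\; \frac{D(x;\sigma) - x}{\sigma^{2}} .
```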

Eq:9

- How does Eq. 8 give you Eq. 9? What is F?
- Define the noise epsilon and the relationship with the noise applied to the data.
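A conventional definition of the noise and its relation to the data, which the manuscript may intend (assumed, not taken from the text):

```latex
% Forward noising process: \epsilon is standard Gaussian noise applied to
% the clean data x_0 with schedule coefficients \alpha_t, \sigma_t:
x_t = \alpha_t\, x_0 + \sigma_t\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) .
```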

P8: “To mitigate the instability of the optimization over the training, …” why is the training unstable to begin with? How is SNR(t) defined?
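The usual definition in the diffusion literature, in terms of the schedule coefficients (assuming the manuscript follows this convention), is:

```latex
% Signal-to-noise ratio of the forward process at time t:
\mathrm{SNR}(t) = \frac{\alpha_t^{2}}{\sigma_t^{2}} .
```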

Eq. 10: What is the benefit of the function being even? Are negative time values used at any point?

P9: “In addition, a DNN classifier with 2 layers, each comprising 2048 nodes and a 0.2 dropout rate”: how was this determined to be sufficient?

Fig. 2: The authors claim the benefit of the weight function based on the convergence results as a function of the number of training epochs. This argument is unfortunately not sufficient to prove their point as the energy ratio is but a single observable from a high dimensional dataset. Moreover, additional parameters such as the choice of optimizer, learning rate, and batch size will all influence the convergence rate independently from the choice of weighting scheme. Additionally, faster training convergence is a debatable quantity for a generative model, as the main benefit of fast detector simulation is at inference time, with training time corresponding to a small fraction compared to the expected inference time during production.

Fig. 3: The ratio plots should be zoomed in, as the current axis range is too large compared to the spread of the ratios. The choice of plots is also odd, as other schedulers were discussed in the previous sections. The same plot with all schedulers shown simultaneously, and with a zoomed-in axis in the ratio panel, would better expose the differences in generation quality. Similarly, the number of steps chosen for each scheduler seems arbitrary at this point. How were they chosen?

Figs. 3, 4, 5: Again, even though multiple solvers are described, the authors only show results for an arbitrary subset. Either show the results for all samplers, or motivate why EDM is preferred in these plots.

Fig. 6: Why is the ratio not shown? This is the first distribution showing a larger set of schedulers and would benefit from a ratio plot. Why is EDM shown with different numbers of steps? Would the other samplers also improve with more steps? For example, LMS shows a disagreement at low voxel energies but uses only 36 steps. As with my previous question, the authors should motivate the choice of steps shown in the comparison plots; otherwise differences cannot be attributed to the solvers but may simply reflect a poor choice of the number of steps.

P10: “Indeed some of them are struggling to match the low voxel energy, the presence of the tail is probably a consequence of model itself and too low energy threshold”. What does that mean? That the model itself is not good enough? If so, then no sampler should be able to achieve good agreement in the low-energy voxel region, which Fig. 6 shows is not the case.

P11: “LMS sampler involves an additional parameter "order" of the coefficient which makes the generation time longer as it increases”. This sentence is cryptic, as that parameter has been neither introduced nor explained in terms of how it influences the solver.

Fig. 7: Why does LMS seem to increase rather than decrease with more steps? This plot and its results would be better shown early in the text to motivate the choice of sampling steps used for the individual histograms (if it is true that the number of steps was chosen based on this plot).

Similarly, plots showing, as a function of the number of steps for each sampler, metrics such as the chi-square or EMD of the 1-dimensional histograms shown earlier would be a great way to compare the samplers.
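Such a comparison could be produced with a short script along the following lines (a sketch only; the sample arrays and step counts are placeholders, and EMD is taken here to be the 1-D Wasserstein distance from SciPy):

```python
import numpy as np
from scipy.stats import wasserstein_distance


def chi2_distance(h_gen, h_ref):
    """Symmetric chi-square distance between two histograms (normalized inside)."""
    p = h_gen / h_gen.sum()
    q = h_ref / h_ref.sum()
    mask = (p + q) > 0
    return 0.5 * np.sum((p[mask] - q[mask]) ** 2 / (p[mask] + q[mask]))


def metrics_vs_steps(samples_by_steps, reference, bins=50):
    """For each step count, histogram the generated samples against the
    reference and return {n_steps: (chi2, emd)}."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    h_ref, _ = np.histogram(reference, bins=edges)
    out = {}
    for n_steps, samples in samples_by_steps.items():
        h_gen, _ = np.histogram(samples, bins=edges)
        out[n_steps] = (chi2_distance(h_gen, h_ref),
                        wasserstein_distance(samples, reference))
    return out
```

Plotting these values per sampler against the number of steps would make the step-count dependence explicit, which is what the comparison in the paper currently lacks.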

P12: “First, much faster convergences have been seen from all new introduced samplers”: in the context of this paper, all samplers are new. Please be more specific about which samplers are referred to.

Fig. 8: How is separation power defined?
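For reference, one common definition of separation power between two binned distributions in HEP is sketched below; whether this matches the authors' definition is an assumption:

```python
import numpy as np


def separation_power(h1, h2):
    """One common definition: 0.5 * sum_i (p_i - q_i)^2 / (p_i + q_i),
    with p, q the normalized histograms. Gives 0 for identical
    distributions and 1 for fully disjoint ones."""
    p = np.asarray(h1, dtype=float)
    q = np.asarray(h2, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    denom = p + q
    mask = denom > 0
    return 0.5 * np.sum((p[mask] - q[mask]) ** 2 / denom[mask])
```

Stating the formula used (and its range) in the caption of Fig. 8 would remove the ambiguity.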

Fig. 9: Again, a ratio plot would help the discussion of the differences observed between samplers. How many steps does “high EMD steps” correspond to?

P12: “This is crucial for us to perform accurate energy calibration from low-level fast calorimeter simulation later.” I’m missing how the previous discussion reaches this conclusion.

Table 1: What do the bold entries mean? The best results? In the AUC column, there are lower AUC and FPD values than the ones shown in bold. Uncertainties from multiple runs should also be shown for each metric to identify when differences are actually significant.
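Quantifying that significance could be as simple as repeating generation with different seeds and reporting mean ± standard deviation per metric; a minimal sketch (the metric values and the one-sigma criterion are illustrative assumptions):

```python
import numpy as np


def summarize_metric(values):
    """Mean and sample standard deviation of a metric over repeated runs."""
    v = np.asarray(values, dtype=float)
    return v.mean(), v.std(ddof=1)


def significantly_better(a_runs, b_runs):
    """Crude check: are two samplers' metric means separated by more than
    the sum of their one-sigma uncertainties?"""
    ma, sa = summarize_metric(a_runs)
    mb, sb = summarize_metric(b_runs)
    return abs(ma - mb) > (sa + sb)
```

Reporting, e.g., AUC as mean ± std over several seeds would make it clear which bolded entries are genuinely best.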

P14: “We choose Karras and Lu schedulers to illustrate the impacts of different noise schedulers on the same samplers.” Why this choice of schedulers? Where is this illustrated? The following discussion is very hard to follow without any visual aid.

P16: The JetNet results are remarkably brief compared to the calorimeter results. How does the sampling quality change in this case with the number of steps used? How do the values obtained compare with the many public results on the JetNet dataset?

“It may be because methods are more applicable to UNet and pixelated data than point clouds network.” Why would that be the case? Which studies were performed to reach this conclusion?

Recommendation

Ask for major revision

  • validity: ok
  • significance: ok
  • originality: low
  • clarity: poor
  • formatting: reasonable
  • grammar: mediocre
