An NLO-Matched Initial and Final State Parton Shower on a GPU

Michael H. Seymour; Siddharth Sule

SciPost Submission Page

Submission summary

Authors (as registered SciPost users):

Siddharth Sule

Submission information
Preprint Link:	https://arxiv.org/abs/2511.19633v1 (pdf)
Code repository:	http://gitlab.com/siddharthsule/gaps
Code version:	v2.0.0
Code license:	GPL-3
Date submitted:	Nov. 27, 2025, 7:32 a.m.
Submitted by:	Siddharth Sule
Submitted to:	SciPost Physics Codebases

Ontological classification
Academic field:	Physics
Specialties:	High-Energy Physics - Phenomenology
Approaches:	Theoretical, Computational, Phenomenological

Disclosure of Generative AI use

The author(s) disclose that the following generative AI tools have been used in the preparation of this submission:

We used GitHub Copilot in Visual Studio Code to help correct programming errors in the GAPS codebase. No AI tools were used for the scientific content or the algorithms, and the manuscript is entirely our own work.

Abstract

Recent developments have demonstrated the potential for high simulation speeds and reduced energy consumption by porting Monte Carlo Event Generators to GPUs. We release version 2 of the CUDA C++ parton shower event generator GAPS, which can simulate initial and final state emissions on a GPU and is capable of hard-process matching. As before, we accompany the generator with a near-identical C++ generator to run simulations on single-core and multi-core CPUs. Using these programs, we simulate NLO Z production at the LHC and demonstrate that the speed and energy consumption of an NVIDIA V100 GPU are on par with a 96-core cluster composed of two Intel Xeon Gold 5220R Processors, providing a potential alternative to cluster computing.

Current status:

Awaiting resubmission

Reports on this Submission

Report #2 by Zoltan Nagy (Referee 2) on 2026-1-22 (Invited Report)

Strengths

1) The authors investigate the use of GPUs. 2) The code is publicly available. 3) The paper is well written.

Weaknesses

1) I feel that the studied process and parton shower are too simple to convince me that GPU computing is feasible and useful for general-purpose MC simulations.

Report

Parton shower algorithms, and particle physics simulations in general, are becoming more and more complex and resource hungry. We have to deal with larger expressions for the hard part of the interaction, and we would like to incorporate more quantum effects in our parton shower algorithms. To achieve such complicated calculations we either use bigger CPU clusters or try to utilise every "transistor" and all the computing power in our computers. This paper investigates the use of graphics processing units (GPUs) for parton shower calculations.

In this manuscript the authors demonstrate the use of GPUs with the Drell-Yan process matched at NLO. They find that running the parton shower on the GPU is as efficient (in speed and power consumption) as running on CPUs. This paper can be considered a proof of concept. The paper is well written and easy to follow, and the developed application and library are publicly available. The paper doesn't solve any (new) physics problem, but clearly that is not the purpose of this work.
For these reasons I support the publication of this paper in SciPost.

On the other hand, I have some criticism too. I am not an expert on GPU computing, but I have a feeling that the implemented parton shower is rather simple. It is a kind of classical parton shower algorithm (no colour beyond leading colour, no spin effects); the only complication is the PDFs. Furthermore, the hard process is also very simple, even at NLO.
I wonder whether the GPU calculation is useful when we try to evaluate more complex hard functions, for example when we try to do matching and merging for jet production processes. In that case we have to put much more complex expressions onto the GPU.
Furthermore, I am very much interested in doing colour evolution in parton shower algorithms, and dealing with colour beyond the leading-colour approximation is rather hard and requires a lot of memory. It would be interesting to see in the future whether the GPU can also be used for such memory-hungry processes.

Recommendation

Publish (meets expectations and criteria for this Journal)

validity: good
significance: good
originality: high
clarity: top
formatting: perfect
grammar: perfect

Report #1 by Peter Skands (Referee 1) on 2026-1-5 (Invited Report)

Report

See requested changes.

Requested changes

For cross checks, it would be useful to be able to build and run the CPU and/or GPU codes independently of each other. The README files in the repository do not appear to contain instructions for how to do so. E.g., I do not have an NVIDIA GPU but would still have liked to be able to test that at least the CPU code compiles. This I could not do.

In section 2: on the discussion of selecting the winner. As discussed in more detail in 1605.09246, the veto algorithm can also be run using the sum of trial weights, in that paper called ‘generate-select’. On a CPU, this tends to be faster than the way discussed here, especially when the number of emitters is large. Why is this not considered or commented on here? (This more efficient way of generating trials has already been used, e.g., in the Vincia parton shower. I believe it is also used in the PanScales showers, and perhaps others.)
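To make the distinction concrete, the two ways of generating a trial emission can be sketched as below. This is only an illustration under simplifying assumptions, not the implementation used in GAPS, Vincia, or PanScales: each emitter is assumed to have a constant overestimate coefficient c_i in d ln t, so that its trial no-emission probability is (t / t_start)^c_i, and the function names `trial_scale`, `competition`, and `generate_select` are hypothetical. The accept/reject (veto) step that follows the trial is the same in both cases and is omitted.

```python
import math
import random


def trial_scale(t_start, c, rng):
    """Solve (t / t_start)**c = r for t, with r uniform in (0, 1]."""
    r = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so t is never zero
    return t_start * r ** (1.0 / c)


def competition(t_start, coeffs, rng):
    """One trial per emitter; the emitter with the highest scale wins."""
    trials = [trial_scale(t_start, c, rng) for c in coeffs]
    winner = max(range(len(coeffs)), key=lambda i: trials[i])
    return trials[winner], winner


def generate_select(t_start, coeffs, rng):
    """One trial from the summed coefficient, then pick the emitter
    with probability c_i / sum(c)."""
    total = sum(coeffs)
    t = trial_scale(t_start, total, rng)
    winner = rng.choices(range(len(coeffs)), weights=coeffs)[0]
    return t, winner
```

Under these assumptions both functions sample the same joint distribution of trial scale and winning emitter, since the product of the individual factors (t / t_start)^c_i equals (t / t_start)^(sum of c_i). The competition variant draws one random number per emitter, while the generate-select variant draws two in total, which is the origin of the speed difference for large numbers of emitters.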

Figure 1: is not clear / lacks explanation. What do the numbers in the boxes mean? What do green or red circles mean? What do dashed lines mean? What do dashed boxes mean? What do ellipses mean? What do vertical rows correspond to? What does “Cycle 1” refer to? The earlier paper [11] does contain more explanation but even there the precise meaning of all graphic elements is still not really clear enough in my opinion. And this paper should be readable without having to refer back to [11] anyway. I note, however, that once Fig 1 has been improved (by improving the explanation and/or by improving the figure) I think Fig 2 is good and clear.

In section 3.1: While it might reduce power consumption, it is not a priori clear to me that running the GPU with fewer kernels would SPEED UP the calculation; what processing effect results in the overall speedup seen from partitioning?

In section 4.1: define the observables mZ, pT, etc, in terms of how they are calculated from the event records. Using which particle(s)?

In section 4.1: It should be made more clear that in the unmatched case, the effect of the Z width is included, while in the NLO one it is not. I accept that this is implicit from the process specifications, but since the arrows in those specifications are not rigorously defined it could remain ambiguous which precise approximation they imply.

Figure 3: phi(Z) distribution. The spike in the middle looks like a misbinning effect? Likewise the drops at the edges. Likewise the phi(Z) plot in Figure 4.

Figure 3 caption: the phrasing "is shown when" is potentially confusing. Rephrase the statements describing the cuts for clarity and conciseness.

Figure 4, first panel: nice to see that the stated delta function is actually what is generated, but does this really require an entire plot?

Figure 4: what is the feature below pTZ ~ 1.5 GeV? Presumably it is related to the cutoff, but I could not find any comment or explanation. Also, I note that this plot is in log pT, while the corresponding one in Fig 3 is on a linear scale, making the two hard to compare.

Figure 5 caption: the statement "for small NEV there is negligible improvement" is misleading, and indeed negated by the following clause. Rephrase for correctness and conciseness.

Figure 6: here, I would like to see the extent to which linear scaling is or is not manifested. The choice of log x axis distorts that. Suggest to change to execution time PER fixed number of events and equivalently power consumption PER fixed number of events.

Figure 6: it is worthwhile and interesting to consider the energy consumption. Another factor would be the effective price in money. Are the prices to acquire these two systems similar? What about renting them from a service provider for the calculation? These numbers would be interesting to add.

The final paragraph contains too many colloquialisms, which would be fine in a seminar setting, but seem misplaced here.

In appendix A: I think it would be appropriate to cite the original derivation of the backwards-evolution framework for ISR,
https://doi.org/10.1016/0370-2693(85)90674-4

In Appendix B: Together with the first mention of the ARIADNE program, it would also be fair to include the original dipole-antenna shower paper it was based on,
https://doi.org/10.1016/0550-3213(88)90441-5.
And since a fairly comprehensive list of showers is given, I think it would be appropriate to mention that Pythia also implements an option for Ariadne-like antenna showers,
https://arxiv.org/abs/2003.00702.

In Appendix C: Clarify whether this is an “MC@NLO”-style matching, a POWHEG-style one, or something else, along with appropriate references?

Recommendation

Ask for major revision

validity: good
significance: good
originality: good
clarity: ok
formatting: excellent
grammar: good
