
SciPost Submission Page

SKATR: A Self-Supervised Summary Transformer for SKA

by Ayodele Ore, Caroline Heneka, Tilman Plehn

Submission summary

Authors (as registered SciPost users): Ayodele Ore · Tilman Plehn
Submission information
Preprint Link: https://arxiv.org/abs/2410.18899v1  (pdf)
Code repository: https://github.com/heidelberg-hepml/skatr/
Date submitted: 2024-10-30 15:59
Submitted by: Ore, Ayodele
Submitted to: SciPost Physics
Ontological classification
Academic field: Physics
Specialties:
  • Gravitation, Cosmology and Astroparticle Physics
Approach: Computational

Abstract

The Square Kilometer Array will initiate a new era of radio astronomy by allowing 3D imaging of the Universe during Cosmic Dawn and Reionization. Modern machine learning is crucial to analyze the highly structured and complex signal. However, accurate training data is expensive to simulate, and supervised learning may not generalize. We introduce a self-supervised vision transformer, SKATR, whose learned encoding can be cheaply adapted for downstream tasks on 21cm maps. Focusing on regression and generative inference of astrophysical and cosmological parameters, we demonstrate that SKATR representations are maximally informative and that SKATR generalizes out-of-domain to differently-simulated, noised, and higher-resolution datasets.

Author indications on fulfilling journal expectations

  • Provide a novel and synergetic link between different research areas.
  • Open a new pathway in an existing or a new research direction, with clear potential for multi-pronged follow-up work
  • Detail a groundbreaking theoretical/experimental/computational discovery
  • Present a breakthrough on a previously-identified and long-standing research stumbling block
Current status:
Awaiting resubmission

Reports on this Submission

Report #2 by Anonymous (Referee 2) on 2025-1-9 (Invited Report)

Strengths

The authors present an interesting application of a vision transformer (ViT) that, to the best of my knowledge, has not yet been applied to an SKA-related project. They successfully demonstrate its advantages compared to established machine learning approaches, as well as the challenges due to the degeneracy of the 21-cm astrophysical parameters.

Report

The paper meets the criteria of the journal. I recommend its publication after addressing some minor corrections and remarks.

Requested changes

1) Connected to question 3 below: if galactic and/or extragalactic foreground contamination is not included in this paper, I suggest adding a few sentences on page 2, after the sentence that ends with "...absent during the pre-training and with noised data", mentioning its absence and stating that residual foreground contamination (due to imperfect subtraction) could make the application of ML methods more challenging.

2) In the title of Section 2: "Lightcones" (delete the space).

3) In Section 2.1, page 4, the sentence "The thermal foreground noise estimate considers the 21cm foreground wedge in k-space to cover the primary field-of-view of the instrument" does not make clear whether the authors include galactic and/or extragalactic foreground contamination in the analysis or only systematic noise. I suggest rephrasing the sentence to make this clearer, as the text appears to suggest that foreground contamination is included when my understanding is that it is not.

4) Page 5, equation (3): I am a bit confused by the notation. Shouldn't the input of equations 3 and 4 be the union of the context-encoder output \tilde{z} and the predictions p, i.e. the embedding p \cup \tilde{z}, rather than the input x?

5) Page 5, sentence after equation 3: it is not clear to me what W1 and W2 are. Are these the weights of the two-layer dense network? Also, it would be helpful to remind the reader that dimension 6 is the number of astrophysical parameters.

6) Page 6, sentence after equation 4: a similar question as for equation 3. Are Wk and Wv the weights of this second network? Also, the dimension of the learnable parameter q is not clear.

7) Page 6, paragraph starting with "In a ViT, ...": it is not clear to me when and how the two task-specific networks are employed. From the paragraph, the authors use a combination of the two-layer dense network (Eq. 3) and a dynamic pooling function (Eq. 4), but in the later results only the MLP is mentioned. Moreover, the authors state that during pre-training the JEPA loss is applied directly in the embedding space, whereas for the results related to Figure 3 the MLP appears to be used to predict the 6 parameters. It would help the reader to include a short paragraph that clarifies when the two aggregation steps are used in the pipeline (during pre-training, validation, etc.), their task (my understanding is that they produce the results in Fig. 3), and at which stage they are trained; see also the illustrative sketch at the end of this list.

8) Page 6, at the very beginning of Section 3.2: "Lightcones" (delete the space).

9) Page 6, sentence "During training, an LC (batch) is divided into a set of N patches...": is this N the same quantity as shown in Figure 1, "Transformer Blocks (xN)"? If not, please change the notation to avoid confusion.

10) Page 7, sentence "Finally, a smaller transformer...": please define what is meant by smaller. Are you using the same architecture as the context and target encoders but with a different embedding dimension?

11) Page 7, sentence after equation 9. Maybe I missed it, but what value has been used for tau?

12) Page 13, sentence "The only exceptions are LX, where the improvement is only marginal, and E0, where SKATR is slightly outperformed": did the authors test whether this trend is also present in the LR dataset? If available, it would be interesting to mention this in just one sentence.

13) Page 16: I suggest moving Figure 11 to Section 3. This figure helps the reader understand what is explained in Section 3.1 and would help answer my question 7.

14) In Section 5, "Outlook": for completeness, there should be a few sentences on the observational systematics that are absent from this work, e.g. residual foreground contamination, beam effects, RFI-corrupted data, etc. Although I understand this work focuses on testing the ML application, it is important to remind readers that the presence of such contamination in future 21-cm SKA observations can impact the performance of these predictions, and to stress once again the importance of taking this into account in future applications to actual observational data.
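
To make points 5, 6, 7 and 11 more concrete, here is a minimal PyTorch sketch of one possible reading of the two aggregation heads and of the EMA update. The names W1, W2, Wk, Wv, q and tau follow the manuscript's notation, but all dimensions and the exact wiring are my own assumptions for illustration, not the authors' implementation.

    import torch
    import torch.nn as nn

    EMBED_DIM = 128   # assumed token embedding dimension (illustrative)
    N_PARAMS = 6      # number of astrophysical parameters, as in Eq. (3)

    class MLPHead(nn.Module):
        """Two-layer dense network of Eq. (3); W1 and W2 are its weight matrices."""
        def __init__(self, embed_dim=EMBED_DIM, n_params=N_PARAMS):
            super().__init__()
            self.W1 = nn.Linear(embed_dim, embed_dim)
            self.W2 = nn.Linear(embed_dim, n_params)

        def forward(self, tokens):
            # tokens: (batch, n_tokens, embed_dim); average over tokens, then regress
            pooled = tokens.mean(dim=1)
            return self.W2(torch.relu(self.W1(pooled)))

    class AttentionPool(nn.Module):
        """Dynamic pooling of Eq. (4): a learnable query q attends over the tokens
        through key/value projections Wk and Wv."""
        def __init__(self, embed_dim=EMBED_DIM):
            super().__init__()
            self.q = nn.Parameter(torch.randn(embed_dim))          # learnable query, shape (embed_dim,)
            self.Wk = nn.Linear(embed_dim, embed_dim, bias=False)  # key projection
            self.Wv = nn.Linear(embed_dim, embed_dim, bias=False)  # value projection

        def forward(self, tokens):
            # tokens: (batch, n_tokens, embed_dim)
            keys, values = self.Wk(tokens), self.Wv(tokens)
            attn = torch.softmax(keys @ self.q / keys.shape[-1] ** 0.5, dim=1)  # (batch, n_tokens)
            return (attn.unsqueeze(-1) * values).sum(dim=1)        # one summary vector per LC

    @torch.no_grad()
    def ema_update(target_encoder, context_encoder, tau=0.996):
        """EMA update of the target encoder (point 11); tau = 0.996 is a guess, not the paper's value."""
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(tau).add_((1.0 - tau) * p_c)

In this reading, the JEPA loss of the pre-training stage acts directly on the token embeddings produced by the encoders, and the heads above are attached and trained only afterwards for the downstream regression onto the 6 parameters; a short paragraph in Section 3 confirming or correcting this picture would settle point 7.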

Recommendation

Ask for minor revision

  • validity: high
  • significance: good
  • originality: high
  • clarity: good
  • formatting: excellent
  • grammar: excellent

Report #1 by Anonymous (Referee 1) on 2024-12-29 (Invited Report)

Strengths

1- The paper provides a satisfactory overview of the challenges in SKA data processing and gives good motivation for using the SKATR implementation of the vision transformer architecture in lieu of other summary-statistic/feature-extraction techniques.

2- The authors have addressed a key challenge of SKA data analysis: providing a maximally informative representation of the large, yet structured, SKA data. Moreover, the technique shows reasonable robustness to out-of-domain datasets, a key indicator of generalizability for any machine-learning method.

Report

The manuscript does meet the journal criteria and I recommend its publication upon addressing the requested changes.

Requested changes

1- In Sec 2.1, the second sentence "... leading to LCs with 140 voxels in the on-sky axes ..." could be more explicit in specifying that 140 is the dimension along each on-sky axis, for example "... leading to LCs with 140 voxels along each of the two on-sky axes ...".

2- In Sec 2.1, why are the LR datasets downsampled by a factor of 2.5? It is unclear to me whether this is to reduce compute demand or perhaps to average out some of the noise. If so, why isn't the HR dataset also downsampled by the same factor? If this is a misunderstanding on my part, please let me know; otherwise, I would advise explicitly stating the reasons in the section.

3- Sec 2.1, last paragraph: the train-test-validation split for the HRDS dataset is not specified.

4- Sec 3.1, last paragraph: given the statement "In order to manage the computational cost of a ViT, the patch size should be selected with an expected image resolution in mind", can the authors provide the reasoning behind the specific choice of patch sizes (7, 7, 50) for HR and (4, 4, 10) for LR/HRDS, considering that the latter has a constant downsampling factor of 5 in each dimension? If not from a rigorous analysis, at least an empirical reasoning; a small token-count sketch is included after this list for reference.

5- Figure 6, right: Have the authors explored a post-hoc or post-training calibration (such as Bayesian posterior refinement) of SKATR/ViT to account for the consistently conservative posterior estimates?

6- Unless I have missed this in the text, could the authors please specify the computational facilities (RAM, number and type of GPUs, or a rough estimate of total compute) used in the training of SKATR and the ViT? This would be especially useful for interpreting how the training times quoted in Figures 5 & 14 would scale with one's own implementation of SKATR or the ViT.

7- Appendix C: for the benefit of the reader, could the authors please elaborate on the implications of the bullet points? A short discussion of each figure (Figures 14-20) would suffice.

8- Could the authors please further elaborate on plans for SKATR with regard to areas of improvement and/or application?
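
As a small aside on point 4 (it also touches point 9 of the other report), the token count a ViT has to process follows directly from the lightcone shape and the patch size, which is presumably what drives the computational trade-off. The sketch below is purely illustrative: only the patch sizes (7, 7, 50) and (4, 4, 10) are taken from the manuscript, while the lightcone shapes are invented for the example.

    import numpy as np

    def n_patch_tokens(lc_shape, patch_size):
        """Number of non-overlapping 3D patches (tokens) the encoder sees for one lightcone."""
        assert all(s % p == 0 for s, p in zip(lc_shape, patch_size)), "patches must tile the LC exactly"
        return int(np.prod([s // p for s, p in zip(lc_shape, patch_size)]))

    # Hypothetical lightcone shapes (on-sky, on-sky, line-of-sight); illustrative only
    print(n_patch_tokens((140, 140, 500), (7, 7, 50)))  # 20 * 20 * 10 = 4000 tokens
    print(n_patch_tokens((80, 80, 200), (4, 4, 10)))    # 20 * 20 * 20 = 8000 tokens

Since self-attention cost grows quadratically with this token count, even an empirical sentence along these lines in Sec. 3.1 would make the choice of (7, 7, 50) versus (4, 4, 10) easier to follow.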

Recommendation

Ask for minor revision

  • validity: high
  • significance: good
  • originality: good
  • clarity: good
  • formatting: good
  • grammar: excellent
