Automatic transformation of irreducible representations for efficient contraction of tensors with cyclic group symmetry

Yang Gao; Phillip Helms; Garnet Kin-Lic Chan; Edgar Solomonik

SciPost Submission Page

Automatic transformation of irreducible representations for efficient contraction of tensors with cyclic group symmetry

by Yang Gao, Phillip Helms, Garnet Kin-Lic Chan, Edgar Solomonik

This Submission thread is now published as

Submission summary

Authors (as registered SciPost users):

Edgar Solomonik

Submission information
Preprint Link:	scipost_202209_00050v1 (pdf)
Code repository:	https://github.com/yangcal/symtensor/
Date accepted:	2022-09-27
Date submitted:	2022-09-23 17:37
Submitted by:	Solomonik, Edgar
Submitted to:	SciPost Physics Codebases

Ontological classification
Academic field:	Physics
Specialties:	Condensed Matter Physics - Computational
Approach:	Computational

Abstract

Tensor contractions are ubiquitous in computational chemistry and physics, where tensors generally represent states or operators and contractions express the algebra of these quantities. In this context, the states and operators often preserve physical conservation laws, which are manifested as group symmetries in the tensors. These group symmetries imply that each tensor has block sparsity and can be stored in a reduced form. For nontrivial contractions, the memory footprint and cost are lowered, respectively, by a linear and a quadratic factor in the number of symmetry sectors. State-of-the-art tensor contraction software libraries exploit this opportunity by iterating over blocks or using general block-sparse tensor representations. Both approaches entail overhead in performance and code complexity. With intuition aided by tensor diagrams, we present a technique, irreducible representation alignment, which enables efficient handling of Abelian group symmetries via only dense tensors, by using contraction-specific reduced forms. This technique yields a general algorithm for arbitrary group symmetric contractions, which we implement in Python and apply to a variety of representative contractions from quantum chemistry and tensor network methods. As a consequence of relying on only dense tensor contractions, we can easily make use of efficient batched matrix multiplication via Intel's MKL and distributed tensor contraction via the Cyclops library, achieving good efficiency and parallel scalability on up to 4096 Knights Landing cores of a supercomputer.

Author comments upon resubmission

We are grateful to all reviewers for the feedback, comments, and questions.

Multiple reviewers suggested that the article is more appropriate for SciPost Codebases, and we have resubmitted the article to the Codebases section.

We apologize for linking to the wrong software repository when entering the submission, as well as for the lack of documentation. To improve the software usability and adhere to the SciPost Codebases submission guidelines, we have modernized the build system, added complete documentation for the software, and added automated testing (see https://github.com/yangcal/symtensor and https://solomonik.cs.illinois.edu/symtensor/).

Two of the reviewers suggested the paper would be improved by a complete physics application. Our intent has been to pursue multiple separate application-audience-targeted studies to the efficacy of this approach in different application areas (i.e., coupled cluster methods for quantum chemistry and DMRG-based methods for tensor networks simulation in condensed matter physics). Indeed, we expect such evaluation to be nontrivial and for our method to fare differently depending on context, since our approach incurs an overhead that is dependent on the distribution of symmetry block sizes. However, both students leading this research project have recently completed their PhD and moved on to other positions, so we are currently not in position to expand the scope of work to full-scale simulations.

Responses to Specific Reviewer Comments:

Reviewer 1:

Question 1.1: More detailed discussion of U(1) symmetry, which typically arises in physics context, is missing. If U(1) is not treated approximately, i.e. by some finite (cyclic) subgroup, the number of differently charged blocks is unbounded. Typically, the size of the charged sectors (often labeled N in the manuscript) follows Gaussian distribution with respect to U(1) charge. How does the symtensor approach compare to "loop over block" approach in such case ?

Response 1.1: The current approach can strictly emulate U(1) by choosing a finite cyclic subgroup where the order of the group is so large such that every U(1) irrep encountered in the simulation maps to a distinct irrep of the cyclic group. The efficiency compared to the loop over blocks implementation is rather application specific. In principle, the loop-over-blocks approach can be more computationally efficient as the use of our method requires padding blocks, and if there is a wide distribution of block sizes then this padding is inefficient. However, using small block sizes is inefficient with any approach. Thus, by choosing a smaller cyclic subgroup symtensor may be more efficient than a U(1) approach using the loop-over-blocks implementation.

As discussed in the general response above, however, we would like to leave investigation of full applications / specific symmetries arising in physical systems, to future work. We have added some further discussion of U(1) as well as the padding overhead to Sections 3 and 4.

Question 1.2: The "loop over blocks" method, adopted widely in other libraries, in theory allows for parallelization of the loop since the contractions of individual blocks are independent. Do the comparisons in the manuscript use such parallelization of "loop over blocks" algorithm ?

Response 1.2: We did not employ such a parallelization in our tests. While this approach would be easy to do for a shared-memory parallelization, and would achieve better single-node performance with proper management of load balance, such manual parallelization is much less straight-forward for distributed-memory systems. On the other hand, by using a single large tensor contraction, our approach can use existing libraries for tensor contraction to achieve parallelism on a variety of architectures (shared-memory/GPU/distributed/hybrid) with only a change of tensor contraction backend.

Question 1.3: Why do examples MM_b, MPS_a, MPS_b, PEPS_a, and PEPS_B in Fig. 11
using BLAS perform slightly better than CTF ?

Response 1.3: CTF is executed with 64 processes with distinct memory address spaces, while BLAS is leveraging threading across the node with a single shared address space. Generally, but not always, a thread-based parallelization should outperform a process- (MPI-)based parallelization on a single node. Of course CTF has the advantage of being able to use multiple nodes, enabling larger tensors by partitioning each tensor’s data across many nodes and achieving faster performance. An advantage of our algorithm and software library is that we can simultaneously support both of these types of parallelizations via existing backends.

Question 1.4: (in Minor remarks) I believe beside ITensor and Uni10, the TenPy project is missing among the leading packages for abelian-symmetric tensor network computation frameworks (targeting MPS in this case)

Response 1.4: We have added this reference and addressed the other correction, thank you for these!

Reviewer 2

Question 2.1: In Algorithm 2.2, it's not clear (to me) what the two rows of the coefficient vectors are representing.

Response 2.1: The coefficient vectors are partitioned into two subvectors. One subvector corresponds to coefficients associated with contracted indices, while the other is associated with the uncontracted indices of that tensor. We have added further clarification of this in the pseudocode.

Question 2.2: In Section 3, you write that the library also supports U(1) symmetries. However, in that case there's no reason to assume that the various symmetry blocks have equal sizes, while the presented algorithm assumes this. Can you please comment on that? You mentioned earlier that you just fill up blocks with explicit zeros, so do you significantly loose sparsity in this case?

Response 2.2: Indeed, even for the cyclic subgroups explicitly addressed by the library, there is no reason to assume that the symmetry blocks have equal sizes a priori. Indeed to apply our approach in such cases padding with explicit zeros would be necessary to use the dense tensor representations as we employ. We expect the extent of the overhead to be application/system-dependent and can envision ways to apply our approach in a more sophisticated way with less padding (e.g., bucket blocks by size). However, we would like to leave this investigation for future work.

Thank you also for pointing out the typos, we have corrected these.

Reviewer 3

Question 3.1: Could the author (briefly) explain how to obtain the reduced form of a tensor for a specific problem? It would be useful to add some references.

Response 3.1: As an explicit example, consider a rank-3 tensor T[I, J, K] where the symmetry rule requires that n(I) + n(J) – n(K) = 0 (n(X) denotes the irrep for index X; an explicit example would be where T is the tensor in an MPS simulation, and n represents the particle number index). Assuming each block has equal size, T[I, J, K] can be unfolded into T[n(I), i, n(J), j, n(K), k]. The elements are only non-zero when the symmetry rule holds. The reduced form R can be obtained by iterating over such non-zero blocks to eliminate one degree of freedom, eg, R[n(I), i, n(J), j, k] = T[n(I), i, n(J), j, n(I)+n(J), k] where the index n(K) is absent from R. In general we can remove any one of the n(X) indices.

In practical applications, we don’t start with initializing either T[I,J,A] or T[n(I), i, n(J), j, n(K), k]. Instead, we create the reduced form directly using conservation rule to avoid memory overhead. With input data properly initialized, all intermediates and output tensors will naturally carry the correct symmetry relation. In Section IV, we give a few relevant examples of these tensors for each application domain, and include references that describe further background on the methods. We have also added the above example to the text (adjusted to match notation in the paper) introducing our method.

Question 3.2: Is it possible to exploit multiple cyclic group symmetries (translations in two dimensions?)

Response 3.2: Yes, we support this in the symtensor software. We have revised the text to make this clear.

Question 3.3: Regarding non-uniform blocks: how does the algorithm perform when the padding with zeros is not applied?

Response 3.3: It is an essential feature that the blocks have the same size. This allows us to map the contraction operations into dense tensor algebra, which makes the approach easier to understand (without the need to refer to group theory) and makes it implementable with existing standard dense tensor numerical libraries.

Question 3.4: Does DPD have the same scaling advantages?

Response 3.4: Yes, the computational complexity of the DPD approach is the same. Our algorithm essentially performs DPD with tensor algebra, which enables its application more generally and with standard tensor primitives.

Question 3.5: Why does Symtensor not use the NumPy backend as the other methods in Figure 10?

Response 3.5: No particular reason, the BLAS backend to SymTensor could have been used here and achieves similar performance, as evidenced in other performance plots.

Question 3.6: Please add 1/N_core line for comparison in Figure 12.

Response 3.6: We have added this.

Question 3.7: Could the authors (briefly) comment on the possibility to include non-abelian symmetries in a similar way?

Response 3.7: We expect extension of this type of approach to handle non-Abelian symmetries to be highly non-trivial and, while we hope to study this in future work, we cannot at this point provide useful comment on the extent of challenges/potential. We have mentioned this in our discussion of future work, however.

Question 3.8: The repository linked to SciPost should be fixed.

Response 3.8: Sorry for the error, we inadvertently linked to a private fork of the repo. The newest submission should contain the correct link.

Question 3.9: It would be appreciated to see a successful application of SymTensor to a real physical problem.

Response 3.9: As we discuss in the general response, we would like to leave this for future work.

Thank you for pointing out the typos, we corrected all of those.

List of changes

We highlight significant changes in the revised preprint in red text. These include
* an explicit example of how to construct the reduced tensor used as input to our algorithm
* clarifications of breadth of support of symmetries in software library
* clarification to pseudocode
* further discussion of challenges and open questions in supporting variable block sizes and non-Abelian symmetry
We have also made corrections to all typos / minor comments made by reviewers, as discussed in the response.

Besides these, we have also made significant changes to the codebase
* complete documentation (https://solomonik.cs.illinois.edu/symtensor/)
* automated software testing (https://github.com/yangcal/symtensor/actions)
* improved build system and integration with modern tensor software wrappers

Published as SciPost Phys. Codebases 10-r1.3 (2023) , SciPost Phys. Codebases 10 (2023)

SciPost Submission Page

Automatic transformation of irreducible representations for efficient contraction of tensors with cyclic group symmetry

by Yang Gao, Phillip Helms, Garnet Kin-Lic Chan, Edgar Solomonik

This Submission thread is now published as

Submission summary

Abstract

Author comments upon resubmission

List of changes

Login to report or comment