SciPost Submission Page
Automatic transformation of irreducible representations for efficient contraction of tensors with cyclic group symmetry
by Yang Gao, Phillip Helms, Garnet KinLic Chan, Edgar Solomonik
This Submission thread is now published as
Submission summary
Authors (as registered SciPost users):  Edgar Solomonik 
Submission information  

Preprint Link:  scipost_202209_00050v1 (pdf) 
Code repository:  https://github.com/yangcal/symtensor/ 
Date accepted:  20220927 
Date submitted:  20220923 17:37 
Submitted by:  Solomonik, Edgar 
Submitted to:  SciPost Physics Codebases 
Ontological classification  

Academic field:  Physics 
Specialties: 

Approach:  Computational 
Abstract
Tensor contractions are ubiquitous in computational chemistry and physics, where tensors generally represent states or operators and contractions express the algebra of these quantities. In this context, the states and operators often preserve physical conservation laws, which are manifested as group symmetries in the tensors. These group symmetries imply that each tensor has block sparsity and can be stored in a reduced form. For nontrivial contractions, the memory footprint and cost are lowered, respectively, by a linear and a quadratic factor in the number of symmetry sectors. Stateoftheart tensor contraction software libraries exploit this opportunity by iterating over blocks or using general blocksparse tensor representations. Both approaches entail overhead in performance and code complexity. With intuition aided by tensor diagrams, we present a technique, irreducible representation alignment, which enables efficient handling of Abelian group symmetries via only dense tensors, by using contractionspecific reduced forms. This technique yields a general algorithm for arbitrary group symmetric contractions, which we implement in Python and apply to a variety of representative contractions from quantum chemistry and tensor network methods. As a consequence of relying on only dense tensor contractions, we can easily make use of efficient batched matrix multiplication via Intel's MKL and distributed tensor contraction via the Cyclops library, achieving good efficiency and parallel scalability on up to 4096 Knights Landing cores of a supercomputer.
Author comments upon resubmission
Multiple reviewers suggested that the article is more appropriate for SciPost Codebases, and we have resubmitted the article to the Codebases section.
We apologize for linking to the wrong software repository when entering the submission, as well as for the lack of documentation. To improve the software usability and adhere to the SciPost Codebases submission guidelines, we have modernized the build system, added complete documentation for the software, and added automated testing (see https://github.com/yangcal/symtensor and https://solomonik.cs.illinois.edu/symtensor/).
Two of the reviewers suggested the paper would be improved by a complete physics application. Our intent has been to pursue multiple separate applicationaudiencetargeted studies to the efficacy of this approach in different application areas (i.e., coupled cluster methods for quantum chemistry and DMRGbased methods for tensor networks simulation in condensed matter physics). Indeed, we expect such evaluation to be nontrivial and for our method to fare differently depending on context, since our approach incurs an overhead that is dependent on the distribution of symmetry block sizes. However, both students leading this research project have recently completed their PhD and moved on to other positions, so we are currently not in position to expand the scope of work to fullscale simulations.
Responses to Specific Reviewer Comments:
Reviewer 1:
Question 1.1: More detailed discussion of U(1) symmetry, which typically arises in physics context, is missing. If U(1) is not treated approximately, i.e. by some finite (cyclic) subgroup, the number of differently charged blocks is unbounded. Typically, the size of the charged sectors (often labeled N in the manuscript) follows Gaussian distribution with respect to U(1) charge. How does the symtensor approach compare to "loop over block" approach in such case ?
Response 1.1: The current approach can strictly emulate U(1) by choosing a finite cyclic subgroup where the order of the group is so large such that every U(1) irrep encountered in the simulation maps to a distinct irrep of the cyclic group. The efficiency compared to the loop over blocks implementation is rather application specific. In principle, the loopoverblocks approach can be more computationally efficient as the use of our method requires padding blocks, and if there is a wide distribution of block sizes then this padding is inefficient. However, using small block sizes is inefficient with any approach. Thus, by choosing a smaller cyclic subgroup symtensor may be more efficient than a U(1) approach using the loopoverblocks implementation.
As discussed in the general response above, however, we would like to leave investigation of full applications / specific symmetries arising in physical systems, to future work. We have added some further discussion of U(1) as well as the padding overhead to Sections 3 and 4.
Question 1.2: The "loop over blocks" method, adopted widely in other libraries, in theory allows for parallelization of the loop since the contractions of individual blocks are independent. Do the comparisons in the manuscript use such parallelization of "loop over blocks" algorithm ?
Response 1.2: We did not employ such a parallelization in our tests. While this approach would be easy to do for a sharedmemory parallelization, and would achieve better singlenode performance with proper management of load balance, such manual parallelization is much less straightforward for distributedmemory systems. On the other hand, by using a single large tensor contraction, our approach can use existing libraries for tensor contraction to achieve parallelism on a variety of architectures (sharedmemory/GPU/distributed/hybrid) with only a change of tensor contraction backend.
Question 1.3: Why do examples MM_b, MPS_a, MPS_b, PEPS_a, and PEPS_B in Fig. 11
using BLAS perform slightly better than CTF ?
Response 1.3: CTF is executed with 64 processes with distinct memory address spaces, while BLAS is leveraging threading across the node with a single shared address space. Generally, but not always, a threadbased parallelization should outperform a process (MPI)based parallelization on a single node. Of course CTF has the advantage of being able to use multiple nodes, enabling larger tensors by partitioning each tensor’s data across many nodes and achieving faster performance. An advantage of our algorithm and software library is that we can simultaneously support both of these types of parallelizations via existing backends.
Question 1.4: (in Minor remarks) I believe beside ITensor and Uni10, the TenPy project is missing among the leading packages for abeliansymmetric tensor network computation frameworks (targeting MPS in this case)
Response 1.4: We have added this reference and addressed the other correction, thank you for these!
Reviewer 2
Question 2.1: In Algorithm 2.2, it's not clear (to me) what the two rows of the coefficient vectors are representing.
Response 2.1: The coefficient vectors are partitioned into two subvectors. One subvector corresponds to coefficients associated with contracted indices, while the other is associated with the uncontracted indices of that tensor. We have added further clarification of this in the pseudocode.
Question 2.2: In Section 3, you write that the library also supports U(1) symmetries. However, in that case there's no reason to assume that the various symmetry blocks have equal sizes, while the presented algorithm assumes this. Can you please comment on that? You mentioned earlier that you just fill up blocks with explicit zeros, so do you significantly loose sparsity in this case?
Response 2.2: Indeed, even for the cyclic subgroups explicitly addressed by the library, there is no reason to assume that the symmetry blocks have equal sizes a priori. Indeed to apply our approach in such cases padding with explicit zeros would be necessary to use the dense tensor representations as we employ. We expect the extent of the overhead to be application/systemdependent and can envision ways to apply our approach in a more sophisticated way with less padding (e.g., bucket blocks by size). However, we would like to leave this investigation for future work.
Thank you also for pointing out the typos, we have corrected these.
Reviewer 3
Question 3.1: Could the author (briefly) explain how to obtain the reduced form of a tensor for a specific problem? It would be useful to add some references.
Response 3.1: As an explicit example, consider a rank3 tensor T[I, J, K] where the symmetry rule requires that n(I) + n(J) – n(K) = 0 (n(X) denotes the irrep for index X; an explicit example would be where T is the tensor in an MPS simulation, and n represents the particle number index). Assuming each block has equal size, T[I, J, K] can be unfolded into T[n(I), i, n(J), j, n(K), k]. The elements are only nonzero when the symmetry rule holds. The reduced form R can be obtained by iterating over such nonzero blocks to eliminate one degree of freedom, eg, R[n(I), i, n(J), j, k] = T[n(I), i, n(J), j, n(I)+n(J), k] where the index n(K) is absent from R. In general we can remove any one of the n(X) indices.
In practical applications, we don’t start with initializing either T[I,J,A] or T[n(I), i, n(J), j, n(K), k]. Instead, we create the reduced form directly using conservation rule to avoid memory overhead. With input data properly initialized, all intermediates and output tensors will naturally carry the correct symmetry relation. In Section IV, we give a few relevant examples of these tensors for each application domain, and include references that describe further background on the methods. We have also added the above example to the text (adjusted to match notation in the paper) introducing our method.
Question 3.2: Is it possible to exploit multiple cyclic group symmetries (translations in two dimensions?)
Response 3.2: Yes, we support this in the symtensor software. We have revised the text to make this clear.
Question 3.3: Regarding nonuniform blocks: how does the algorithm perform when the padding with zeros is not applied?
Response 3.3: It is an essential feature that the blocks have the same size. This allows us to map the contraction operations into dense tensor algebra, which makes the approach easier to understand (without the need to refer to group theory) and makes it implementable with existing standard dense tensor numerical libraries.
Question 3.4: Does DPD have the same scaling advantages?
Response 3.4: Yes, the computational complexity of the DPD approach is the same. Our algorithm essentially performs DPD with tensor algebra, which enables its application more generally and with standard tensor primitives.
Question 3.5: Why does Symtensor not use the NumPy backend as the other methods in Figure 10?
Response 3.5: No particular reason, the BLAS backend to SymTensor could have been used here and achieves similar performance, as evidenced in other performance plots.
Question 3.6: Please add 1/N_core line for comparison in Figure 12.
Response 3.6: We have added this.
Question 3.7: Could the authors (briefly) comment on the possibility to include nonabelian symmetries in a similar way?
Response 3.7: We expect extension of this type of approach to handle nonAbelian symmetries to be highly nontrivial and, while we hope to study this in future work, we cannot at this point provide useful comment on the extent of challenges/potential. We have mentioned this in our discussion of future work, however.
Question 3.8: The repository linked to SciPost should be fixed.
Response 3.8: Sorry for the error, we inadvertently linked to a private fork of the repo. The newest submission should contain the correct link.
Question 3.9: It would be appreciated to see a successful application of SymTensor to a real physical problem.
Response 3.9: As we discuss in the general response, we would like to leave this for future work.
Thank you for pointing out the typos, we corrected all of those.
List of changes
We highlight significant changes in the revised preprint in red text. These include
* an explicit example of how to construct the reduced tensor used as input to our algorithm
* clarifications of breadth of support of symmetries in software library
* clarification to pseudocode
* further discussion of challenges and open questions in supporting variable block sizes and nonAbelian symmetry
We have also made corrections to all typos / minor comments made by reviewers, as discussed in the response.
Besides these, we have also made significant changes to the codebase
* complete documentation (https://solomonik.cs.illinois.edu/symtensor/)
* automated software testing (https://github.com/yangcal/symtensor/actions)
* improved build system and integration with modern tensor software wrappers
Published as SciPost Phys. Codebases 10r1.3 (2023) , SciPost Phys. Codebases 10 (2023)