Computational Many-Body Theory: GWA Calculations

Computational Many-Body Theory Home

Object-orientation, efficiency, documentation and parallel implementations of quasiparticle calculations.

Wilfried G. Aulbur and John W. Wilkins
Department of Physics, The Ohio State University, Columbus, Ohio 43210, USA

Object Orientation

The common computing paradigm in most scientific applications is procedure-oriented programming. The central task of a software engineer using the procedure-oriented paradigm is to identify the procedures, i.e., the data manipulations, required to solve a problem and the corresponding best algorithms. The structure element of a procedure-oriented code is the procedure or subroutine with its input and output data. This way of programming leads in general to many global and only few local data. The global data are either passed explicitly from subroutine to subroutine or - as in Fortran - passed by using COMMON blocks. This violation of the software engineering principle of data locality1 can lead to severe problems once a complex piece of software has to be debugged, modified, or extended beyond the realm of its initial use.

An object-oriented programming paradigm2 conceives the task of solving a computational problem as identifying interacting classes or objects. A class is an abstract data type or a user-defined data type. It comprises data and all operations necessary to manipulate the data. For instance, the abstract data type VECTOR is the ensemble of the operations that create, destroy, and manipulate (scalar and vector multiplication, addition, etc.) a vector as well as an integer corresponding to its length and a pointer to its first element.

Classes consist of a header file that declares the class type - i.e. its data members and member functions - and a file containing the actual implementation of the class functions. A software user should be able to use a class efficiently in a particular application by looking only at the header file. Details of the representation of the data as well as algorithms used for a particular numerical problem are hidden from the user. Moreover, unauthorized user access to class data is prohibited. This principle of data hiding or information hiding leads to a modular programming style and reduces the use of global variables.

Classes as program modules should be written in a reusable and general way. For instance, a class VECTOR should not be limited to vector-vector operations but should also allow matrix-vector operations. Note that the principle of inheritance can increase the reusability of code considerably. The basic idea of inheritance is that common features of two different classes are incorporated into a base class whose properties are then inherited by the so-called derived classes. For example, consider a base class HUMAN with private data AGE and WEIGHT and member functions EAT, DRINK, and SLEEP. A derived class STUDENT would contain all the functionalities of HUMAN but on top of it the member functions ATTEND_LECTURE, DO_HOMEWORK, and GOOF_OFF.

Our reciprocal-space GW code is written in C++ in an object-oriented manner. Data abstraction at a low-level is achieved through extensive use of an in-house C++-library that defines the abstract data types VECTOR and MATRIX.3 At a higher level, the central classes of our code are symmetry, self-energy, polarizability, and wave_function. We will describe data encapsulation and code reusability using symmetry and wave_function as an example. The class symmetry can currently handle zincblende materials such as GaAs and elemental semiconductors such as Si. However, it provides the full functionality (apply a particular symmetry operation, find its inverse and apply the inverse, determine the little group of a vector, etc.) irrespective of the space group under consideration. Treating a system with, e.g., wurtzite symmetry therefore requires only hardcoding the corresponding symmetry operations in the implementation of symmetry (about 150 lines out of a total of 10000 lines of code). The class wave_function contains the k-point at which the wave function is defined, the number of bands and plane waves kept, the wave function coefficients and energies. Moreover, it contains basic operations such as time reversal, convolution, and Fourier transform. These basic operations are written in a general manner such that they apply not only for the construction of the matrix elements of the irreducible polarizability. They can be used as well for the plasmon pole eigenvectors (also of type wave_function) and the construction of the self-energy expectation values.

We have not found it necessary to use inheritance due to the simple class structure of the reciprocal space GW code. In principle, one could imagine a base class identical to a function defined on a mesh in G-space or real space including a mathematical transformation (the FFT) between those two spaces. Wavefunctions, local potentials, and charge densities could be designed as derived classes of this base class. Nonlocal quantities such as the self-energy would be derived classes of a base class consisting of a nonlocal function g(r,r') with the property

g(r+R,r'+R) = g(r, r'),

where R is a Bravais lattice vector, together with a six-dimensional Fourier transform. We are currently exploring advantages of such a class representation in the context of real space GW calculations.

Efficiency - Documentation - Reliability

C++ is less well established in the scientific community than Fortran. Therefore an interface with existing Fortran codes is a must. We obtain an efficient code by interpreting our wave functions as matrices rather than linear arrays. Hence, convolutions can be done using highly efficient BLAS3 routines. For example, for the calculations of the plasmon pole band structure using BLAS3 subroutines leads to major portions of our code that run at about 270 MFlops on a CRAY-YMP8 (peak performance 333 MFlops). The overall performance of our code for a typical Si plasmon pole band structure is about 170 MFlops with a computational intensity of about 2.0. A value of the computational intensity of about 2.0 proofs that our application uses memory very efficiently according to the standards of the CRAY performance tools.

An important point for the creation of maintainable software is documentation. Our code contains about 20% of documentation which makes it easy to maintain. Moreover, new users should be able to use the code after a relatively short start-up time.

Finally, the question of code reliability has to be addressed. We use an empirical pseudopotential4 to generate wave functions and energies as input for our self-energy code. We compare our results for the plasmon pole band structure as well as for the self-energy expectation values with an identical calculation by Andrzej Fleszar of the University of Würzburg, Germany. The results of the two independent codes agree to nine significant digits or better.

Parallel Implementations of GW calculations.

Reciprocal Space

Here we will describe our parallelization scheme using the irreducible polarizability as an example. Let PE denote a processing element and npes the number of processors used during the course of a calculation. Furthermore, let ncb be the number of conduction bands used in the calculation, nvb the number of valence bands, npw the number of plane waves, i.e., the size of our basis, and nG the number of reciprocal space vectors G. All four quantities ncb, nvb, npw, and nG, increase linearly with the size , e.g. the number of atoms Natom, of the system under consideration. PE0 denotes the PE that does all the I/O between the file system and the T3D. For each k-vector PE0 reads the energies and the G-vectors. Then PE0 reads the first ncb/npes conduction bands (we assume mod(ncb/npes)=0 for simplicity) and sends the information about k-point, energies, G-vectors, and the first block of conduction bands to PE1. Subsequently, PE0 reads the second block of conduction bands and sends it together with the k-point, energies, and G-vectors to PE2, and so on. Once all conduction bands are distributed, PE0 reads the valence band information for a point k-q in the BZq and broadcasts that information to all PEs. Then all PEs - including PE0 - calculate the following expression

                            
 (i+1)n_cb npes-1  occ     vc          vc      *
 ---------------- -----   M  (k,q) * [M  (k,q)]
  \                \     __G___________G'_______
  /                /                            
 ---------------- -----      E (k-q) - E (k)    
 c = i n_cb npes    v         v         c       


Finally, the results of this calculation are added up via a single-node accumulation on PE0. For a graphical description of our algorithm see Fig. 1.

The important features of our parallel implementation are:

Storage requirements

The storage needed for complex wave functions on each PE is proportional to
(nvb + ncb /npes)*npw.
The memory requirements for wave functions increase approximately linear with system size since nvb<<ncb and since we can increase the number of processors to compensate for increasing system size. For the matrix elements MGvc the storage requirements are proportional to:
nvb*(ncb /npes)*nG,
which leads to a quadratic memory increase with system size. Hence, for both wave functions and matrix elements we gain at least one order of Natom by using massively parallel computers.

In practical terms, for a 512-node T3D such as the T3D of the Pittsburgh Supercomputer Center or the T3D of the Edinburgh Parallel Computing Center in England, the storage requirements are about 0.1 MW for the wave functions and 1.5 MW for the matrix elements in the case of (GaN)16. This information fits easily on a single node of the T3D with its 8 MW storage capacity. For much larger systems that require more memory we can resort to the SP2 at the Cornell Theory Center whose ``thin'' nodes already offer 32 MW memory per node and whose ``wide'' nodes possess up to 256 MW memory.

Scalability

Our parallel algorithm is scalable. The increase of the volume of the system by a factor of 2 leads to twice as many conduction bands. The corresponding matrix elements can be calculated efficiently by using twice as many processors. Moreover, scalability applies to memory usage as well as discussed in (i).

Load balance

The load balance is excellent. Not counting the I/O overhead all PEs have exactly the same work load if mod(ncb,npes)=0. If mod(ncb,npes) is not equal to 0, then the last npes-1 PEs calculate the matrix elements MG vc for ncb/npes conduction bands, while PE0 calculates MG vc for the remaining conduction bands. This distribution of work load compensates partially for the extra work that PE0 has to do to assemble P0G G'(q;0) at the end of the calculation (compare Fig. 1).

Communication and portability

Our approach is coarse grain parallel and portable. PE0 reads and sends the input information and wave function data to all other processors. These calculate the matrix elements for every k-point and send the results of their calculation back to PE0. PE0 then assembles the total polarizability. Therefore, for a fixed k-point, communication occurs only at the beginning and at the end of the computation. Since our program is coarse grain parallel, we can afford to use PVM as communication primitive. PVM ensures portability of our code to different platforms.

Note also that the use of symmetry does not require any inter-processor communication since the matrix elements for all G-vectors and hence for their images under symmetry operations of the little group of q are created and used locally on one processor.

Object orientation - Use of C++

Writing our code in C++ and running it on the T3D has been an exciting endeavor since we are one of the very few, if not the only, C++ users on this machine. This endeavor has to be warranted by important pay-offs in writing and maintaining a code. Here we mention two:
  1. One central object in our code are wave functions. As described above this class comprises the k-point, at which the wave function is defined, the number of bands and plane waves kept, the wave function coefficients and energies. Sending a wave function wvfnctn from processor my_pe to processor your_pe only requires the following statement:
    
    wvfnctn.send(my_pe,your_pe); or wvfnctn.receive(my_pe,your_pe);.
    

    The actual sending of the data and the communication primitive used in this process are hidden from the user. Note, that this feature makes a change from PVM to, e.g., MPI easy, since the change is done in only one subroutine. Switching between PVM and MPI can then conveniently be achieved via preprocessor directives.

  2. During the course of our code development we changed the parallelization scheme three times. We started out using the embarrassingly parallel approach of distributing k-vectors over PEs,5 switched to a parallelization over reciprocal lattice vectors, G, and finally decided to parallelize over conduction bands as described above. Changing the code to accommodate new parallelization schemes profited immensely from the flexibility that objects, i.e., data and their functionality, offer compared to a procedure-oriented programming style.

Real Space

In the case of the reciprocal space GW calculation we saw that scalability could be attained by parallelizing over the conduction bands. In real space, the r-vectors in the irreducible real space unit cell are an appropriate variable for a scalable parallel implementation. The number of vectors in the irreducible real space unit cell is directly proportional to system size and hence the number of atoms in the system. A system that increases its size by a factor of two can therefore be handled by using twice as many processors. The important operations in the program are the assembly of the Green function and the irreducible polarizability in real space and imaginary time and the subsequent transformation to reciprocal space and imaginary frequency. All subsequent operations in the code are either the exact inversion (for instance one-to-all broadcast and single-node accumulation) or slight modifications of these basic operations. We will therefore discuss parallelism only up to the construction of the screened interaction in reciprocal space and imaginary frequency.

As in the case of the reciprocal space implementation we use managed I/O via PE0. PE0 starts by reading input information about the lattice, k-points, LDA energies, etc. and by broadcasting it to all other PEs. Then in a small serial portion of the program PE0 reads the Fourier coefficients for each band and FFTs them to obtain the periodic parts of the Bloch wave functions unk(r) and u*nk(r'). These wave functions are written to direct access files. For unk(r) the record index is the FFT index of the r-vector in the real space unit cell. The record contains unk(r) for all bands and all k-points. For unk(r') the record index is the number of the special k-point and the record contains all r'-points in the real space unit cell for all bands.

Now an external loop over imaginary time starts in which we calculate the non-interacting propagator at imaginary times ±i\tau and for all r-vectors in the irreducible real space unit cell. Each PE has a list of r-vectors for which it calculates the non-interacting propagator G0(r,r';±i\tau) for a particular time i\tau and all r' in the sphere. For example, if the number nviuc of r-points in the irreducible real space unit cell is 55 and the number of processors is 4, then PE0 calculates G0 for the first 13 r-vectors, PE1 for the following 14, and so on.

To calculate G0 at a point ri each PE needs unk(ri) for all bands and all k-points as well as u*nk(r') for all r', all bands, and all k-points. Let rank denote the number of a PE. For each i and each node PE 0 reads the corresponding wave function information unk(ri) in real space and sends it to PE rank with a message tag rank. PE rank issues a corresponding receive with message tag rank. This guarantees, that no message can be sent to the wrong PE. To avoid exhaustion of system resources, the send and receives are done sequentially, one after the other. At the end PE 0 reads the information that it needs for its own calculations. Subsequently in a loop over k-points, PE0 reads unk(r') for a particular k-point and broadcasts this wave function for all bands and all r' in the real space unit cell to all PEs (compare Fig. 2). After each PE calculated its non-interacting propagator at ±i\tau it sends the product G0(ri,r';+i\tau)G0(ri,r'; -i\tau) - i.e. the irreducible polarizability upto a constant factor - to PE0 and PE0 writes the result to disk. This communication is the inverse operation of the distribution of unk(ri) described above. A graphical description of this communication step is given in Fig. 2.

In a next step PE0 reads the irreducible polarizability P0 for different vectors ri in the irreducible real space unit cell and distributes P0(ri, r';i\tau) over the different processors in an analogous way as is done for the distribution of unk(ri). Each PE performs the first 3D FFT over r' on node turning P0(ri,r';i\tau) into P0(k,ri,G';i\tau). Moreover each PE sets up the input data for the next 3D FFT, i.e., it unfolds the irreducible polarizability from the irreducible to the full real space unit cell.

Note, that we obtain close to perfect load balancing for the construction of the Green function by parallelizing over the r-vectors in the irreducible real space unit cell. For the construction of the irreducible polarizability in reciprocal space load balancing is less favorable since now the symmetry of the r-vectors in the irreducible real space unit cell is important as well. However, larger systems have a lot of r-vectors in the irreducible real space unit cell and low symmetry. Therefore, differences in symmetries should average out when we consider large unit cells.

Two more points are important for the 3D FFT: (i) We do not keep all G'-vectors in P0(k,r,i,G';i\tau) but only those with |G'|<=Gmax. In the case of Si, Gmax = 3.4 Hartree which corresponds to 169 G'-vectors. (ii) After the first 3D FFT each PE contains all G'-vectors and its chunk of r-vectors in the irreducible real space unit cell as well as their images under symmetry operations of the space group of the crystal. Since the distribution of the r-vectors is not contiguous in any coordinate we cannot use parallel FFTs to perform the next 3D FFT. We perform a ``block-column'' gather operation using the MPI_GATHER function which is illustrated in Fig. 3. The basic idea is to assemble all r-vectors and a block of G'-vectors on one PE such that we can do a local 3D FFT on P0(k, r, G';i\tau) after reordering the r-vectors in FFT order.

Hence we finally end up with P0(k,G,G';i\tau) for all G-vectors and a block of G'-vectors which is written to disk via PE0 for all k-points and all times i\tau. The next step is the Fourier transform with respect to imaginary time. To do this Fourier transform, PE0 reads quadratic submatrices in (G,G') of P0(k,G,G';i\tau) for all times i\tau and distributes these submatrices over a quadratic grid of processors. The time Fourier transform is done locally. Further matrix operations such as the construction of the RPA dielectric constant and of the screened interaction are done using ScaLapack routines for the multiplication and inversion of distributed matrices.

Figure Captions:

Figure 1:
Schematic diagram of a parallel calculation of the irreducible polarizability in reciprocal space. The same scheme applies to the calculation of the self-energy in reciprocal space. PE0 manages disk I/O and distributes the valence and blocks of the conduction band wave functions over all PEs. Each PE calculates its partial contribution to the irreducible polarizability. The partial contributions are summed up at the end of the calculation via a single-node accumulation on PE0.
Figure 2:
Schematic diagram of the communication structure of the parallel implementation of the real space GW code. We list all communication and calculational steps needed to determine the irreducible polarizability P0 in real space and imaginary time and to transform it to reciprocal space. The distribution of the unk(r) as well as P0(r,r';i\tau) over all PEs is described in the text. The redistribution step using MPI_GATHER is detailed in Fig. 3. The Fourier transformation from imaginary time to imaginary frequency and assembly of the screened interaction via parallel matrix operations is described in the text.
Figure 3:
Communication structure for the redistribution of irreducible polarizability over PEs. After the first 3D FFT each PE has the irreducible polarizability P0(k,r, G',i\tau) for one particular k-point, a subset of r-vectors in the irreducible real space unit cell as well as their images under symmetry operations of the space group, all G'-vectors, and one particular time i\tau. As indicated in the figure we use MPI_GATHER to assemble P0 for all r-vectors and a block of G'-vectors on each PE (first block on PE0, second on PE1, and so on), reorder P0 according to its r argument to have standard FFT ordering and perform the second 3D FFT again on node. The different length of the r-blocks in the picture corresponds to a different number of r-points on each PE. The total number of G'-vectors kept after performing the first FFT is nG' and npes is the number of PEs used. In the above example mod(nG',npes)=2.

References:

  1. B. Stroustrup, The C++ programming language, 2nd edition (Addison-Wesley, New York, 1991), p 14ff.
  2. M. A. Ellis and B. Stroustrup, The annotated C++ reference manual (Addison-Wesley, New York, 1991).
  3. This library was written and is maintained by W. Wenzel, University of Dortmund, Germany, and M. M. Steiner, Ohio State University, Ohio, USA.
  4. M. L. Cohen and T. K. Bergstresser, Phys. Rev. 141, 789 (1966).
  5. W. G. Aulbur and J. W. Wilkins, Bulletin of the American Physical Society 40, No. 1, 251 (1995).

Computational Many-Body Theory Home

To cite this page:
Computational Many-Body Theory
<http://www.physics.ohio-state.edu/~aulbur/computing/computing.html>

Edited by: aulbur@campbell.mps.ohio-state.edu