Computational Many-Body Theory
Wilfried G. Aulbur and John W. Wilkins
Object-orientation, efficiency, documentation and parallel
implementations of quasiparticle calculations.
Department of Physics, The Ohio State University, Columbus, Ohio
43210, USA
An object-oriented programming paradigm2
conceives the task of solving a
computational problem as identifying interacting classes or objects.
A class is an abstract data type or a user-defined data type. It
comprises data and all operations necessary to manipulate the data.
For instance, the abstract data type VECTOR is the ensemble of the
operations that create, destroy, and manipulate (scalar and vector
multiplication, addition, etc.) a vector as well as an integer
corresponding to its length and a pointer to its first element.
Classes consist of a header file that declares the class type - i.e.
its data members and member functions - and a file containing the
actual implementation of the class functions. A software user should be
able to use a class efficiently in a particular application by looking
only at the header file. Details of the representation of the data as
well as algorithms used for a particular numerical problem are hidden
from the user. Moreover, unauthorized user access to class data is
prohibited. This principle of data hiding or information
hiding leads to a modular programming style and reduces the use of
global variables.
Classes as program modules should be written in a reusable and general
way. For instance, a class VECTOR should not be limited to
vector-vector operations but should also allow matrix-vector operations.
Note that the principle of inheritance can increase the
reusability of code considerably.
The basic idea of inheritance is that
common features of two different classes are incorporated into a base
class whose properties are then inherited by the so-called derived
classes. For example, consider a base class HUMAN with private
data AGE and WEIGHT and member functions EAT,
DRINK, and SLEEP. A derived class STUDENT
would contain all
the functionalities of HUMAN but on top of it the member functions
ATTEND_LECTURE, DO_HOMEWORK, and GOOF_OFF.
Our reciprocal-space GW code is written in C++ in an object-oriented
manner. Data abstraction at a low-level is achieved through extensive
use of an in-house C++-library that defines the abstract data types
VECTOR and MATRIX.3 At a higher level, the
central classes of our code are symmetry, self-energy,
polarizability, and wave_function. We will describe data
encapsulation and code reusability using symmetry and
wave_function as an example. The class symmetry can currently
handle zincblende materials such as GaAs and elemental semiconductors
such as Si. However, it provides the full functionality (apply a
particular symmetry operation, find its inverse and apply the inverse,
determine the little group of a vector, etc.) irrespective of the space
group under consideration. Treating a system with, e.g., wurtzite symmetry
therefore requires only hardcoding the corresponding symmetry operations
in the implementation of symmetry (about 150 lines out of a total of
10000 lines of code). The class wave_function contains the
k-point at which the wave function is defined, the number of bands and
plane waves kept, the wave function coefficients and energies. Moreover,
it contains basic operations such as time reversal, convolution, and
Fourier transform. These basic operations are written in a general
manner such that they apply not only for the construction of the matrix
elements of the irreducible polarizability.
They can be used as well for the plasmon pole eigenvectors (also of type
wave_function) and the construction of the self-energy expectation
values.
We have not found it necessary to use inheritance due to the simple
class structure of the reciprocal space GW code. In principle, one
could imagine a base class identical to a function defined on a mesh in
G-space or real space including a mathematical transformation (the
FFT) between those two spaces. Wavefunctions, local potentials, and
charge densities could be designed as derived classes of this base
class.
Nonlocal quantities such as the self-energy would be derived classes of
a base class consisting of a nonlocal function g(r,r')
with the property
Object Orientation
The common computing paradigm in most scientific applications is
procedure-oriented programming. The central task of a software engineer
using the procedure-oriented paradigm is to identify the procedures,
i.e., the data manipulations, required to solve a problem and the
corresponding best algorithms. The structure element of a
procedure-oriented code is the procedure or subroutine with its input
and output data. This way of programming leads in general to many
global and only few local data. The global data are either passed
explicitly from subroutine to subroutine or - as in Fortran - passed
by using COMMON blocks. This violation of the software engineering
principle of data locality1
can lead to severe problems once a
complex piece of software has to be debugged, modified, or extended
beyond the realm of its initial use.
where R is a Bravais lattice vector, together with a six-dimensional Fourier transform. We are currently exploring advantages of such a class representation in the context of real space GW calculations.
An important point for the creation of maintainable software is documentation. Our code contains about 20% of documentation which makes it easy to maintain. Moreover, new users should be able to use the code after a relatively short start-up time.
Finally, the question of code reliability has to be addressed. We use an empirical pseudopotential4 to generate wave functions and energies as input for our self-energy code. We compare our results for the plasmon pole band structure as well as for the self-energy expectation values with an identical calculation by Andrzej Fleszar of the University of Würzburg, Germany. The results of the two independent codes agree to nine significant digits or better.
(i+1)n_cb npes-1 occ vc vc *
---------------- ----- M (k,q) * [M (k,q)]
\ \ __G___________G'_______
/ /
---------------- ----- E (k-q) - E (k)
c = i n_cb npes v v c
Finally, the results of this calculation are added up via a single-node accumulation on PE0. For a graphical description of our algorithm see Fig. 1.
The important features of our parallel implementation are:
In practical terms, for a 512-node T3D such as the T3D of the Pittsburgh Supercomputer Center or the T3D of the Edinburgh Parallel Computing Center in England, the storage requirements are about 0.1 MW for the wave functions and 1.5 MW for the matrix elements in the case of (GaN)16. This information fits easily on a single node of the T3D with its 8 MW storage capacity. For much larger systems that require more memory we can resort to the SP2 at the Cornell Theory Center whose ``thin'' nodes already offer 32 MW memory per node and whose ``wide'' nodes possess up to 256 MW memory.
Note also that the use of symmetry does not require any inter-processor communication since the matrix elements for all G-vectors and hence for their images under symmetry operations of the little group of q are created and used locally on one processor.
wvfnctn.send(my_pe,your_pe); or wvfnctn.receive(my_pe,your_pe);.
The actual sending of the data and the communication primitive used in this process are hidden from the user. Note, that this feature makes a change from PVM to, e.g., MPI easy, since the change is done in only one subroutine. Switching between PVM and MPI can then conveniently be achieved via preprocessor directives.
As in the case of the reciprocal space implementation we use managed I/O via PE0. PE0 starts by reading input information about the lattice, k-points, LDA energies, etc. and by broadcasting it to all other PEs. Then in a small serial portion of the program PE0 reads the Fourier coefficients for each band and FFTs them to obtain the periodic parts of the Bloch wave functions unk(r) and u*nk(r'). These wave functions are written to direct access files. For unk(r) the record index is the FFT index of the r-vector in the real space unit cell. The record contains unk(r) for all bands and all k-points. For unk(r') the record index is the number of the special k-point and the record contains all r'-points in the real space unit cell for all bands.
Now an external loop over imaginary time starts in which we calculate the non-interacting propagator at imaginary times ±i\tau and for all r-vectors in the irreducible real space unit cell. Each PE has a list of r-vectors for which it calculates the non-interacting propagator G0(r,r';±i\tau) for a particular time i\tau and all r' in the sphere. For example, if the number nviuc of r-points in the irreducible real space unit cell is 55 and the number of processors is 4, then PE0 calculates G0 for the first 13 r-vectors, PE1 for the following 14, and so on.
To calculate G0 at a point ri each PE needs unk(ri) for all bands and all k-points as well as u*nk(r') for all r', all bands, and all k-points. Let rank denote the number of a PE. For each i and each node PE 0 reads the corresponding wave function information unk(ri) in real space and sends it to PE rank with a message tag rank. PE rank issues a corresponding receive with message tag rank. This guarantees, that no message can be sent to the wrong PE. To avoid exhaustion of system resources, the send and receives are done sequentially, one after the other. At the end PE 0 reads the information that it needs for its own calculations. Subsequently in a loop over k-points, PE0 reads unk(r') for a particular k-point and broadcasts this wave function for all bands and all r' in the real space unit cell to all PEs (compare Fig. 2). After each PE calculated its non-interacting propagator at ±i\tau it sends the product G0(ri,r';+i\tau)G0(ri,r'; -i\tau) - i.e. the irreducible polarizability upto a constant factor - to PE0 and PE0 writes the result to disk. This communication is the inverse operation of the distribution of unk(ri) described above. A graphical description of this communication step is given in Fig. 2.
In a next step PE0 reads the irreducible polarizability P0 for different vectors ri in the irreducible real space unit cell and distributes P0(ri, r';i\tau) over the different processors in an analogous way as is done for the distribution of unk(ri). Each PE performs the first 3D FFT over r' on node turning P0(ri,r';i\tau) into P0(k,ri,G';i\tau). Moreover each PE sets up the input data for the next 3D FFT, i.e., it unfolds the irreducible polarizability from the irreducible to the full real space unit cell.
Note, that we obtain close to perfect load balancing for the construction of the Green function by parallelizing over the r-vectors in the irreducible real space unit cell. For the construction of the irreducible polarizability in reciprocal space load balancing is less favorable since now the symmetry of the r-vectors in the irreducible real space unit cell is important as well. However, larger systems have a lot of r-vectors in the irreducible real space unit cell and low symmetry. Therefore, differences in symmetries should average out when we consider large unit cells.
Two more points are important for the 3D FFT: (i) We do not keep all G'-vectors in P0(k,r,i,G';i\tau) but only those with |G'|<=Gmax. In the case of Si, Gmax = 3.4 Hartree which corresponds to 169 G'-vectors. (ii) After the first 3D FFT each PE contains all G'-vectors and its chunk of r-vectors in the irreducible real space unit cell as well as their images under symmetry operations of the space group of the crystal. Since the distribution of the r-vectors is not contiguous in any coordinate we cannot use parallel FFTs to perform the next 3D FFT. We perform a ``block-column'' gather operation using the MPI_GATHER function which is illustrated in Fig. 3. The basic idea is to assemble all r-vectors and a block of G'-vectors on one PE such that we can do a local 3D FFT on P0(k, r, G';i\tau) after reordering the r-vectors in FFT order.
Hence we finally end up with P0(k,G,G';i\tau) for all G-vectors and a block of G'-vectors which is written to disk via PE0 for all k-points and all times i\tau. The next step is the Fourier transform with respect to imaginary time. To do this Fourier transform, PE0 reads quadratic submatrices in (G,G') of P0(k,G,G';i\tau) for all times i\tau and distributes these submatrices over a quadratic grid of processors. The time Fourier transform is done locally. Further matrix operations such as the construction of the RPA dielectric constant and of the screened interaction are done using ScaLapack routines for the multiplication and inversion of distributed matrices.
Computational Many-Body Theory
To cite this page:
Computational Many-Body Theory
<http://www.physics.ohio-state.edu/~aulbur/computing/computing.html>