Benchmark: an 8th order DIMxDIM matrix-vector multiplier, run on 1-4 processors within one node with OpenMP parallelization, or on 1-4 single-processor nodes with MPI parallelization. `sync' means that synchronization and a vector gather-scatter were performed after each matrix-vector multiplication; `unsync' means that no synchronization was done between multiplications.
Conclusions: Peak performance is about 180 Mflops. Sustainable performance for the larger matrices (DIM = 600 and 960) is about 100 Mflops for both intra- and inter-node parallelization. For the smaller matrices, synchronization degrades performance moderately for intra-node (OpenMP) parallelization and severely for inter-node (MPI) parallelization: at DIM = 60 on 4 processors, the synchronized runs reach 130 Mflops with OpenMP but only 50 Mflops with MPI.
DIM = 60 (Mflops):
        unsync      sync
proc    OMP  MPI    OMP  MPI
1       177  175    175  167
2       178  177    162  104
3       175  179    143   70
4       175  183    130   50
DIM = 180 (Mflops):
        unsync      sync
proc    OMP  MPI    OMP  MPI
1       174  156    172  169
2       174  174    166  146
3       173  174    159  125
4       171  174    155  109
DIM = 600 (Mflops):
        unsync      sync
proc    OMP  MPI    OMP  MPI
1       105  108    105  105
2        98  106     98  102
3        99  107     98  102
4       101  111     99  100
DIM = 960 (Mflops):
        unsync      sync
proc    OMP  MPI    OMP  MPI
1       105  108    105  104
2        98  105     97  103
3        97  105     96  101
4        96  105     94  102
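The difference between the `sync' and `unsync' columns can be illustrated with a minimal OpenMP sketch in C. This is not the benchmark code, which is not shown here; it is one possible reading of the description, assuming a dense row-major matrix in double precision. The names repeated_matvec and mflops are hypothetical, and the flop count of 2*DIM*DIM per multiplication is the usual convention for a dense matrix-vector product, not something the benchmark notes confirm.

```c
#include <assert.h>

/* Repeat y = A*x `reps` times inside one parallel region.
 * A is a dense row-major dim x dim matrix; rows are split statically
 * across threads, so each thread always owns the same rows.
 *
 * sync != 0: all threads wait at a barrier after every multiplication and
 *            the result vector becomes the input of the next one.
 * sync == 0: each thread keeps multiplying its own rows by the fixed
 *            input x and never waits for the others.
 *
 * Returns the buffer (x or y) that holds the final result.
 * Hypothetical helper, written for illustration only. */
double *repeated_matvec(const double *a, double *x, double *y,
                        int dim, int reps, int sync)
{
    assert(dim > 0 && reps > 0);

    #pragma omp parallel
    {
        /* Declared inside the region, so each thread has its own copy. */
        double *in = x;
        double *out = y;

        for (int r = 0; r < reps; r++) {
            #pragma omp for schedule(static) nowait
            for (int i = 0; i < dim; i++) {
                double s = 0.0;
                for (int j = 0; j < dim; j++)
                    s += a[i * dim + j] * in[j];
                out[i] = s;
            }
            if (sync) {
                /* Every thread must finish writing `out` before any
                 * thread reads it as the next input. */
                #pragma omp barrier
                double *t = in;
                in = out;
                out = t;
            }
        }
    }

    /* With swapping, an even number of multiplications ends in x. */
    return (sync && reps % 2 == 0) ? x : y;
}

/* Mflops for `reps` multiplications, assuming 2*dim*dim flops each. */
double mflops(int dim, int reps, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * dim * dim * reps / seconds / 1.0e6;
}
```

Compiled without OpenMP support the pragmas are ignored and the code runs serially with the same result. To time it, one could wrap the call in two omp_get_wtime() readings and pass the difference to mflops. The barrier is the only cost added in the synchronized mode of this sketch; an MPI version would also have to exchange the vector pieces between nodes, which would be consistent with the larger drop seen in the MPI `sync' column for small DIM.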
The batch system that we use is PBS version 2.2. An optimized version of BLAS, developed as part of the ASCI project, is located in /usr/local/lib/libblas-single.a. Some codes that have been linked against it have shown performance increases of over 300% (Charlotte Elster's 3-boson code).
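As a rough illustration of how a C code might use such a library, the sketch below calls the standard Fortran BLAS routine SGEMV for a matrix-vector product. Several things here are assumptions not stated above: that the library exports the single-precision routine under the symbol sgemv_ (trailing underscore, all arguments passed by reference, as is common for Fortran-compiled BLAS), and that the macro USE_BLAS and the function matvec_single are hypothetical names chosen for this example. Without USE_BLAS the function falls back to a plain loop, so the same source builds on a machine without the library.

```c
#include <assert.h>

#ifdef USE_BLAS
/* Fortran BLAS SGEMV: y := alpha*op(A)*x + beta*y.
 * Assumed calling convention: trailing underscore, arguments by reference. */
extern void sgemv_(const char *trans, const int *m, const int *n,
                   const float *alpha, const float *a, const int *lda,
                   const float *x, const int *incx,
                   const float *beta, float *y, const int *incy);
#endif

/* y = A*x for a dense row-major dim x dim single-precision matrix.
 * Hypothetical wrapper, written for illustration only. */
void matvec_single(const float *a, const float *x, float *y, int dim)
{
    assert(dim > 0);
#ifdef USE_BLAS
    const float one = 1.0f, zero = 0.0f;
    const int inc = 1;
    /* Fortran stores matrices column-major, so a row-major C array looks
     * like the transpose of A; "T" asks BLAS to transpose it back. */
    sgemv_("T", &dim, &dim, &one, a, &dim, x, &inc, &zero, y, &inc);
#else
    /* Portable fallback with the same result. */
    for (int i = 0; i < dim; i++) {
        float s = 0.0f;
        for (int j = 0; j < dim; j++)
            s += a[i * dim + j] * x[j];
        y[i] = s;
    }
#endif
}
```

A build against the optimized library would then look something like `cc -DUSE_BLAS prog.c /usr/local/lib/libblas-single.a`, where prog.c is a placeholder file name; depending on how the library was compiled, Fortran runtime libraries may also need to be added to the link line.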
For updates, see the OSC Hardware web page.