Beowulf Cluster, Ohio Supercomputer Center

Performance Data

Benchmark: an 8th order DIMxDIM matrix-vector multiplier, run on 1-4 intra-node processors with OpenMP parallelization or on 1-4 one-processor nodes with MPI parallelization. `sync' means that synchronization and a vector gather-scatter were performed after each matrix-vector multiplication, while `unsync' means that no thread synchronization was done between multiplications. All figures in the tables below are in Mflops.
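The benchmark source is not reproduced on this page, so the following is only a minimal serial sketch of what the `sync' case plausibly involves, assuming a dense matrix split into contiguous row blocks with one block per processor. The names split_rows, local_matvec and repeated_matvec_sync are hypothetical, and the workers are simulated by a plain loop rather than by real OpenMP threads or MPI ranks.

```python
def split_rows(dim, nproc):
    """Return one (start, stop) row range per worker, covering rows 0..dim-1."""
    base, extra = divmod(dim, nproc)
    ranges, start = [], 0
    for rank in range(nproc):
        stop = start + base + (1 if rank < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def local_matvec(a, x, rows):
    """Compute the slice of y = A x owned by one worker (assumed dense A)."""
    start, stop = rows
    return [sum(a[i][j] * x[j] for j in range(len(x))) for i in range(start, stop)]


def repeated_matvec_sync(a, x, nproc, reps):
    """Apply A to x `reps` times, gathering the full vector after each product.

    The list comprehension over `ranges` stands in for the parallel workers;
    rebuilding the full vector stands in for the gather-scatter step.
    """
    ranges = split_rows(len(x), nproc)
    for _ in range(reps):
        pieces = [local_matvec(a, x, rows) for rows in ranges]
        x = [value for piece in pieces for value in piece]
    return x
```

In this picture every worker needs the complete updated vector before the next multiplication can start, which is the step that requires communication between nodes in the MPI case. The `unsync' runs skip that step.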

Conclusions: The maximum performance is about 180 Mflops. The sustainable performance for larger matrices is about 100 Mflops for both intra- and inter-node parallelization. For smaller matrices, synchronization degrades performance moderately for intra-node (OpenMP) parallelization and severely for inter-node (MPI) parallelization.

DIM = 60:
        unsync       sync
proc   OMP  MPI    OMP  MPI
  1    177  175    175  167
  2    178  177    162  104
  3    175  179    143   70
  4    175  183    130   50

DIM = 180:
        unsync       sync
proc   OMP  MPI    OMP  MPI
  1    174  156    172  169
  2    174  174    166  146
  3    173  174    159  125
  4    171  174    155  109

DIM = 600:
        unsync       sync
proc   OMP  MPI    OMP  MPI
  1    105  108    105  105
  2     98  106     98  102
  3     99  107     98  102
  4    101  111     99  100

DIM = 960:
        unsync       sync
proc   OMP  MPI    OMP  MPI
  1    105  108    105  104
  2     98  105     97  103
  3     97  105     96  101
  4     96  105     94  102
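
One way to quantify the synchronization penalty is the ratio of the sync to the unsync figure for the same configuration. The sketch below transcribes the numbers from the tables above; the RESULTS layout and the sync_ratio helper are illustrative names, not part of the original benchmark.

```python
# Mflops from the tables above.
# RESULTS[dim][mode][model] lists the values for 1, 2, 3 and 4 processors.
RESULTS = {
    60: {"unsync": {"OMP": [177, 178, 175, 175], "MPI": [175, 177, 179, 183]},
         "sync": {"OMP": [175, 162, 143, 130], "MPI": [167, 104, 70, 50]}},
    180: {"unsync": {"OMP": [174, 174, 173, 171], "MPI": [156, 174, 174, 174]},
          "sync": {"OMP": [172, 166, 159, 155], "MPI": [169, 146, 125, 109]}},
    600: {"unsync": {"OMP": [105, 98, 99, 101], "MPI": [108, 106, 107, 111]},
          "sync": {"OMP": [105, 98, 98, 99], "MPI": [105, 102, 102, 100]}},
    960: {"unsync": {"OMP": [105, 98, 97, 96], "MPI": [108, 105, 105, 105]},
          "sync": {"OMP": [105, 97, 96, 94], "MPI": [104, 103, 101, 102]}},
}


def sync_ratio(dim, model, nproc):
    """Return sync Mflops divided by unsync Mflops for one configuration."""
    sync = RESULTS[dim]["sync"][model][nproc - 1]
    unsync = RESULTS[dim]["unsync"][model][nproc - 1]
    return sync / unsync


if __name__ == "__main__":
    for dim in sorted(RESULTS):
        for model in ("OMP", "MPI"):
            print(dim, model, round(sync_ratio(dim, model, 4), 2))
```

On four processors the ratio at DIM = 60 is roughly 0.74 for OpenMP and 0.27 for MPI, while at DIM = 960 both ratios are above 0.95, consistent with the conclusions stated above.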

Cluster Description

The batch system that we use is PBS version 2.2. An optimized version of BLAS, developed as part of the ASCI project, is located in /usr/local/lib/libblas-single.a. Some codes linked against it have shown performance increases of over 300% (for example, Charlotte Elster's three-boson code).

For updates, see the OSC Hardware web page.


Your comments and suggestions are appreciated.

To cite this page:
Beowulf Cluster, Ohio Supercomputer Center
<http://www.physics.ohio-state.edu>
Edited by: wilkins@mps.ohio-state.edu on