780.20: 1094 Session 15
Handouts: Printouts of mpi_hello_world.cpp,
mpi_send_receive.cpp, mpi_calculate_pi.cpp, multimin_sa_compare.cpp,
simpson_cosint_openmp.cpp, excerpt on simulated annealing
In this session, we'll try out some basic examples of parallel
processing using MPI and OpenMP and
consider a toy problem comparing multidimensional minimization with
standard downhill routines, which find local minima,
to an application of Monte Carlo techniques (simulated annealing)
to find the global minimum.
Introduction to Parallel Processing with MPI
This is a very brief introduction by example to using "Open MPI", which is
an implementation of the "Message Passing Interface" or MPI (a library
and software standard). We've set things up so we can use
the computers in Smith 1094 running Linux as parallel processors.
Communication between machines is carried out via ssh but we'll
set it up so you don't need to enter passwords repeatedly.
If your computer is not in Linux already, restart it now.
In a message passing implementation of parallel processing, multiple
"tasks" are created, each with a unique identifying name, which run on
different computers (multiple processes can run on one machine as
well). The tasks talk to each other by sending and receiving messages.
A "message" is an array with elements that include data, identifiers
for the source and destination processes, and a tag used to distinguish
between different kinds of messages.
Usually a fixed set of processes is created when the program is
initialized but the processes can be executing different programs
(this is referred to as "Multiple Instruction Multiple Data" or MIMD).
There are many functions defined in the MPI library. We'll look
at a few of the most important and heavily used functions here. See
the documentation list below for more information.
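Here is a minimal sketch of the boilerplate that every MPI program in
this session shares (this is not the actual mpi_hello_world.cpp from
the handouts, just an illustration of the pattern, using the C++
bindings that come with Open MPI):

    // Hypothetical bare-bones MPI program (not the handout code).
    #include <iostream>
    #include <mpi.h>

    int main (int argc, char *argv[])
    {
      MPI::Init (argc, argv);                  // start up MPI
      int rank = MPI::COMM_WORLD.Get_rank ();  // which process am I?
      int size = MPI::COMM_WORLD.Get_size ();  // how many processes in all?

      std::cout << "Hello from process " << rank
                << " of " << size << std::endl;

      MPI::Finalize ();                        // shut down MPI
      return 0;
    }

Each process executes this same code independently; only the value of
rank distinguishes one process from another.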
- Setting up ssh.
This procedure will let you login to all of the machines in 1094
without giving a password every time. (Skip any steps you've
already done before.)
- Generate the files ~/.ssh/id_rsa and
~/.ssh/id_rsa.pub with the command:
ssh-keygen -t rsa
(hit return at the "Enter file" request). When prompted,
enter a passphrase
(this is like a password, but generally longer; a good example
is "physics780rocks" :), which you
will need to remember. Were id_rsa and id_rsa.pub created?
- Copy ~/.ssh/id_rsa.pub to the file ~/.ssh/authorized_keys
(you will have to create this file if it does not already exist).
Set the permissions with: chmod 600 authorized_keys
- Normally you would need to type ssh-agent $SHELL, but ssh-agent
should already be running (check with ps aux | grep agent).
- Give the command ssh-add. You will be prompted (just
this once!) for your passphrase. If successful, you will get
an "Identity added:" message.
- Check that you can ssh without a password by connecting
to at least four other computers (logging out each time; this only works
from your original computer, where you typed ssh-add).
The computer names in Smith 1094 start
with "n1094" and end with a, b, c, ..., r. The first time you
ssh, you will be asked if you want to connect. Answer "yes"
and then you won't be asked again.
Did it work?
- Setting up the environment.
You need to add "/usr/local/openmpi-1.4.1/bin" to your PATH (which is searched
to find programs) and
"/usr/local/openmpi-1.4.1/lib" to your LD_LIBRARY_PATH (which is searched to find libraries to link to).
You can set these by separate commands in your shell scripts,
but here we'll do it with the module program:
module load openmpi-1.4.1-gnu
(give the command module avail to see all of the available modules).
Check your shell with the command: printenv SHELL.
If you are using tcsh as your shell (most of you),
add the module command to your .cshrc.path file
(create this file in your main directory if it doesn't exist).
If you are using bash as
your shell, put the module command in your .bashrc.
To activate the module command the first time, type
source .cshrc.path (or source .bashrc). It will be automatically
sourced when you login (or ssh to another computer).
- MPI Hello, World! program.
- Go to the "mpi_examples" directory and look at the
"make_mpi_hello_world" makefile. Note that it compiles the
"mpi_hello_world.cpp" code using mpiCC, which for us invokes
g++ as the compiler. We've also limited the warning flags.
- The "mpi_hello_world.cpp" code illustrates some of
the basics of using MPI
with a workstation cluster. Look through the printout and then
compile and link using the makefile. To run it, we will
create a list of machines to use in starting up processes.
Create a file called
"machines.780" (or something like that, the actual name doesn't
matter) and put in a list of four computers to run on. Here is an
example (lines beginning with "#" are comments and don't need
to be included):
# Change this file to contain the machines that you want
# to run MPI jobs on. The format is one host name per line,
# with either
#    hostname
# or
#    hostname:n
# where n is the number of processors in an SMP. The hostname
# should be the same as the result from the command "hostname".
n1094a
n1094b
n1094c
n1094d
(We used n1094a instead of n1094a.mps.ohio-state.edu because that is
what is returned from "hostname").
To check whether a particular computer is available (i.e., turned on
and booted into Linux), for
example n1094b, give the command "ssh n1094b ps" (you'll either
get output on running processes, which means it is available, or a
"not available" response).
- Run the code with the command:
mpiexec -machinefile machines.780 -np 4 mpi_hello_world
The "-np 4" option tells MPI to use four processes.
Note that the same program runs independently ("asynchronously") four
times (these are the four processes), on four different computers in this
room.
- Change the global array in the program to have 10 unique entries
and try running the code with 10 processes (but with the same
machine list that has only 4 computers listed). What happens?
Run it a few times; do you always get responses in the same order?
Do you gain anything by running more processes than computers?
Do you lose anything?
- Send and Receive.
In the program from the last section, there was no communication
between the processes running in parallel.
Here we use a program that demonstrates two of the
functions to send and receive messages (there are more functions!).
The program doesn't do anything
but send numbers back and forth, but it can easily be generalized to
more useful communication.
- Take a look at the "mpi_send_receive.cpp" printout.
The part of the code that does the work is quite small; most
of the code is taken up by error
checking and processing, which is implemented using C++'s
"try" and "catch" formalism.
You'll find the standard calls to Init, Get_rank, and Get_size at
the beginning, and Finalize at the end. The processes are divided
into the "master" process (rank 0), which coordinates the
calculation, and the "slave" processes (all the others).
Send and Receive functions are used to transmit data (in this
case, just an integer) between master and slaves.
- Compile and link the code with "make_mpi_send_receive",
set up a machinefile (e.g., machines.780), and run with
four processors using
mpiexec -machinefile machines.780 -np 4 mpi_send_receive.
Verify that the program works and try running with 20 processes.
How are the "tags" being used?
- Modify the code so that a different test integer
is sent to each slave process (e.g., send the test integer read from
the terminal added to
the rank of the slave process), and then have
the slave processes send back a double
(MPI::DOUBLE), rather than an integer, equal to the square root
of the integer sent.
Did it work?
- Try modifying the code so that the master waits for each message
in turn from the slaves, rather than receiving from any source
(MPI::ANY_SOURCE).
What did you do?
- Pick one code from the quarter and describe how
you would use MPI Send and Receive to run it in parallel.
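To make the master/slave pattern concrete, here is a bare-bones
hypothetical sketch of the Send/Receive exchange (the actual
mpi_send_receive.cpp organizes this with the try/catch error handling
described above, so the details will differ):

    // Hypothetical master/slave Send/Receive sketch (not the handout code).
    #include <iostream>
    #include <mpi.h>

    int main (int argc, char *argv[])
    {
      MPI::Init (argc, argv);
      int rank = MPI::COMM_WORLD.Get_rank ();
      int size = MPI::COMM_WORLD.Get_size ();
      const int tag = 0;      // tag labeling this batch of messages

      if (rank == 0)          // master: send to each slave, collect replies
        {
          int test_int = 42;  // hypothetical payload
          for (int slave = 1; slave < size; slave++)
            {
              MPI::COMM_WORLD.Send (&test_int, 1, MPI::INT, slave, tag);
            }
          for (int slave = 1; slave < size; slave++)
            {
              int reply;
              MPI::COMM_WORLD.Recv (&reply, 1, MPI::INT,
                                    MPI::ANY_SOURCE, tag);
              std::cout << "master received " << reply << std::endl;
            }
        }
      else                    // slave: receive, do trivial "work", send back
        {
          int value;
          MPI::COMM_WORLD.Recv (&value, 1, MPI::INT, 0, tag);
          value += rank;
          MPI::COMM_WORLD.Send (&value, 1, MPI::INT, 0, tag);
        }

      MPI::Finalize ();
      return 0;
    }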
- Calculating pi in parallel.
Here we explore a program to calculate pi in parallel
using "collective communication" functions. This means that
information is distributed to and collected from the processes
all at once (rather than communicating with each one-to-one).
[Note: The infinite series used here
is not an efficient method to calculate pi!]
- Take a look at the "mpi_calculate_pi.cpp" printout.
Figure out from the code what series is being summed and how it
is being split among the processors.
Verify the sum using Mathematica.
- Compile and link the code with "make_mpi_calculate", set up
a machinefile (e.g., machines.pi), and run with four processors using
mpiexec -machinefile machines.pi -np 4 mpi_calculate_pi.
Try 1, 2, 3, and 10000 intervals.
When you quit you'll get a summary of elapsed time for each processor.
- Now run with 10 processes and step through different numbers of
intervals. Do the processors report in order?
- Modify the code so that you get the time each processor takes
for one interval (instead of all at once).
How much faster
is it to calculate 100,000 intervals on 4 machines than 1?
- Modify the code so that it inputs an integer and calculates
the product of integers from one up to the input integer.
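As a reference for these exercises, here is a hypothetical sketch of
the collective-communication pattern (the actual mpi_calculate_pi.cpp
may sum a different series and divide the work differently): Bcast
copies the number of intervals from rank 0 to every process at once,
each process sums every size-th term of a midpoint-rule integral of
4/(1+x^2) from 0 to 1, and Reduce adds up all the partial sums.

    // Hypothetical collective-communication pi sketch (not the handout code).
    #include <iostream>
    #include <mpi.h>

    int main (int argc, char *argv[])
    {
      MPI::Init (argc, argv);
      int rank = MPI::COMM_WORLD.Get_rank ();
      int size = MPI::COMM_WORLD.Get_size ();

      int num_intervals = 0;
      if (rank == 0)
        {
          std::cout << "Enter the number of intervals: ";
          std::cin >> num_intervals;
        }
      // Bcast: rank 0's value is distributed to every process at once
      MPI::COMM_WORLD.Bcast (&num_intervals, 1, MPI::INT, 0);

      double h = 1.0 / double (num_intervals);
      double partial_sum = 0.0;
      for (int i = rank; i < num_intervals; i += size)  // interleaved work
        {
          double x = h * (double (i) + 0.5);   // midpoint of interval i
          partial_sum += 4.0 / (1.0 + x * x);
        }
      partial_sum *= h;

      // Reduce: add the partial sums from all processes; answer on rank 0
      double pi = 0.0;
      MPI::COMM_WORLD.Reduce (&partial_sum, &pi, 1, MPI::DOUBLE,
                              MPI::SUM, 0);

      if (rank == 0)
        std::cout << "pi is approximately " << pi << std::endl;

      MPI::Finalize ();
      return 0;
    }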
- More information on MPI.
Introduction to Parallel Processing with OpenMP
OpenMP (not to be confused with Open MPI!) implements parallel processing
when there is shared memory, as is common these days with
multi-core hardware. If you have a dual-core processor that means that
in principle you can run your program twice as fast by running
two "threads" of your code
simultaneously on the two cores. One way to achieve this
in practice is to use OpenMP. We'll use a simple example written
by Chris Orban to illustrate how it works.
- The 1094 machines each have only one core, so you'll use one
of the public Linux machines (such as fox or canary). Log in
with ssh -Y fox (or whatever machine you choose) and copy
the program while in your working directory (which may be your login
directory) with cp ~ntg/public_html/simpson_cosint_openmp.cpp . (don't
forget the period, which means the current directory).
- Look at the program in an editor or in the printed copy. The
key features are the omp include statement and the #pragma omp
statement, which is an instruction to the compiler. What part
of the program is executed in parallel?
- Let's compare using one and two threads. There is a built-in
timer in the program, but this can be misleading, so give the (tcsh)
command set time = 3, which will automatically print the CPU usage
and the wall time after you run the program.
Compile the program with g++ following the instructions in the comments, and
then run it. Record the CPU time (first number), the wall time
(third number), and the percent of the CPU used (fourth number).
Then change to using one thread by typing:
setenv OMP_NUM_THREADS 1 and re-run the program.
Record the numbers again.
Why do you think the CPU time is about the same but the wall
time differs? Is the parallelization working?
- (BONUS) For completeness, you can try the Intel C++ compiler, following
the instructions in the top comment section of the program. Does
it behave the same way as g++?
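To see the whole OpenMP pattern in one place, here is a hypothetical
sketch in the spirit of simpson_cosint_openmp.cpp (the actual program
will differ in its details); compile with something like g++ -fopenmp.
The #pragma line splits the loop iterations among the threads, and the
reduction clause combines the per-thread partial sums safely.

    // Hypothetical OpenMP Simpson's-rule sketch (not the handout code).
    #include <iostream>
    #include <cmath>
    #include <omp.h>

    int main ()
    {
      const int N = 100000001;         // odd # of points (big, so timing shows)
      const double a = 0.0, b = 1.0;   // integrate cos(x) from a to b
      const double h = (b - a) / double (N - 1);

      double sum = 0.0;
      #pragma omp parallel for reduction(+:sum)
      for (int i = 1; i < N - 1; i++)  // interior points, Simpson weights
        {
          double x = a + double (i) * h;
          double weight = (i % 2 == 1) ? 4.0 : 2.0;
          sum += weight * std::cos (x);
        }
      sum += std::cos (a) + std::cos (b);  // endpoints have weight 1
      sum *= h / 3.0;

      std::cout << "integral = " << sum << " using up to "
                << omp_get_max_threads () << " threads" << std::endl;
      return 0;
    }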
Simulated Annealing
Standard optimization methods are very good at
finding local minima near where the minimization was started, but not
good at finding the global minimum. Finding the global minimum
of a function (such as the energy) is often (but not always) the goal.
One strategy using conventional minimizers is to start the minimization
at many different places in the parameter space (perhaps chosen at
random) and keep the best minimum found.
Here we'll adapt our Monte Carlo algorithm for generating a canonical
Boltzmann distribution of configurations at a temperature T to mimic how
physical systems find their ground states (i.e., the energy minimum at
T=0). At high temperature (compared to characteristic energy
spacings), the equilibrium distribution will include many states. If
the system is cooled slowly, then it will have the opportunity
to settle into the lowest energy state as T goes to zero. This is
called annealing. If the system is cooled quickly, it can get
stuck in a state that is not the minimum ("quenching"); this is
analogous to the routines we looked at in Session 10, which rapidly go
"downhill" to find local minima.
The strategy here is to simulate the annealing process by treating the
function to be minimized as an energy (it might actually be an energy!),
introducing an artificial temperature T, and generating a sequence of
states in a canonical distribution via the Metropolis algorithm.
Then we lower the temperature according to a "schedule" and let the
system settle into (hopefully!) a configuration that minimizes the
function.
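Schematically, with hypothetical control-parameter values
(multimin_sa_compare.cpp, used below, sets its own), the annealing
loop for a one-dimensional function f(x) looks like this:

    // Hypothetical simulated annealing sketch (not the handout code).
    #include <iostream>
    #include <cmath>
    #include <cstdlib>
    #include <ctime>

    double f (double x)   // the "energy" to be minimized
    {
      return std::exp (-(x - 1.) * (x - 1.)) * std::sin (8. * x);
    }

    double uniform (double a, double b)   // uniform random number in [a,b]
    {
      return a + (b - a) * double (rand ()) / double (RAND_MAX);
    }

    int main ()
    {
      srand (time (0));                // seed the random number generator

      double T = 10.;                  // initial (artificial) temperature
      const double T_min = 1.e-4;      // stop cooling here
      const double cooling = 0.99;     // schedule: T -> cooling * T
      const int steps_per_T = 100;     // Metropolis steps per temperature
      const double max_step = 1.;      // largest trial move in x

      double x = 15.5;                 // starting guess
      double E = f (x);
      while (T > T_min)
        {
          for (int i = 0; i < steps_per_T; i++)
            {
              double x_trial = x + uniform (-max_step, max_step);
              double delta_E = f (x_trial) - E;
              // Metropolis: always accept downhill moves; accept uphill
              // moves with probability exp(-delta_E/T)
              if (delta_E < 0. || uniform (0., 1.) < std::exp (-delta_E / T))
                {
                  x = x_trial;
                  E = f (x_trial);
                }
            }
          T *= cooling;   // cool according to the schedule
        }
      std::cout << "minimum found at x = " << x
                << ", f(x) = " << E << std::endl;
      return 0;
    }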
Global vs. Local Minimum
We first consider the simple example of a damped sine wave in one
dimension, which has
many local minima.
f(x) = e^(-(x-1)^2) sin(8x)
(This example is from the GSL documentation.) Suppose we start with an
initial guess of 15.5; let's see how
"downhill" minimization methods compare to simulated annealing.
- Plot the function in Mathematica (see test_function.nb)
so we know what to expect ahead of
time (remember that in general we won't be able to
do this!). Use FindMinimum starting near 15.5 to find the local
minimum. Then find the global minimum (using the graph to identify
a good starting value).
- The code multimin_sa_compare.cpp (compile and link with
make_multimin_sa_compare) applies the GSL routines we used in a
previous session to find the minimum and then applies a simple
simulated annealing algorithm. Run the program and look at the code to
figure out where the simulated annealing control parameters (starting
and finishing temperatures, maximum step size, cooling rate, # of
iterations at each temperature) are set.
How well does each method work?
- Now adjust the simulated annealing control parameters to give it a
chance! The step size should be large enough (or small enough)
so that successes are neither assured nor happen too infrequently
(50% would be ideal!).
The initial temperature should be considerably larger
than the largest typical delta_E normally encountered.
You shouldn't cool too rapidly. Make these adjustments and rerun the
code. What did you do to get it to work best?
- Just to check that the Metropolis algorithm is doing something,
once the code is working, change the line with "random < exp()"
to "random > exp()" and verify that it no longer works.
- Bonus: Switch the code to use classes defined in previous sessions.
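For reference, here is a hypothetical sketch of calling a GSL downhill
minimizer on f(x) directly, using the one-dimensional Brent routine as
a stand-in (we're not claiming this is how multimin_sa_compare.cpp is
organized, and the bracketing values below are guesses read off the
plot, which shows the global minimum near x = 1.36). Link with
-lgsl -lgslcblas.

    // Hypothetical GSL Brent minimizer sketch (not the handout code).
    #include <iostream>
    #include <cmath>
    #include <gsl/gsl_errno.h>
    #include <gsl/gsl_min.h>

    double my_f (double x, void *)   // the damped sine wave
    {
      return std::exp (-(x - 1.) * (x - 1.)) * std::sin (8. * x);
    }

    int main ()
    {
      gsl_function F;
      F.function = &my_f;
      F.params = 0;

      // Brent's method needs a bracket [a,b] with f(m) below both
      // f(a) and f(b); these values surround the global minimum.
      double m = 1.36, a = 1.2, b = 1.6;

      gsl_min_fminimizer *s
        = gsl_min_fminimizer_alloc (gsl_min_fminimizer_brent);
      gsl_min_fminimizer_set (s, &F, m, a, b);

      int status = GSL_CONTINUE;
      for (int iter = 0; iter < 100 && status == GSL_CONTINUE; iter++)
        {
          gsl_min_fminimizer_iterate (s);       // one Brent step
          m = gsl_min_fminimizer_x_minimum (s);
          a = gsl_min_fminimizer_x_lower (s);
          b = gsl_min_fminimizer_x_upper (s);
          status = gsl_min_test_interval (a, b, 1.e-5, 0.);
        }
      std::cout << "minimum at x = " << m << std::endl;

      gsl_min_fminimizer_free (s);
      return 0;
    }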
Last modified: 10:45 am, March 09, 2011.