780.20: 1094 Session 15

Handouts: Printouts of mpi_hello_world.cpp, mpi_send_receive.cpp, mpi_calculate_pi.cpp, multimin_sa_compare.cpp, simpson_cosint_openmp.cpp, excerpt on simulated annealing

In this session, we'll try out some basic examples of parallel processing using MPI and OpenMP and consider a toy problem comparing multidimensional minimization with standard downhill routines for local minima to an application of Monte Carlo techniques to find the global minimum: simulated annealing.

Introduction to Parallel Processing with MPI

This is a very brief introduction by example to using "Open MPI", which is an implementation of the "Message Passing Interface" or MPI (a library and software standard). We've set things up so we can use the computers in Smith 1094 running Linux as parallel processors. Communication between machines is carried out via ssh but we'll set it up so you don't need to enter passwords repeatedly. If your computer is not in Linux already, restart it now.

In a message passing implementation of parallel processing, multiple "tasks" are created, each with a unique identifying name, which run on different computers (multiple processes can run on one machine as well). The tasks talk to each other by sending and receiving "messages". A "message" is an array with elements that include data, identifiers for the source and destination processes, and a tag used to distinguish between messages. Usually a fixed set of processes is created when the program is initialized but the processes can be executing different programs (this is referred to as "Multiple Instruction Multiple Data" or MIMD).

There are many functions defined in the MPI library. We'll look at a few of the most important and heavily used functions here. See the documentation list below for more information.

  1. Setting up ssh. This procedure will let you login to all of the machines in 1094 without giving a password every time. (Skip any steps you've already done before.)
    1. Generate the files ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub with the command: ssh-keygen -t rsa (hit return at the "Enter file" request). When prompted, enter a passphrase (this is like a password, but generally longer; a good example is "physics780rocks" :), which you will need to remember. Were id_rsa and id_rsa.pub created?

    2. Copy ~/.ssh/id_rsa.pub to the file ~/.ssh/authorized_keys (you will have to create this file if it does not already exist). Set the permissions with: chmod 600 authorized_keys
    3. Normally you would need to type ssh-agent $SHELL, but ssh-agent should already be running (check with ps aux | grep agent).
    4. Give the command ssh-add. You will be prompted (just this once!) for your passphrase. If successful, you will get an "Identity added:" message.
    5. Check that you can ssh without a password by connecting to at least four other computers (logging out each time; this only works from your original computer, where you typed ssh-add). The computer names in Smith 1094 start with "n1094" and end with a, b, c, ..., r. The first time you ssh, you will be asked if you want to connect. Answer "yes" and then you won't be asked again. Did it work?

  2. Setting up the environment. You need to add "/usr/local/openmpi-1.4.1/bin" to your PATH (which is searched to find programs) and "/usr/local/openmpi-1.4.1/lib" to your LD_LIBRARY_PATH (which is searched to find libraries to link to). You can set these by separate commands in your shell scripts, but here we'll do it with the module program:
       module load openmpi-1.4.1-gnu
    (give the command module avail to see all of the available modules). Check your shell with the command: printenv SHELL. If you are using tcsh as your shell (most of you), add the module command to your .cshrc.path file (create this file in your main directory if it doesn't exist). If you are using bash as your shell, put the module command in your .bashrc. To activate the module command the first time, type source .cshrc (or source .bashrc). It will be automatically sourced when you login (or ssh to another computer).
  3. MPI Hello, World! program.
  4. Send and Receive.

    In the program from the last section, there was no communication between the processes running in parallel. Here we use a program that demonstrates two of the "point-to-point communication" functions to send and receive messages (there are more functions!). The program doesn't do anything but send numbers back and forth, but can easily be generalized to more substantive tasks.

  5. Calculating pi in parallel.

    Here we explore a program to calculate pi in parallel using "collective communication" functions. This means that information is distributed to and collected from the processes all at once (rather than communicating with each one-to-one). [Note: The infinite series used here is not an efficient method to calculate pi!]

  6. More information on MPI.

Introduction to Parallel Processing with OpenMP

OpenMP (not to be confused with Open MPI!) implements parallel processing when there is shared memory, as is common these days with multi-core hardware. If you have a dual-core processor that means that in principle you can run your program twice as fast by running two "threads" of your code simultaneously on the two cores. One way to achieve this in practice is to use OpenMP. We'll use a simple example written by Chris Orban to illustrate how it works.

  1. The 1094 machines each have only one core, so you'll use one of the public Linux machines (such as fox or canary). Log in with ssh -Y fox (or whatever machine you choose) and copy the program while in your working directory (which may be your login directory) with cp ~ntg/public_html/simpson_cosint_openmp.cpp . (don't forget the period, which means the current directory).
  2. Look at the program in an editor or in the printed copy. The key features are the omp include statement and the #pragma omp statement, which is an instruction to the compiler. What part of the program is executed in parallel?

  3. Let's compare using one and two threads. There is a built-in timer in the program, but this can be misleading so give the command set time = 3, which will automatically print the cpu usage and the wall time after you run the program. Compile the program with g++ following the instructions in the comments, and then run it. Record the CPU time (first number), the wall time (third number), and the percent of the CPU used (fourth number).

    Then change to using one thread by typing: setenv OMP_NUM_THREADS 1 and re-run the program. Record the numbers again. Why do you think the CPU time is about the same but the wall time differs? Is the parallelization working?

  4. (BONUS) For completeness, you can try the Intel C++ compiler, following the instructions in the top comment section of the program. Does it behave the same way as C++?

Simulated Annealing

Standard optimization methods are very good in finding local minima near where the minimization was started, but not good at finding the global minimum. Finding the global minimum of a function (such as the energy) is often (but not always) the goal. One strategy using conventional minimizers is to start the minimization at many different places in the parameter space (perhaps chosen at random) and keep the best minimum found.

Here we'll adapt our Monte Carlo algorithm for generating a canonical Boltzmann distribution of configurations at a temperature T to mimic how physical systems find their ground states (i.e., the energy minimum at T=0). At high temperature (compared to characteristic energy spacings), the equilibrium distribution will include many states. If the system is cooled slowly, then it will have the opportunity to settle into the lowest energy state as T goes to zero. This is called annealing. If the system is cooled quickly, it can get stuck in a state that is not the minimum ("quenching"); this is analogous to the routines we looked at in Session 10, which rapidly go "downhill" to find local minima.

The strategy here is to simulate the annealing process by treating the function to be minimized as an energy (it might actually be an energy!), introducing an artificial temperature T, generating a sequence of states in a canonical distribution via the Metropolis algorithm. Then we lower the temperature according to a "schedule" and let the system settle into (hopefully!) a configuration that minimizes the energy.

Global vs. Local Minimum

We first consider the simple example of a damped sine wave in one dimension, which has many local minima.
f(x) = e-(x-1)2sin(8x)

(This example is from the GSL documentation.) Suppose we start with an initial guess of 15.5; let's see how "downhill" minimization methods compare to simulated annealing.

  1. Plot the function in Mathematica (see test_function.nb) so we know what to expect ahead of time (remember that in general we won't be able to do this!). Use FindMinimum starting near 15.5 to find the local minimum. Then find the global minimum (using the graph to identify a good starting value). Answers:

  2. The code multimin_sa_compare.cpp (compile and link with make_multimin_sa_compare) applies the GSL routines we used in a previous session to find the minimum and then applies a simple simulated annealing algorithm. Run the program and look at the code to figure out where the simulated annealing control parameters (starting and finishing temperatures, maximum step size, cooling rate, # of iterations at each temperature) are set. How well does each method work?

  3. Now adjust the simulated annealing control parameters to give it a chance! The step size should be large enough (or small enough) so that successes are neither assured nor happen too infrequently (50% would be ideal!). The initial temperature should be considerably larger than the largest typical delta_E normally encountered. You shouldn't cool too rapidly. Make these adjustments and rerun the code. What did you do to get it to work best?

  4. Just to check that the Metropolis algorithm is doing something, once the code is working, change the line with "random < exp()" to "random > exp()" and verify that it no longer works.
  5. Bonus: Switch the code to use classes defined in previous sessions.

780.20: 1094 Session 15. Last modified: 10:45 am, March 09, 2011.