Documentation/How to run MPICH jobs


Running MPI jobs on the clusters requires some changes to your SGE submission scripts. SGE will automatically allocate nodes for your job to run on and create a hosts file for mpirun. Before you begin, you will need to check that the cluster is configured for MPICH (not all of them are) and find out where the "mpirun" executable is.

To make sure the cluster is configured for MPICH, execute:

$ qconf -spl

This will list all of the parallel environments SGE knows about. One of the listed environments should be "mpich."
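The check above can be scripted. This is a sketch that looks for "mpich" in the parallel environment list; it assumes qconf is on your PATH (on a machine without SGE, it simply reports that the environment was not found):

```shell
# Check whether SGE knows about a "mpich" parallel environment.
# Errors from qconf are suppressed in case SGE is not installed here.
if qconf -spl 2>/dev/null | grep -qx 'mpich'; then
    echo "mpich parallel environment is configured"
else
    echo "mpich parallel environment not found on this host"
fi
```

Once you know the environment exists, `qconf -sp mpich` will print its configuration (slot allocation rule, start/stop procedures) if you want more detail.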

The "mpirun" program lives in different places on different clusters, but is generally found under the /opt directory. There may be multiple versions of MPICH available.
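One way to locate the candidates is to search the likely install prefixes. The paths below are assumptions; adjust them for your cluster:

```shell
# Search common install locations for mpirun binaries
# (/opt and /usr/local are assumptions; adjust for your cluster).
find /opt /usr/local -maxdepth 4 -type f -name mpirun 2>/dev/null || true
# If an mpirun is already on your PATH, this shows which one you would get:
command -v mpirun || echo "no mpirun on PATH"
```

If the search turns up more than one mpirun, make sure you pick the one belonging to the MPICH version your program was compiled against.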

Here is an example of an MPICH qsub file:

#!/bin/sh
#
# Your job name
#$ -N MPICH_JOB
#
# Use current working directory
#$ -cwd
#
# Join stdout and stderr
#$ -j y
#
# pe request for MPICH. Set your number of processors here.
# Make sure you use the "mpich" parallel environment.
#$ -pe mpich NUMBER_OF_CPUS
#
# Run job through bash shell
#$ -S /bin/bash
#
# The following is for reporting only. It is not really needed
# to run the job. It will show up in your output file.
echo "Got $NSLOTS processors."
echo "Machines:"
cat $TMPDIR/machines
# Adjust MPICH procgroup to ensure smooth shutdown
export MPICH_PROCESS_GROUP=no
#
# Use full pathname to make sure we are using the right mpirun
PATH_TO_MPIRUN/mpirun -np $NSLOTS -machinefile $TMPDIR/machines MPICH_PROGRAM

In the above example, you will need to replace the following with real values:

MPICH_JOB : The name you are assigning to this job.

NUMBER_OF_CPUS : The number of CPUs you would like the job to run with.

MPICH_PROGRAM : The MPI program you are running.

PATH_TO_MPIRUN : The full path to the "mpirun" executable. (This is not the same on all of our clusters, and some clusters contain more than one "mpirun" for different MPI implementations.)
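With hypothetical values filled in (the path and program name here are illustrative only; use the real ones for your cluster), the final line of the script might read:

```shell
# hypothetical example: substitute your cluster's mpirun path and your program
/opt/mpich/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./my_mpi_program
```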

To run this job, do exactly as you would with any other SGE job:

$ qsub job_script.sh

As with every job you submit to the clusters, we highly recommend you declare the resources that your program needs, especially memory. This will help SGE ensure that your job is run on a system with sufficient resources. To do this, add:

#$ -l vmem=MEMORY_REQUIREMENT

to the script, where MEMORY_REQUIREMENT is the maximum amount of memory you expect to use (e.g. 1G).
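For example, to request one gigabyte (the value is illustrative; size it to your program), add the directive to the script alongside the others, or pass it on the command line at submission time:

```shell
# In the job script:
#$ -l vmem=1G

# Or equivalently, at submission time:
# qsub -l vmem=1G job_script.sh
```

If your job exceeds the requested limit, SGE may kill it, so err on the side of a realistic maximum rather than a tight one.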

For more information, see $ man qsub