Documentation/HPC cluster


Overview of the cluster

HPC compute cluster

The primary, general-purpose compute cluster at C2B2 is now called "HPC." Titan, in service since 2008, has been replaced with this newer, more robust system. HPC is a Linux-based (CentOS 6.5) compute cluster consisting of 528 HP blade systems, 2 large-memory (1TB) servers, two head nodes and a virtualized pool of login (submit) nodes. The nodes fit in a dense configuration in 12 high-density racks and are cooled by dedicated rack refrigeration systems.

Each blade node has twelve cores (two hex-core processors) and either 32GB (480 nodes) or 96GB (48 nodes) of memory. This provides 6,336 compute cores and roughly 20 TB of total RAM. In addition, there are 2 nodes with 1TB of memory and 16 physical cores each (32 with hyper-threading). Each node has a 10 Gbps Ethernet connection to the cluster network, and 32 of the nodes (256 CPU-cores) are also linked with 40 Gbps QDR InfiniBand. Finally, a set of login nodes running Citrix XenServer virtualization provides a pool of virtual login nodes for user access to this and other systems.

Like our previous clusters, the system is controlled by Sun Grid Engine (SGE), but with some important configuration differences from the previous clusters (detailed below).

Storage for the cluster is provided exclusively by our Isilon clustered filesystem, which provides multiple (presently 36) 10 Gbps points of access for NFS storage and a scalable configuration and over 1 PB of total capacity.

Accessing the system

Access is provided to HPC through a pool of 5 login nodes. Simply SSH to login.c2b2.columbia.edu and log in with your ARCS domain username and password. You will automatically be placed on one of our login nodes. Please note that no computation should take place on the login nodes; they are intended only as a gateway to our storage and computing systems. Queueing jobs is discussed below.
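
For example, if your ARCS username were jsmith (a placeholder), you would connect like this:

$ ssh jsmith@login.c2b2.columbia.edu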

Using the storage

HPC exclusively mounts the Isilon filesystem (/ifs). The /ifs filesystem is broken up into four sections: /ifs/home, /ifs/data, /ifs/scratch and /ifs/archive. Users should have folders in at least home and scratch, and most will have folders in data and archive as well.

  • /ifs/home is where your login home directory lives. It has regular filesystem snapshots and backups.
  • /ifs/scratch is where most cluster computing should happen. It has no snapshots or backups, but provides high-speed storage for computing.
  • /ifs/data is a middle ground between home and scratch. It has no regular snapshots, but it is backed up once per month for disaster recovery. This area is intended for items like computationally generated databases that need to be backed up, but are also written to by the cluster.
  • /ifs/archive provides low-cost archival storage for data that is no longer active. /ifs/archive space is accessible from the login nodes but not from the compute nodes on the cluster.

!!!IMPORTANT!!! Because cluster-generated data should not usually be backed up, and filesystem snapshotting does not deal well with the large volume of filesystem changes that cluster computing tends to produce, /ifs/home should never be used as a working directory for cluster computing. With this in mind, /ifs/home is mounted read-only on the cluster back-end nodes. One of the most common early problems users have on HPC is accidentally trying to run out of the /ifs/home directory. Jobs that try to do this will go into "error" state and will not run.

You can see your filesystem quotas by using: $ df -h <your_directory> while on the login nodes. Your directory (or your lab's directory in some cases) presents itself as an independent disk drive with a maximum size equal to your quota. Note that this will not work on the compute nodes (e.g. through qrsh).
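
For example, to check the quota on a scratch directory (the path below is the same example path used later on this page; substitute your own):

$ df -h /ifs/scratch/c2b2/my_lab/user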

There are some important guidelines for storage use that are specific to cluster computing:

  • Again, /ifs/home is read-only on the compute cluster. Your analysis tasks should be done in /ifs/scratch. Results that need to be backed up and resource databases can be stored in /ifs/data. We don't allow write access to /ifs/home, and we discourage writing to /ifs/data from the cluster, because the interaction with snapshots affects the performance of the Isilon storage system.
  • Do not store excessive numbers of files in a single directory. In general, storing more than a few thousand files in a directory can start to noticeably degrade storage performance. Storing tens of thousands of files or more can seriously damage overall performance and often requires significant time and effort to rectify. If you need to store a large number of files, either use archival techniques (e.g. the tar or cpio commands) or break the files up into a tiered directory structure.
  • Storing a large number of very small files can also take up much more space than you expect. Because of the way large clustered filesystems work, a 4KB file (for instance) will really take up roughly 128KB of space. Small files can easily be archived into a single large file using the `tar` command to save space and reduce filesystem clutter (see the example below).
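
For instance, a directory full of small result files can be bundled into a single compressed archive and the originals removed once the archive has been verified (the directory name here is only an illustration):

$ tar -czf results_2017.tar.gz results_2017/                        # results_2017/ is a hypothetical directory
$ tar -tzf results_2017.tar.gz > /dev/null && rm -r results_2017/   # verify the archive reads back, then remove the originals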

Temporary Files -- new for HPC cluster

Some programs create and manipulate small temporary files during computation (see, for instance, Running compiled Matlab jobs below). In some special cases, it may be more efficient to use the local disk on the node instead of the scratch space. On the HPC cluster there is now about 400GB of space on each cluster node for local temporary file use. You can put files in the directory named by the $TMPDIR environment variable. The $TMPDIR directory is automatically set up by SGE when a job starts and is automatically removed when the job finishes.
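
A minimal sketch of a job script that works out of $TMPDIR (the program name and result file are hypothetical; the scratch path follows the example used elsewhere on this page):

#!/bin/bash
#$ -l mem=2G,time=1:: -S /bin/bash -cwd -N tmpdir_example
# $TMPDIR is created by SGE on the node's local disk and removed when the job ends
cd $TMPDIR
# run the (hypothetical) program, writing its output to the local disk
/ifs/scratch/c2b2/my_lab/user/my_program > result.out
# copy anything worth keeping back to network scratch before the job finishes
cp result.out /ifs/scratch/c2b2/my_lab/user/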

For a more detailed description of storage and its use, see Storage architecture.

Finding and running applications

We support a limited number of applications on the cluster, such as Java, Perl, Matlab and R. While we will try to maintain current versions of these applications, we cannot manage the modules/classes that specific applications need. In general, if you need specific modules you should compile them into your home, data or scratch directory (or perhaps into your lab's "share" directory). Applications can be found in /nfs/apps. These applications are available only on the back-end nodes. We do not recommend that you use the versions included in the cluster operating system (e.g. /usr/bin/perl), as those are intended for running the cluster itself and may change without warning.

If there is an application that you believe should be provided in /nfs/apps, you can send a request to rt <at> c2b2.columbia.edu; however, we will probably only agree to maintain applications that will be widely used by multiple labs.

Queueing and SGE

HPC is controlled by Sun Grid Engine (SGE). There are several aspects of queuing that are unique to the HPC configuration:

Memory management

We have had consistent problems in the past with cluster jobs running nodes out of memory and thereby crashing the nodes. This is problematic because not only does the user's job die, but everyone else's jobs on the same node die as well. This becomes an even greater problem on HPC nodes, where up to 12 jobs may be running on the same node: one user overflowing memory could cause 11 other people's work to fail.

In order to solve this problem, the queuing system requires that all jobs be submitted with hard memory limit values. This allows the queuing system to ensure that all jobs end up on nodes that can support their memory requirements, and also to enforce that jobs stay within the bounds they advertise. It is important to note that if you try to exceed the limit you requested, your job will be killed. This prevents your job from crashing the node and disrupting the work of others.

Memory limits can be specified on the command line or through a qsub script by supplying the option "-l mem=<memory_spec>", where <memory_spec> is of the form 2G, 512M, etc. You do not want to cut your memory estimate too close, to avoid having your job killed; but you also do not want to request significantly more memory than you will use, because SGE has to find a slot with sufficient memory available and an over-sized request makes it harder to get your job through the queue.

If you do not specify the "-l mem=<memory_spec>" option, your job will run with a default 1GB memory limit. If your job uses more memory than this, it will die without completing.

We recommend that you give yourself a 25% - 50% margin when selecting memory requirements to make sure your jobs do not die unnecessarily. The scheduler is built to compensate for these over-estimates. As a result, even if you know that your job will use exactly 2GB of memory, you should request 2.5GB or 3GB.
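
For example, a job measured to use about 2GB could be submitted with a 2.5GB limit like this (myjob.sh is a hypothetical script):

$ qsub -l mem=2.5G,time=2:: myjob.sh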

Time based queuing

A time-based queuing system is in place that limits the number of slots available to a job based on the job's runtime limit. This replaces the hard limit on the number of jobs a user could run on our previous system. As with the memory settings, the time limit of a job can be given either on the command line or in a qsub script using the "-l time=<hh:mm:ss>" option, e.g. time=1:15:30 (1h, 15m, 30s) or time=:20: (20m).

The time queuing system works by splitting the cluster into different time-limit queues. The size and limits of the queues are optimized based on cluster usage. The queues have names like "std_6h.q", which indicates that the maximum runtime for a job in that queue is 6 hours (21,600 seconds). You need not worry about telling the system which queue to use: if you specify the "-l time=<hh:mm:ss>" parameter, the queuing system will automatically find the queues your job can fit in. There are presently seven regular queues (plus special queues for interactive, InfiniBand, and GPU jobs): std_6h.q, std_12h.q, std_1d.q, std_2d.q, std_1w.q, std_2w.q and std_1m.q. For example, a job with a 16-hour limit will automatically be placed in the shortest (by time) queue available to it, std_1d.q.
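
As a concrete sketch, the 16-hour job mentioned above could be submitted as follows (job.sh stands in for your own script), and the scheduler would place it in std_1d.q automatically:

$ qsub -l mem=2G,time=16:: job.sh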

This table illustrates the current bin sizes (these may change over time):

bin       max runtime           CPUs available
std_6h    6 hr                  396
std_12h   12 hr                 660
std_1d    24 hr                 792
std_2d    48 hr                 924
std_1w    1 week  (168 hr)      792
std_2w    2 weeks (336 hr)      660
std_1m    1 month (672 hr)      396

New scheduling priority algorithm (sharetree + functional)

As you use the system, you will begin to notice that your waiting jobs have a "priority" value assigned to them. As the name suggests, this value helps the queuing system decide which jobs will run in which order. Multiple variables are used to determine the priority of any particular job. They can be broken down as follows:

  • Functional scheduling policy: The functional scheduling policy tries to enforce an equal distribution of users at any given time. Each user is equally weighted, and as users enter the queue, the priorities will be adjusted to distribute jobs fairly.
  • Sharetree scheduling policy: While the functional scheduling policy enforces a fair distribution of jobs at any instance, share tree queuing ensures that over a longer period of time usage remains fair. This means that a user who has been running a lot of jobs over, say, the last week, will be given lower priority than a user who is submitting jobs for the first time in a while.
  • Urgency: Urgency scheduling adjusts priority based on the resources requested and the length of time the job has been waiting. In general, jobs that request more resources will be given higher priority to compensate for the difficulty of scheduling larger resources. Additionally, the urgency policy makes a job's priority go up over time, in an attempt to ensure that no job has to wait too long compared to other jobs.
  • User priority: Using the "-p" option, users can set a priority that affects only their own jobs. The user priority helps to control which of a particular user's jobs will run in which order (see the example below).
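
As a small example (the script names are hypothetical), a user can lower the priority of a less urgent job relative to their other waiting jobs; note that in SGE regular users can typically only lower priority (values from -1023 up to 0):

$ qsub -l mem=1G,time=2:: important_job.sh
$ qsub -p -100 -l mem=1G,time=2:: less_urgent_job.sh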

Putting it together: an example session

We are aware that users may not know exactly how long their program is going to run or how much memory it will need. When this is the case, we recommend that you run the program a small number of times, asking for more resources than you expect to use, and then use the `qacct -j <job_id>` command to determine the resources the program usually requires. qacct is available on the login nodes (it is currently not available in interactive sessions). For example, if you have a program called `workhard` that you would like to use on the cluster, the following might be a reasonable workflow for getting it running successfully:

1. qlogin to an interactive node to test to make sure the program works on the cluster:

$ ssh user@login.c2b2.columbia.edu
$ qlogin -l mem=2G,time=:20:
$ cd /ifs/scratch/c2b2/my_lab/user
$ ./workhard

Program returned successfully.

$ exit

2. submit a qsub job with larger resources than expected:

-- BEGIN: workhard.sh --
#!/bin/bash
#$ -l mem=4G,time=1:: -S /bin/bash -N workhard -j y -cwd
./workhard
-- END: workhard.sh --

And run:

$ qsub workhard.sh
Your job 144 ("workhard.sh") has been submitted
<wait for job to complete> 
$ qacct -j 144
start_time   Wed May  6 13:34:34 2009
end_time     Wed May  6 13:56:04 2009
maxvmem      1.05G

3. Now I know that my program uses a little more than 1G of memory and runs for a little over 20 minutes. I can now adjust the requests and submit multiple jobs:

-- BEGIN: workhard.sh --
#!/bin/bash
#$ -l mem=1.2G,time=:30: -S /bin/bash -N workhard -j y -cwd
./workhard 
-- END: workhard.sh --

And run:

$ qsub -t 1-40 workhard.sh

4. Interactive session running Matlab.
SSH to a login node:

-bash-4.1$ hostname
login3.hpc

As you can see, we are on login node 3. Let's use the qrsh command to start an interactive session for up to 30 minutes with 4G of RAM.

-bash-4.1$ qrsh -l mem=4G,time=:30:

Once we are scheduled on a node, we are dropped back to a prompt. Let's check whether we were successful.

-bash-4.1$ hostname
ha5c6n8.hpc
-bash-4.1$ qstat
job-ID  prior   name       user         state submit/start at     queue                          jclass                         slots ja-task-ID
------------------------------------------------------------------------------------------------------------------------------------------------
8629565 1.00500 QRLOGIN    fs2307       r     05/16/2017 12:53:47 int.q@ha5c6n8.hpc                                                 1

As you can see, we are now on one of the cluster nodes, no longer on a login node, and we have an interactive session running with job ID 8629565. If the session crashes and we think it might be memory related, we can use the `qacct -j 8629565` command afterwards to see memory usage and status. Now let's load the Matlab module (you can manually set your path, or just use the full path to call the version of Matlab you want instead).

-bash-4.1$ module load matlab/2016a
-bash-4.1$ which matlab
/nfs/apps/matlab/2016a/bin/matlab

Let's start matlab.

-bash-4.1$ matlab -nodisplay -nojvm -singleCompThread

                                                                            < M A T L A B (R) >
                                                                  Copyright 1984-2016 The MathWorks, Inc.
                                                                   R2016a (9.0.0.341360) 64-bit (glnxa64)
                                                                             February 11, 2016  


For online documentation, see http://www.mathworks.com/support
For product information, visit www.mathworks.com. 


        Academic License

At this point I can type commands and see the results of each command.

>> echo on;
>> x=1;
>> disp(x)
    1

>> y=x+2;
>> disp(y)
     3

>> exit;

I have exited Matlab. Let's make sure I am still in the interactive session.

-bash-4.1$ hostname
ha5c6n8.hpc

OK, I am still there. If I want to end the session now, I can just type exit.

-bash-4.1$ exit
Logout
-bash-4.1$ hostname
login3.hpc

We are back on the login node. Let's see how many resources we used:

qacct -j 8629565
==============================================================
qname        int.q
hostname     ha5c6n8.hpc
group        sysops
owner        fs2307
project      sysops
department   sysops
jobname      QRLOGIN
jobnumber    8629565
taskid       undefined
account      sge
priority     0
qsub_time    Tue May 16 12:52:36 2017
start_time   Tue May 16 12:53:47 2017
end_time     Tue May 16 12:57:45 2017
granted_pe   NONE
slots        1
failed       0
exit_status  0
ru_wallclock 238
ru_utime     6.807
ru_stime     1.094
ru_maxrss    213492
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    151599
ru_majflt    532
ru_nswap     0
ru_inblock   355648
ru_oublock   5336
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     7924
ru_nivcsw    8281
cpu          7.901
mem          3.946
io           0.042
iow          0.000
maxvmem      1.388G
arid         undefined
jc_name      NONE  

Looking at maxvmem, we can see that we never went above roughly 1.388G of RAM. So for this very basic calculation, requesting 4G was a bit too much, and 2G would probably have been a safe enough request.
more qacct examples

Common problems when migrating from older SGE clusters (Titan)

Some scripts might point to an application location using the "current" symbolic link, for example:

/nfs/apps/matlab/current

This may resolve to a different version of the application than the script expects and cause the script to fail. In general, for scripts that need to be run many times over a long period of time, and for pipeline jobs, we recommend not using "current"; always use the path to a specific version of the application, for example:

/nfs/apps/matlab/2013b 

When a new version of an application becomes available, "current" is switched to the latest release, and that may cause older scripts to break. Don't use "current" unless your script is expected to work with a changing version of the application and should always use the latest one.

The issue with loading .bash_profile and .bashrc that some users had in the first wave of migration has been fixed; .bash_profile and .bashrc are now loaded automatically for all queues, as they were on Titan, so passing -l on the first line of the submission script is no longer required.

Final notes

Please offer feedback on the cluster and we will work to make it as reliable, powerful and usable as possible.

To request access to HPC, please send a ticket to rt <at> c2b2.columbia.edu, or through the web interface at https://rt.c2b2.columbia.edu

Additional notes & examples

Module system and Available Software

See Module System

Access to different GCC on CentOS 7

Multiple versions of GCC are available on CentOS 7. To access them use devtoolset system.

Currently, the following devtoolsets are installed:

devtoolset-3 - gcc 4.9.2-6
devtoolset-4 - gcc 5.3.1-6
devtoolset-6 - gcc 6.3.1-3
devtoolset-7 - gcc 7.3.1-5
devtoolset-8 - gcc 8.2.1-3

To enable a specific toolset, use the scl command, as in the following example:

$ gcc --version | grep GCC
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
$ scl enable devtoolset-6 bash
$ gcc --version | grep GCC
gcc (GCC) 6.3.1 20170216 (Red Hat 6.3.1-3)
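
If you need one of the newer compilers inside a batch job, one approach (a sketch; the source file and program name are hypothetical) is to wrap the build command in scl enable within your job script:

#!/bin/bash
#$ -cwd -S /bin/bash -N build_job -l mem=1G,time=1::
# compile with devtoolset-7's gcc instead of the system gcc, then run the result
scl enable devtoolset-7 "gcc -O2 -o myprog myprog.c"
./myprog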


Access to old CentOS 6 nodes

To run your job on old CentOS6 nodes you have to add

-l centos6=1 

to your job submission parameters.

PS: There are only a few nodes still running CentOS 6, so wait times for those are longer. Certain combinations of memory requests might not be possible (for example, there are no nodes with more than 256G of RAM running CentOS 6).
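
As a sketch (legacy_job.sh is a hypothetical script), the centos6 resource is combined with the usual memory and time requests like this:

$ qsub -l centos6=1,mem=2G,time=2:: legacy_job.sh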

Running an array job

Many analysis tasks involve running the same program repeatedly with slightly different parameters. In these cases, it is much easier, both for the user and for the queuing system, to submit all of these jobs as one array job. Array jobs submit the same script a specified number of times and increment the environment variable SGE_TASK_ID for each instance. Array jobs appear as a single job with many different tasks.

In order to submit an array job you need to supply the array option ('t') to your qsub, e.g.

$ qsub -t <num_tasks> job.sh

where <num_tasks> is the number of instances of this job that should be run. Each task will be run with one of the integer values between 1 and <num_tasks> assigned to SGE_TASK_ID (note: you can be more specific about how SGE iterates over this variable, e.g. running only odd task IDs; see `man qsub` for details, and the example below). It is up to your script to decide how to handle that variable.
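
For instance, SGE's -t option also accepts a range with a step size, so submitting only the odd task IDs from 1 to 99 would look like this:

$ qsub -t 1-99:2 job.sh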

There are many ways to use the SGE_TASK_ID variable, but one very common task is simply to run the same program with multiple different sets of arguments. Here is an example (which can easily be generalized) of how you might run an array job.

Suppose you have several different sequences, seq-A.fa, seq-B.fa, seq-C.fa, etc.. You want to use blastp against a database "some_db" for each of these sequences. Rather than submit these all individually, here's a quick and easy way to submit an array job.

Assume that everything lives in the same directory for convenience.

1. Create a file that contains one sequence filename per line, e.g.:

ls seq-* > sequence-files

2. Create the following script called "blast.sh":

#!/bin/bash
# Setup qsub options
#$ -l mem=1G,time=1:: -cwd -S /bin/bash -N blast
# If SGE_TASK_ID is N, set SEQ_FILE to the Nth line of "sequence-files"
SEQ_FILE=`awk 'NR=='$SGE_TASK_ID'{print}' sequence-files`
# Run blast with the $SEQ_FILE as input, and $SEQ_FILE.out as output
/nfs/apps/blast/current/bin/blastall -p blastp -d some_db -i $SEQ_FILE -o $SEQ_FILE.out

3. Submit the script with -t specifying the number of lines in "sequence-files" (in this case, the task range runs from 1 to the number of lines):

qsub -t 1-`wc -l < sequence-files` blast.sh

The job will now run a task for each sequence file, outputting results to seq-A.fa.out, seq-B.fa.out, etc.

Running compiled Matlab jobs

Compiled matlab jobs present a couple of challenges to running on any large cluster. Matlab uses a cache directory, called the MCR cache, to extract and manage the tools and libraries that each executable needs to run. By default this would be created in your home directory. Since the home directory is read-only, the location of the cache needs to be set to a writeable area.

Additionally, running multiple concurrent jobs against the same cache directory causes the cache to become corrupt, which can lead to erratic behavior and failures in the matlab jobs. To ensure cache coherency, the cache directory should be set to a different location for each job. Because these cache directories can build up over time, it is important to clean up old cache directories. You can either direct the cache to, e.g., the /ifs/scratch filesystem, or you can use the $TMPDIR environment variable, which points to a special directory on the local hard disk of the node that will automatically be deleted once the job completes.

NOTE: The MCR cache is very small (< 32M) so it is fine to use the $TMPDIR to store the cache; 
however, HPC is not designed to support larger usage of the $TMPDIR since the local disks are small and slow.

The appropriate environment variables to set are:

MCR_CACHE_SIZE - Sets the maximum size of the cache (should probably be the default, 32M)

MCR_CACHE_ROOT - The location of the MCR cache

The following is an example qsub script for running a compiled matlab job on HPC:

#!/bin/bash
#$ -cwd -S /bin/bash -N matlab_job
export LD_LIBRARY_PATH=/nfs/apps/matlab/current/bin/glnxa64:/nfs/apps/matlab/current/sys/os/glnxa64:$LD_LIBRARY_PATH
export MCR_CACHE_SIZE=32M
export MCR_CACHE_ROOT=$TMPDIR/mcr_cache
mkdir -p $TMPDIR/mcr_cache

/path/to/your/executable <arguments>

Please keep in mind that in this example we used /nfs/apps/matlab/current as the base location for Matlab. When running a compiled script, it is better to specify the version you want to use. In the future, as new versions become available, "current" may be switched to point to a newer version, which can cause old scripts to break if there are incompatibilities between the versions' libraries.

Running with resource reservations

Some cluster jobs require a very large amount of concurrent resources (e.g. CPUs or RAM) to run, and it may be difficult for the scheduler to find an open resource to run the job. To help these jobs get scheduled in a timely way, SGE uses "resource reservations." A resource reservation puts placeholders in the queueing system to make sure that the necessary resources free up as soon as possible.

It works like this: suppose you want to submit a 32-CPU MPI job (see below for MPI examples). If the cluster is busy, there may not be 32 applicable slots available at any given moment to start the job. Without reservations, the queuing system will just keep trying each cycle to see if 32 suitable CPUs happen to become available. In practice, for large jobs in a busy system, this will result in almost indefinite wait times.

With the resource reservations enabled, the queuing system will pre-decide which slots the job is going to run in (based on an internal optimization algorithm). Some of these slots may have currently running jobs in them, and some may be free. The queuing system will make sure that nothing occupies the reserved slots in such a way as to delay the start time of the waiting job.

So, say, your 32-CPU job is waiting on 16 currently occupied slots. The queuing system (because of our time request policy) knows about how long each of the running jobs will last. If it knows that the last of the 16 currently running jobs will free up in 30 minutes, it will make sure no new jobs will be scheduled in the reserved slots that will finish later than 30 minutes from now (Though it may schedule short running jobs in reserved slots that will finish in 30 minutes or less. This is a process known as "backfilling.").

Because computing resource reservations is a computationally intensive task for the scheduler, reservations are not turned on by default. In general, it is only necessary for jobs that require exceptional amounts of CPUs or RAM. To enable resource reservations, simply add the "-R y" option to your qsub, e.g.

$ qsub -R y job.sh

Running an SMP (multi-threaded) job

When you request a slot on hpc you are allocated a single CPU core. SMP jobs run on multiple CPU cores and therefore require more resources than a single slot allocates. In order to correctly run a multi-threaded job on hpc you need to use a parallel environment. The parallel environment for SMP jobs is called "smp." You can request an SMP job using the '-pe' option when you qsub your job. The syntax for the '-pe' option is:

-pe smp <number_of_threads>

You can pass this option either through the command line like this:

qsub -pe smp 4

which would allocate enough cores for a 4-threaded job. Alternatively, you can specify the same option in your qsub script like this:

#$ -pe smp 4

The maximum number of threads you can request on hpc is 12, since the hpc nodes have 12 cores each. Because all of the cores must be allocated on the same node, requesting 12 cores requires an entire node to be free, which may take longer to queue. In some cases it may be more efficient to use fewer threads; we advise our users to request 8 or fewer threads (see the example below).
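
As a sketch (the program name and its thread flag are hypothetical), a complete submission script for a 4-thread SMP job could look like the following; SGE sets $NSLOTS to the number of slots actually granted:

#!/bin/bash
#$ -cwd -S /bin/bash -N smp_job
#$ -l mem=2G,time=4::
#$ -pe smp 4
# $NSLOTS is set by SGE to the number of slots granted (4 here)
./my_threaded_program --threads $NSLOTS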


Running an OpenMPI job

This section is under construction. MPI (Message Passing Interface) is a method for running parallel jobs across multiple nodes. This gets past the problem mentioned in the SMP section of having to allocate all of the resources on a single node, and as such can scale, in some cases, to hundreds or thousands of cores for a single job. However, your program must be specifically written to use MPI.

The most common MPI implementation found these days is OpenMPI, and current versions of OpenMPI are supported on hpc. In order to use OpenMPI you will need to use a parallel environment. The parallel environment for OpenMPI is called "orte." As in the SMP case, the syntax is:

-pe orte <number_of_slaves>

where <number_of_slaves> is the number of cores requested for the MPI job.

In addition to running the job in the appropriate parallel environment, you need to run your job with mpirun. You need to set up your library path for OpenMPI and run the appropriate mpirun command. The MPI libraries are located in:

/nfs/apps/openmpi/current/lib

(current is a symbolic link to the latest openmpi version).

To set up your execution, library and man paths in your script environment, you need to source the following shell script

. /nfs/apps/openmpi/current/setenv.sh

in your job script.

SGE has "tight integration" with OpenMPI, so you shouldn't need to pass any parameters to mpirun. It will automatically know how many slaves it can run and where to run them.

As an example, say you want to run your program "mpiprog" on 32 cores. You might use the following script:

#!/bin/bash
#$ -cwd -S /bin/bash -N mpiprog
#$ -l mem=1G,time=6::
#$ -pe orte 32
. /nfs/apps/openmpi/current/setenv.sh
mpirun mpiprog

This will dispatch a 32 core OpenMPI job on the cluster.

Note: your memory resource request is the memory per slave, not the total memory of your job. The above example will allocate a total of 32GB of memory across the cluster.

If your MPI job is "communications bound" (i.e. it passes many large messages or many very rapid messages), it may be to your advantage to use the InfiniBand-enabled section of the cluster. InfiniBand is a very high-speed, low-latency interconnect that OpenMPI is optimized to use. There are a total of 256 cores available that are linked with InfiniBand.

In order to use the InfiniBand portion of the cluster you will want to specify that your job needs the infiniband resource. To do this, simply add the following line to your file:

#$ -l infiniband=TRUE

That's it. OpenMPI should automatically figure out that all of your slaves have access to InfiniBand and will automatically select it as the default communication fabric.

Note: "ib" is an alias for the "infiniband" resource, also (0,1) map to (FALSE,TRUE), so the following also request the infiniband resource:

$# -l ib=1
$# -l infiniband=1
$# -l ib=TRUE

A few MPI programs are very picky about fragmentation (i.e. getting split across too many different nodes). If your job is picky about fragmentation, you can use the orte_8 parallel environment, which ensures you get 8 CPUs per node for your job (i.e. whole nodes). This should be avoided if possible, because it may result in long wait times before your job is scheduled.

A final note: running large parallel jobs can be a very particular task and often requires extra testing and care. For instance, you cannot always assume that your program will simply run faster if you run it across more slaves. In most cases, MPI-enabled programs perform better and better as you add more slaves, but only up to a point; beyond that point, the jobs will actually start performing worse. It is good practice to run several instances of your MPI job with different numbers of slaves the first time and compare performance, e.g. run jobs with 2, 4, 8, 16, 32, 64, 128 and 256 slaves and see how the performance scales.

Interested in enabling MPI in your program? The following links might be helpful:

Job usage

Use qstat to find out about a job's status and resources while it is pending or running. See the manual for qstat:

man qstat

Running qstat without any parameters returns a list of the current user's pending and running jobs. The most typical usage involves calling qstat with a job number:

qstat -j <jobnumber>

For a job that has finished running, you will need to use qacct.
qacct examples