Documentation/HPC cluster
Contents
Overview of the cluster
The primary, general purpose compute cluster at C2B2 is now called "HPC." Titan, our cluster since 2008, has been replaced with this newer, more robust system. HPC is a Linux-based (CentOS 6.5) compute cluster consisting of 528 HP blade systems, two large-memory (1TB) servers, two head nodes and a virtualized pool of login (submit) nodes. The nodes fit in a dense configuration in 12 high-density racks and are cooled by dedicated rack refrigeration systems.
Each blade node has twelve cores (two hex-core processors) and either 32GB (480 nodes) or 96GB (48 nodes) of memory, providing 6,336 compute cores and 20 TB of total RAM. There are also two nodes with 1TB of memory and 16 physical cores each (32 with hyper-threading). Each node has a 10 Gbps Ethernet connection to the cluster network, and 32 of the nodes (256 CPU cores) are additionally linked with 40 Gbps QDR InfiniBand. Finally, a set of login nodes running Citrix XenServer virtualization provides a pool of virtual login nodes for user access to this and other systems.
Like our previous clusters, the system is controlled by Sun Grid Engine (SGE), but with some important differences in configuration from the previous clusters (detailed below).
Storage for the cluster is provided exclusively by our Isilon clustered filesystem, which offers multiple (presently 36) 10 Gbps points of access for NFS storage, a scalable configuration, and over 1 PB of total capacity.
Accessing the system
Access to HPC is provided through a pool of 5 login nodes. Simply SSH to login.c2b2.columbia.edu and log in with your ARCS domain login and password. You will automatically be placed on one of our login nodes. Please note that no computation should take place on the login nodes; they are intended only as a gateway to our storage and computing systems. Queueing jobs is discussed below.
Using the storage
HPC exclusively mounts the Isilon filesystem (/ifs). The /ifs filesystem is broken up into four sections: /ifs/home, /ifs/data, /ifs/scratch and /ifs/archive. Users should have folders in at least home and scratch, and most will have folders in data and archive as well. /ifs/home is where your login home directory lives. It has regular filesystem snapshots and backups. /ifs/scratch is where most cluster computing should happen. It has no snapshots or backups, but provides high-speed storage for computing. /ifs/data is a middle ground between home and scratch. It has no regular snapshots, but it is backed up once per month to provide disaster recovery. This area is intended for items like computationally generated databases that need to be backed up, but are also written to by the cluster. Finally, /ifs/archive provides low-cost archival storage for data that is no longer active. /ifs/archive space is accessible from the login nodes but not from the compute nodes on the cluster.
!!!IMPORTANT!!! Because cluster-generated data should not usually be backed up, and filesystem snapshotting does not cope well with the large volume of filesystem changes that cluster computing tends to produce, /ifs/home should never be used as a directory for cluster computing. With this in mind, /ifs/home is mounted read-only on the cluster backend nodes. One of the most common early problems users have when running on HPC is accidentally trying to run out of the /ifs/home directory. Jobs that try to do this will go into "error" state and will not run.
You can see your filesystem quotas by using `df -h <your_directory>` while on the login nodes. Your directory (or your lab's directory in some cases) presents itself as an independent disk drive with a maximum size equal to your quota. Note that this will not work on the compute nodes (e.g. through qrsh).
There are some important guidelines for storage use that are specific to cluster computing:
- Again, /ifs/home is read-only on the compute cluster. Your analysis tasks should be done in /ifs/scratch. Results that need to be backed up, as well as resource databases, can be stored in /ifs/data. We don't allow write access to /ifs/home, and we discourage writing to /ifs/data from the cluster, because the interaction with snapshots affects the performance of the Isilon storage system.
- Do not store excessive numbers of files in a single directory. In general, storing more than a few thousand files in a directory can start to noticeably degrade storage performance. Storing tens of thousands of files or more can seriously damage overall performance, and often requires significant time and effort to rectify. If you need to store a large number of files, either use archival techniques (e.g. the tar or cpio commands) or break the files up into a tiered directory structure.
- Storing a large number of very small files can take up much more space than you expect. Because of the way large clustered filesystems work, a 4KB file (for instance) will actually occupy roughly 128KB of space. Small files can easily be archived into a single large file using the `tar` command to save space and reduce filesystem clutter.
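The two guidelines above can be sketched as follows; this is a minimal, hypothetical example (the file and directory names are made up):

```shell
# (1) Archive a directory of many small result files into one tarball,
#     then remove the originals to save space and reduce clutter:
tar -czf results.tar.gz results/ && rm -rf results/

# (2) Spread files across subdirectories keyed by the first two hex
#     characters of a hash of the filename, so no single directory
#     accumulates too many entries:
f="sample_0001.dat"
sub=$(printf '%s' "$f" | md5sum | cut -c1-2)
mkdir -p tiered/"$sub"
mv "$f" tiered/"$sub"/
```

To inspect an archive later, `tar -tzf results.tar.gz` lists its contents and `tar -xzf results.tar.gz` extracts it.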
Temporary Files -- new for HPC cluster
Some programs create and manipulate small temporary files during computation (see, for instance, Running compiled Matlab jobs below). In some special cases it may be more efficient to use the local disk on the node instead of the scratch space. On the HPC cluster there is now about 400GB of space on each cluster node for local temporary files. You can put files in the directory named by the $TMPDIR environment variable. The $TMPDIR directory is automatically set up by SGE when a job starts and automatically removed when the job finishes.
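For example, a hypothetical job script might stage its intermediate files in $TMPDIR and copy back only the final result (the input/output filenames here are illustrative, not real):

```shell
#!/bin/bash
#$ -l mem=1G,time=:30: -cwd -S /bin/bash
# Do heavy temporary I/O on the node-local disk via $TMPDIR (created by
# SGE at job start, deleted automatically when the job ends).
WORK=${TMPDIR:-/tmp}/job_work        # fall back to /tmp outside SGE
mkdir -p "$WORK"
sort input.txt > "$WORK/sorted.txt"  # intermediate file stays on local disk
cp "$WORK/sorted.txt" ./result.txt   # keep only the final result
```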
For a more detailed description of storage and its use, see Storage architecture.
Finding and running applications
We support a limited number of applications on the cluster, such as Java, Perl, Matlab and R. While we will try to maintain current versions of these applications, we cannot manage the modules/classes that specific applications need. In general, if you need specific modules you should compile them into your home, data or scratch directory (or perhaps into your lab's "share" directory). Applications can be found in /nfs/apps. These applications are available only on the backend nodes. We do not recommend using the versions included in the cluster operating system (e.g. /usr/bin/perl), as those are intended for running the cluster itself and may change without warning.
If there is an application that you believe should be provided in /nfs/apps, you can send a request to rt <at> c2b2.columbia.edu; however, we will generally only agree to maintain applications that will be widely used by multiple labs.
Queueing and SGE
HPC is controlled by Sun Grid Engine (SGE). There are several aspects of queuing that are unique to the HPC configuration:
Memory management
We have had consistent problems in the past with cluster jobs running nodes out of memory and thereby crashing the nodes. This is problematic because not only does the user's job die, but so does everyone else's job on the same node. This becomes an even greater problem on HPC nodes, where up to 12 jobs may be running on the same node: one user overflowing memory could cause 11 other people's work to fail.
In order to solve this problem, the queuing system requires that all jobs be submitted with hard memory limit values. This allows the queuing system to ensure that all jobs end up on nodes that can support their memory requirements, and to enforce that jobs stay within the bounds they advertise. It is important to note that if you try to exceed the limit you requested, your job will die. This prevents your job from crashing the node and disrupting the work of others.
Memory limits can be specified on the command line or through a qsub script by supplying the option `-l mem=<memory_spec>`, where <memory_spec> is of the form 2G, 512M, etc. While you do not want to cut your memory estimate too close (to avoid having your job die), you also do not want to request significantly more memory than you will use: it will be harder to get your job through the queue, since SGE has to find a slot with sufficient memory available.
If you do not specify the `-l mem=<memory_spec>` option, your job will run with a default 1GB memory limit. If your job requires more memory than this, it will die without completing.
We recommend that you give yourself a 25% - 50% margin when selecting memory requirements, to make sure your jobs do not die unnecessarily. The scheduler is built to compensate for these over-estimates. As a result, even if you know that your job will use exactly 2GB of memory, you should request 2.5GB or 3GB.
Time based queuing
A time-based queuing system is being implemented that limits the number of slots available to a job based on the job's runtime limit. This takes the place of the hard limit on the number of jobs a user can run in our present system. Similar to the memory settings, the time limit of a job can be given either on the command line or in a qsub script using the `-l time=<hh:mm:ss>` option, e.g. time=1:15:30 (1h, 15m, 30s) or time=:20: (20m).
The time queuing system works by splitting the cluster into different time-limit queues. The size and limits of the queues are optimized based on cluster usage. The queues have names like "std_6h.q", indicating that the maximum runtime for a job in that queue is 6 hours (21,600 seconds). You need not worry about telling the system which queue to use: if you specify the `-l time=<hh:mm:ss>` parameter, the queuing system will automatically find the queues your job fits in. There are presently seven regular queues (plus special queues for interactive, InfiniBand, and GPU jobs): std_6h.q, std_12h.q, std_1d.q, std_2d.q, std_1w.q, std_2w.q and std_1m.q. For example, a job with a 16-hour limit will automatically be placed in the shortest (time) queue available to it, std_1d.q.
This table illustrates the current bin sizes (this may change over time):
bin | max runtime | CPUs available
---|---|---
std_6h | 6 hr | 396
std_12h | 12 hr | 660
std_1d | 24 hr | 792
std_2d | 48 hr | 924
std_1w | 1 week (168 hr) | 792
std_2w | 2 weeks (336 hr) | 660
std_1m | 1 month (672 hr) | 396
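Putting the memory and time requests together, a minimal submission script might look like the following sketch (the program name is a placeholder); with a 16-hour limit, the scheduler would place this job in std_1d.q automatically:

```shell
#!/bin/bash
#$ -S /bin/bash -cwd -N my_analysis
#$ -l mem=2G,time=16::    # 2GB hard memory limit, 16-hour runtime limit
./my_analysis             # placeholder for your program
```

Submit it with `qsub my_analysis.sh`.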
As you use the system, you will begin to notice that your waiting jobs have a "priority" value assigned to them. As the name suggests, this value helps the queuing system decide in which order jobs will run. Multiple variables determine the priority of any particular job. They can be broken down as follows:
- Functional scheduling policy: The functional scheduling policy tries to enforce an equal distribution of jobs among users at any given time. Each user is equally weighted, and as users enter the queue, priorities are adjusted to distribute jobs fairly.
- Sharetree scheduling policy: While the functional scheduling policy enforces a fair distribution of jobs at any instance, share tree queuing ensures that over a longer period of time usage remains fair. This means that a user who has been running a lot of jobs over, say, the last week, will be given lower priority than a user who is submitting jobs for the first time in a while.
- Urgency: Urgency scheduling adjusts priority based on the resources requested and the length of time the job has been waiting. In general, jobs that request more resources are given higher priority, to compensate for the difficulty of scheduling larger resource requests. Additionally, the urgency policy makes a job's priority rise over time, in an attempt to ensure that no job has to wait too long compared to other jobs.
- User priority: Using the "-p" option, users can set a priority that affects only their own jobs. The user priority helps control the order in which a particular user's jobs run.
Putting it together: an example session
We are aware that users may not know exactly how long their program is going to run or how much memory it will need. When this is the case, we recommend that you run the program a small number of times, asking for more resources than you expect to use, and then use the `qacct -j <job_id>` command to determine the resources the program typically requires. qacct is available on the login nodes (currently not available in interactive sessions). For example, if you have a program called `workhard` that you would like to use on the cluster, the following might be a reasonable workflow for getting it running successfully:
1. qlogin to an interactive node to make sure the program works on the cluster:
$ ssh user@login.c2b2.columbia.edu
$ qlogin -l mem=2G,time=:20:
$ cd /ifs/scratch/c2b2/my_lab/user
$ ./workhard
Program returned successfully.
$ exit
2. Submit a qsub job requesting larger resources than expected:
-- BEGIN: workhard.sh --
#!/bin/bash
#$ -l mem=4G,time=1:: -S /bin/bash -N workhard -j y -cwd
./workhard
-- END: workhard.sh --
And run:
$ qsub workhard.sh
Your job 144 ("workhard.sh") has been submitted
<wait for job to complete>
$ qacct -j 144
start_time   Wed May  6 13:34:34 2009
end_time     Wed May  6 13:56:04 2009
maxvmem      1.05G
3. Now I know that my program uses a little more than 1G of memory and runs for about 25 minutes, so I can adjust the requests and submit multiple jobs:
-- BEGIN: workhard.sh --
#!/bin/bash
#$ -l mem=1.2G,time=:30: -S /bin/bash -N workhard -j y -cwd
./workhard
-- END: workhard.sh --
And run:
$ qsub -t 1-40 workhard.sh
4. Interactive session running Matlab.
SSH to a login node:
-bash-4.1$ hostname
login3.hpc
As you can see, we are on login node 3. Let's use the qrsh command to start an interactive session for up to 30 minutes with 4G of RAM.
-bash-4.1$ qrsh -l mem=4G,time=:30:
Once we are scheduled on a node, we are returned to a prompt. Let's check that we were successful.
-bash-4.1$ hostname
ha5c6n8.hpc
-bash-4.1$ qstat
job-ID   prior    name     user    state  submit/start at      queue              jclass  slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------
8629565  1.00500  QRLOGIN  fs2307  r      05/16/2017 12:53:47  int.q@ha5c6n8.hpc          1
As you can see, we are now on one of the cluster nodes, no longer on a login node, and we have an interactive session running with job ID 8629565. If the session crashes and we suspect the cause might be memory related, we can use the `qacct -j 8629565` command to see memory usage and exit status. Now let's load the Matlab module (you can instead set the path manually, or call the version of Matlab you want by its full path).
-bash-4.1$ module load matlab/2016a
-bash-4.1$ which matlab
/nfs/apps/matlab/2016a/bin/matlab
Let's start Matlab.
-bash-4.1$ matlab -nodisplay -nojvm -singleCompThread

                            < M A T L A B (R) >
                  Copyright 1984-2016 The MathWorks, Inc.
                   R2016a (9.0.0.341360) 64-bit (glnxa64)
                             February 11, 2016

For online documentation, see http://www.mathworks.com/support
For product information, visit www.mathworks.com.

                            Academic License
At this point I can type commands and see their results.
>> echo on;
>> x=1;
>> disp(x)
     1
>> y=x+2;
>> disp(y)
     3
>> exit;
I have exited Matlab. Let's make sure I am still in the interactive session.
-bash-4.1$ hostname
ha5c6n8.hpc
OK, I am still there. To end the session now, I can just type exit.
-bash-4.1$ exit
logout
-bash-4.1$ hostname
login3.hpc
We are back on the login node. Let's see how many resources we used:
qacct -j 8629565
==============================================================
qname         int.q
hostname      ha5c6n8.hpc
group         sysops
owner         fs2307
project       sysops
department    sysops
jobname       QRLOGIN
jobnumber     8629565
taskid        undefined
account       sge
priority      0
qsub_time     Tue May 16 12:52:36 2017
start_time    Tue May 16 12:53:47 2017
end_time      Tue May 16 12:57:45 2017
granted_pe    NONE
slots         1
failed        0
exit_status   0
ru_wallclock  238
ru_utime      6.807
ru_stime      1.094
ru_maxrss     213492
ru_ixrss      0
ru_ismrss     0
ru_idrss      0
ru_isrss      0
ru_minflt     151599
ru_majflt     532
ru_nswap      0
ru_inblock    355648
ru_oublock    5336
ru_msgsnd     0
ru_msgrcv     0
ru_nsignals   0
ru_nvcsw      7924
ru_nivcsw     8281
cpu           7.901
mem           3.946
io            0.042
iow           0.000
maxvmem       1.388G
arid          undefined
jc_name       NONE
Looking at maxvmem, we can see that we never used more than about 1.388G of RAM. So for this very basic calculation, requesting 4G was a bit too much; 2G would probably have been safe.
more qacct examples
Common Problems when migrating from older SGE clusters (titan)
Some scripts may point to an application location using the "current" symbolic link, for example:
/nfs/apps/matlab/current
This may resolve to a different version of the application and cause the script to fail. In general, for scripts that need to be run multiple times over a long period of time, and for pipeline jobs, we recommend not using "current"; always use the path to a specific version of the application, for example:
/nfs/apps/matlab/2013b
When a new version of an application becomes available, "current" is switched to the latest release, which can break older scripts. Don't use "current" unless your script is expected to work with a changing version of the application and should always use the latest one.
The issue with loading .bash_profile and .bashrc that some users had in the first wave of migration has been fixed. For all queues, .bash_profile and .bashrc are now loaded automatically, as they were on Titan, so having -l on the first line of a submission script is no longer required.
Final notes
Please offer feedback on the cluster and we will work to make it as reliable, powerful and usable as possible.
To request access to HPC, please send a ticket to rt <at> c2b2.columbia.edu, or through the web interface at https://rt.c2b2.columbia.edu
Additional notes & examples
Module system and Available Software
See Module System
Access to different GCC on CentOS 7
Multiple versions of GCC are available on CentOS 7. To access them, use the devtoolset system.
The following devtoolsets are currently installed:
devtoolset-3 - gcc 4.9.2-6
devtoolset-4 - gcc 5.3.1-6
devtoolset-6 - gcc 6.3.1-3
devtoolset-7 - gcc 7.3.1-5
devtoolset-8 - gcc 8.2.1-3
To enable a specific toolset, use the scl command. See the following example:
$ gcc --version | grep GCC
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)
$ scl enable devtoolset-6 bash
$ gcc --version | grep GCC
gcc (GCC) 6.3.1 20170216 (Red Hat 6.3.1-3)
Access to old CentOS 6 nodes
To run your job on the old CentOS 6 nodes, you have to add
-l centos6=1
to your job submission parameters.
PS: Only a few CentOS 6 nodes remain, so wait times for them are longer, and certain memory requests may not be possible (for example, there are no nodes with more than 256G of RAM running CentOS 6).
Running an array job
Many analysis tasks involve running the same program repeatedly with slightly different parameters. In these cases, it is much easier, both for the user and for the queuing system, to submit all of these jobs as one array job. Array jobs submit the same script a specified number of times and increment the environment variable SGE_TASK_ID for each instance. Array jobs appear as a single job with many different tasks.
In order to submit an array job, you need to supply the array option ('-t') to your qsub, e.g.
$ qsub -t <num_tasks> job.sh
where <num_tasks> is the number of instances of this job that should be run. Each task will be run with one of the integer values between 1 and <num_tasks> (note: you can be more specific about how SGE iterates over this variable, e.g. only having odd task IDs. see `man qsub` for details). It is up to your script to decide how to handle that variable.
There are many ways to use the SGE_TASK_ID variable, but one very common task is simply to run the same program with multiple different sets of arguments. Here is an example (which can easily be generalized) of how you might run an array job.
Suppose you have several different sequences, seq-A.fa, seq-B.fa, seq-C.fa, etc.. You want to use blastp against a database "some_db" for each of these sequences. Rather than submit these all individually, here's a quick and easy way to submit an array job.
Assume that everything lives in the same directory for convenience.
1. Create a file that contains one sequence filename per line, e.g.:
ls seq-* > sequence-files
2. Create the following script called "blast.sh":
#!/bin/bash
# Setup qsub options
#$ -l mem=1G,time=1:: -cwd -S /bin/bash -N blast
# If SGE_TASK_ID is N, set SEQ_FILE to the Nth line of "sequence-files"
SEQ_FILE=`awk 'NR=='$SGE_TASK_ID'{print}' sequence-files`
# Run blast with $SEQ_FILE as input and $SEQ_FILE.out as output
/nfs/apps/blast/current/bin/blastall -p blastp -d some_db -i $SEQ_FILE -o $SEQ_FILE.out
3. Submit the script with -t specifying the range 1 to the number of lines in "sequence-files":
qsub -t 1-`wc -l < sequence-files` blast.sh
The job will now run a task for each sequence file, outputting results to seq-A.fa.out, seq-B.fa.out, etc.
Running compiled Matlab jobs
Compiled Matlab jobs present a couple of challenges on any large cluster. Matlab uses a cache directory, called the MCR cache, to extract and manage the tools and libraries each executable needs to run. By default this would be created in your home directory; since the home directory is read-only, the cache location needs to be set to a writeable area.
Additionally, running multiple concurrent jobs in the same cache directory can corrupt the cache, causing erratic behavior and failures in the Matlab jobs. To ensure cache coherency, the cache directory should be set to a different location for each job. Because these cache directories can build up, it is important to clean up old ones. You can either direct the cache to, e.g., the /ifs/scratch filesystem, or use the $TMPDIR environment variable, which points to a special directory on the node's local hard disk that is automatically deleted once the job completes.
NOTE: The MCR cache is very small (< 32M) so it is fine to use the $TMPDIR to store the cache; however, HPC is not designed to support larger usage of the $TMPDIR since the local disks are small and slow.
The appropriate environment variables to set are:
MCR_CACHE_SIZE - Sets the maximum size of the cache (should probably be the default, 32M)
MCR_CACHE_ROOT - The location of the MCR cache
The following is an example qsub script for running a compiled matlab job on HPC:
#!/bin/bash
#$ -cwd -S /bin/bash -N matlab_job
export LD_LIBRARY_PATH=/nfs/apps/matlab/current/bin/glnxa64:/nfs/apps/matlab/current/sys/os/glnxa64:$LD_LIBRARY_PATH
export MCR_CACHE_SIZE=32M
export MCR_CACHE_ROOT=$TMPDIR/mcr_cache
mkdir -p $TMPDIR/mcr_cache
/path/to/your/executable <arguments>
Please keep in mind that this example uses /nfs/apps/matlab/current as the base location for Matlab. When running compiled scripts, it is better to specify the version you want to use: as new versions are released, "current" may be switched to point to a newer version, which can break an old script if there are incompatibilities between the versions' libraries.
Running with resource reservations
Some cluster jobs require a very large amount of concurrent resources (e.g. CPUs or RAM) to run, and it may be difficult for the scheduler to find an open resource to run the job. To help these jobs get scheduled in a timely way, SGE uses "resource reservations." A resource reservation puts placeholders in the queueing system to make sure that the necessary resources free up as soon as possible.
It works like this: suppose you want to submit a 32-CPU MPI job (see below for MPI examples). If the cluster is busy, there may not be 32 applicable slots available at any given moment to start the job. Without reservations, the queuing system will just keep trying each cycle to see if 32 suitable CPUs happen to become available. In practice, for large jobs in a busy system, this will result in almost indefinite wait times.
With the resource reservations enabled, the queuing system will pre-decide which slots the job is going to run in (based on an internal optimization algorithm). Some of these slots may have currently running jobs in them, and some may be free. The queuing system will make sure that nothing occupies the reserved slots in such a way as to delay the start time of the waiting job.
So, say, your 32-CPU job is waiting on 16 currently occupied slots. The queuing system (because of our time request policy) knows about how long each of the running jobs will last. If it knows that the last of the 16 currently running jobs will free up in 30 minutes, it will make sure no new jobs will be scheduled in the reserved slots that will finish later than 30 minutes from now (Though it may schedule short running jobs in reserved slots that will finish in 30 minutes or less. This is a process known as "backfilling.").
Because computing resource reservations is a computationally intensive task for the scheduler, reservations are not turned on by default. In general, it is only necessary for jobs that require exceptional amounts of CPUs or RAM. To enable resource reservations, simply add the "-R y" option to your qsub, e.g.
$ qsub -R y job.sh
Running an SMP (multi-threaded) job
When you request a slot on HPC you are allocated a single CPU core. SMP jobs run on multiple CPU cores and therefore require more resources than a single slot allocates. In order to correctly run a multi-threaded job on HPC you need to use a parallel environment. The parallel environment for SMP jobs is called "smp." You can request an SMP job using the '-pe' option when you qsub your job. The syntax for the '-pe' option is:
-pe smp <number_of_threads>
You can pass this option either through the command line like this:
qsub -pe smp 4
which would allocate enough cores for a 4 threaded job. Or, you can specify the same option in your qsub file like:
#$ -pe smp 4
The theoretical maximum number of threads you can request on HPC is 12, since the HPC nodes have 12 cores each. Because all of the cores must be allocated on the same node, requesting 12 cores requires an entire node to be free, which may take longer to queue. In some cases it may be more efficient to use fewer threads. We advise our users to request 8 or fewer threads!
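One common pitfall is a mismatch between the threads a program starts and the slots SGE granted. A small sketch (the program name and its --threads flag are hypothetical) is to read the $NSLOTS environment variable, which SGE sets to the number of granted slots:

```shell
#!/bin/bash
#$ -S /bin/bash -cwd -N smp_job
#$ -pe smp 4
#$ -l mem=2G,time=2::
# $NSLOTS is set by SGE to the slot count granted by "-pe smp".
THREADS=${NSLOTS:-1}    # default to 1 when run outside the queuing system
./my_threaded_prog --threads "$THREADS"    # hypothetical program and flag
```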
Running an OpenMPI job
This section is under construction. MPI (Message Passing Interface) is a method for running parallel jobs across multiple nodes. This gets past the problem mentioned in the SMP section of having to allocate all of the resources on a single node, and as such can scale in some cases to hundreds or thousands of cores for a single job. However, your program must be specifically written to use MPI.
The most common form of MPI these days is OpenMPI, and current versions of OpenMPI are supported on HPC. In order to use OpenMPI you will need to use a parallel environment. The parallel environment for OpenMPI is called "orte." As in the SMP case, the syntax is:
-pe orte <number_of_slaves>
where <number_of_slaves> is the number of cores requested for the MPI job.
In addition to running the job in the appropriate parallel environment, you need to run your job with mpirun. You need to set up your library path for OpenMPI and run the appropriate mpirun command. The MPI libraries are located in:
/nfs/apps/openmpi/current/lib
(current is a symbolic link to the latest openmpi version).
To set up your execution, library and man paths in your script environment, you need to source the following shell script
. /nfs/apps/openmpi/current/setenv.sh
in your job script.
SGE has "tight integration" with OpenMPI, so you shouldn't need to pass any parameters to mpirun. It will automatically know how many slaves it can run and where to run them.
As an example, say you want to run your program "mpiprog" on 32 cores. You might use the following script:
#!/bin/bash
#$ -cwd -S /bin/bash -N mpiprog
#$ -l mem=1G,time=6::
#$ -pe orte 32
. /nfs/apps/openmpi/current/setenv.sh
mpirun mpiprog
This will dispatch a 32 core OpenMPI job on the cluster.
Note: your memory resource request is the memory per slave, not the total memory of your job. The above example will therefore allocate a total of 32GB of memory across the cluster.
If your MPI job is "communications bound" (i.e. passes many large messages or many very quick messages), it may be to your advantage to use the InfiniBand-enabled section of the cluster. InfiniBand is a very high-speed, low-latency interconnect that OpenMPI is optimized to use. There are a total of 256 cores available that are linked with InfiniBand.
In order to use the InfiniBand portion of the cluster you will want to specify that your job needs the infiniband resource. To do this, simply add the following line to your file:
#$ -l infiniband=TRUE
That's it. OpenMPI should automatically figure out that all of your slaves have access to InfiniBand and will automatically select it as the default communication fabric.
Note: "ib" is an alias for the "infiniband" resource, also (0,1) map to (FALSE,TRUE), so the following also request the infiniband resource:
$# -l ib=1 $# -l infiniband=1 $# -l ib=TRUE
A few MPI programs are very picky about fragmentation (i.e. getting split across too many different nodes). If your job is picky about fragmentation, you can use the orte_8 parallel environment, which ensures you get 8 CPUs per node (i.e. whole nodes). This should be avoided if possible, because it may result in long wait times for your job to be scheduled.
A final note: running large parallel jobs can be a very particular task and often requires extra testing and care. For instance, you cannot assume that your program will simply run faster if you run it across more slaves. In most cases, MPI-enabled programs perform better and better as you add slaves, up to a point; beyond that point, jobs actually start performing worse. It is good practice, the first time, to run several instances of your MPI job with different numbers of slaves and compare performance, e.g. run jobs with 2, 4, 8, 16, 32, 64, 128 and 256 slaves and see how performance scales.
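One way to run such a scaling test, sketched here with hypothetical names (mpiprog.sh is assumed to be a qsub script like the one above), is to submit the same job at several widths and compare runtimes afterwards with qacct:

```shell
# Submit the same MPI job at increasing slave counts to measure scaling.
for n in 2 4 8 16 32 64 128 256; do
    qsub -pe orte "$n" -N "mpiprog_scale_$n" mpiprog.sh
done
# After the jobs finish, compare ru_wallclock via: qacct -j <job_id>
```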
Interested in enabling MPI in your program? The following links might be helpful:
- The official OpenMPI website
- A nice intro to parallel computing posted by LLNL
- An MPI programming intro posted by LLNL
- There is a book, "Using MPI" by Gropp, Lusk and Skjellum (ISBN: 978-0262571326), that provides a good introduction.
Job usage
Use qstat to find out about a job's status and resources while it is in pending or running status. See the manual for qstat:
man qstat
Running qstat without any parameters will return a list of the current user's pending and running jobs.
The most typical example involves calling qstat with a job number:
qstat -j <jobnumber>
For jobs that have finished running, you will need to use qacct.
qacct examples