Documentation/Titan cluster


Overview of the cluster

Titan compute cluster

The primary, general purpose compute cluster at C2B2 is called "Titan." Titan is a Linux-based compute cluster consisting of 464 HP blade systems, two head nodes and a virtualized pool of login (submit) nodes. The nodes fit in a dense configuration in five high-density racks and are cooled by dedicated rack refrigeration systems.

Each node has eight cores (two quad-core processors) and either 16GB (430 nodes) or 32GB (34 nodes) of memory. This provides 3,712 compute cores and 8 TB of total RAM. Each node has a 1 Gbps ethernet connection to the cluster network, and each group of 16 nodes is aggregated through a 10 Gbps ethernet link to a core switch. Additionally, 32 nodes (256 CPU cores) are linked with 20 Gbps DDR InfiniBand. Two head nodes are set up redundantly to keep the cluster operational despite hardware failures. Finally, a set of login nodes running Citrix XenServer virtualization provides a pool of virtual login nodes for user access to this and other systems.

Like our previous clusters, the system is controlled by Sun Grid Engine (SGE), but with some important configuration differences from those clusters (detailed below).

Storage for the cluster is provided exclusively by our Isilon clustered filesystem, which offers multiple (presently 18) 1 Gbps NFS access points, a scalable configuration, and nearly 1 PB of total capacity.

“Titan” debuted as the 124th fastest computer in the world in November of 2008.

Accessing the system

Access is provided to Titan through a pool of login nodes. Simply SSH to login.c2b2.columbia.edu and log in with your ARCS domain username and password. You will automatically be placed on one of our login nodes, where you will be able to submit jobs, monitor tasks and start interactive sessions.
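
For example (replace "user" with your own ARCS username), logging in from your workstation looks like:

$ ssh user@login.c2b2.columbia.edu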

Please note, no computation should take place on the login nodes. The login nodes are intended as a gateway resource to our storage and computing systems only.

Using the storage

Titan exclusively mounts the Isilon filesystem (/ifs). The /ifs filesystem is broken up into four sections: /ifs/home, /ifs/data, /ifs/scratch and /ifs/archive. Users should have folders in at least home and scratch, and most will have folders in data and archive as well.

  • /ifs/home is where your login home directory lives. It has regular filesystem snapshots and backups.
  • /ifs/scratch is where most cluster computing should happen. It has no snapshots or backups, but provides high-speed storage for computing.
  • /ifs/data is a middle ground between home and scratch. It has no snapshots, but it is backed up once per month to provide disaster recovery. This area is intended for items like computationally generated databases that need to be backed up, but are also written to by the cluster.
  • /ifs/archive provides low-cost archival storage for data that is no longer active. /ifs/archive space is accessible from the login nodes but not from the compute nodes on the cluster.
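
As a concrete sketch (the lab and user names below are placeholders, and your lab's exact layout may differ; the pattern follows the scratch path used in the example session later on this page), a user's working areas typically look like:

/ifs/home/c2b2/my_lab/user     # home directory (snapshots and backups; read-only on compute nodes)
/ifs/scratch/c2b2/my_lab/user  # high-speed working space for cluster jobs
/ifs/data/c2b2/my_lab/user     # monthly-backed-up results and generated databases
/ifs/archive/c2b2/my_lab/user  # archival storage (login nodes only)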

!!!IMPORTANT!!! Because cluster-generated data should not usually be backed up, and filesystem snapshotting does not cope well with the large volume of filesystem changes that cluster computing tends to produce, /ifs/home should never be used as a directory for cluster computing. With this in mind, /ifs/home is mounted read-only on the cluster backend nodes. One of the most common early problems users have when running on Titan is accidentally trying to run out of the /ifs/home directory. Jobs that try to do this will go into "error" state and will not run.

You can see your filesystem quotas by using `$ df -h <your_directory>` while on the login nodes. Your directory (or your lab's directory in some cases) presents itself as an independent disk drive with a maximum size equal to your quota. Note that this will not work on the compute nodes (e.g. through qrsh).

There are some important guidelines for storage use that are specific to cluster computing:

  • Again, /ifs/home is read-only on the compute cluster. Your analysis tasks should be done in /ifs/scratch. Results that need to be backed up and resource databases can be stored in /ifs/data.
  • Do not store excessive numbers of files in a single directory. In general, storing more than a few thousand files in a directory can start to noticeably degrade storage performance. Storing tens of thousands of files or more can seriously damage overall performance, and often requires significant time and effort to rectify. If you need to store a large number of files, you can either use archival techniques (e.g. the tar or cpio commands; see the sketch after this list) or break the files up into a tiered directory structure.
  • Storing a large number of very small files can take up much more space than you expect. Because of the way that large clustered filesystems work, a 4KB file (for instance) will really take up roughly 128KB of space. Small files can easily be archived into a single large file using the `tar` command in order to save space and reduce filesystem clutter.
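
As a minimal sketch of the archival approach mentioned above (the directory name results_parts/ is just a placeholder), many small files can be rolled into one compressed archive and extracted again when needed:

# Pack a directory of small files into a single compressed archive
$ tar -czf results_parts.tar.gz results_parts/
# Verify the archive lists cleanly, then remove the originals
$ tar -tzf results_parts.tar.gz > /dev/null && rm -r results_parts/
# Later, unpack everything back out
$ tar -xzf results_parts.tar.gz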

Some programs create and manipulate small temporary files during computation (see, for instance, Running compiled Matlab jobs below). In some special cases, it may be more efficient to use the local disk on the node instead of the scratch space. Local disk space is very limited, so it should only be used for files that total no more than a couple of hundred MB at most. If your program requires a small temporary space on the local disk, you can put files in the directory named by the $TMPDIR environment variable. The $TMPDIR directory is automatically set up by SGE when a job starts and is automatically removed when the job finishes.
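
A minimal sketch of using $TMPDIR from a job script (the program name sort_fragments, its input file, and the scratch path are placeholders):

#!/bin/bash
#$ -l mem=1G,time=1:: -S /bin/bash -cwd
# Stage a small input file onto the node's local disk
cp input.dat $TMPDIR/
# Work against the local copy; $TMPDIR is private to this job
./sort_fragments $TMPDIR/input.dat > $TMPDIR/output.dat
# Copy results back to scratch before the job ends, since $TMPDIR is removed automatically
cp $TMPDIR/output.dat /ifs/scratch/c2b2/my_lab/user/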

For a more detailed description of storage and its use, see Storage architecture.

Finding and running applications

We support a limited number of applications on the cluster, such as Java, Perl, MATLAB and R. While we will try to maintain current versions of these applications, we cannot manage the modules/classes that specific applications need. In general, if you need specific modules you should compile them into your home, data or scratch directory (or perhaps into your lab's "share" directory). Applications can be found in /nfs/apps. These applications are available only on the backend nodes. We do not recommend that you use the versions included in the cluster operating system (e.g. /usr/bin/perl), as those are intended for running the cluster itself and may change without warning.

If there is an application that you believe should be provided in /nfs/apps, you can send a request to rt <at> c2b2.columbia.edu; however, we will probably only agree to maintain applications that will be widely used by multiple labs.

Queuing and SGE

Titan is controlled by Sun Grid Engine (SGE). There are several aspects of queuing that are unique to the Titan configuration:

Memory management

We have had consistent problems in the past with cluster jobs running nodes out of memory and thereby crashing the nodes. This is problematic because not only does the user's job die, but everyone else's jobs on the same node die as well. This becomes an even greater problem on Titan nodes, where 8 jobs may be running on the same node: one user overflowing memory could cause 7 other people's work to fail.

In order to solve this problem, the queuing system requires that all jobs be submitted with hard memory limit values. This allows the queuing system to ensure that all jobs end up on nodes that can support their memory requirements, and also to enforce that jobs stay within the bounds they advertise. It is important to note that if you try to exceed the limit you requested, your job will die. This prevents your job from causing the node to crash and disrupting the work of others.

Memory limits can be specified on the command line or through a qsub script by supplying the option "-l mem=<memory_spec>", where <memory_spec> is of the form, e.g., 2G, 512M, etc. You should not cut your memory estimate so close that your job risks being killed, but you also should not request significantly more memory than you will use: the more you request, the harder it is for SGE to find a slot with sufficient memory available, and the longer your job will wait in the queue.

If you do not specify the "-l mem=<memory_spec>" option, your job will run with a default 1GB memory limit. If your job uses more memory than this, it will die without completing.

We recommend that you give yourself a 25% - 50% margin when selecting memory requirements, to make sure your jobs do not die unnecessarily. The scheduler is built to compensate for these over-estimates. As a result, even if you know that your job will use exactly 2GB of memory, you should request 2.5GB or 3GB.
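
For example (the script name is a placeholder), a job expected to peak at about 2GB of memory could be submitted with a 2.5GB limit either on the command line:

$ qsub -l mem=2.5G workhard.sh

or, equivalently, with a directive inside the script:

#$ -l mem=2.5G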

Time based queuing

Queue time bin diagram

A time-based queuing system is being implemented that limits the number of slots available to a job based on the job's runtime limit; this takes the place of the hard limit on the number of jobs that a user can run. Similar to the memory settings, the time limit of a job can be given either on the command line or in a qsub script using the "-l time=<hh:mm:ss>" option, e.g. time=1:15:30 (1h, 15m, 30s) or time=:20: (20m).

The time queuing system works by splitting the cluster into different time-limit queues. The sizes and limits of the queues are optimized based on cluster usage. The queues have names like "pow18.q", which indicates that the maximum runtime for a job in that queue is 2^18 seconds. Additionally, there is a small queue with no limit called "infinity.q". You need not worry about telling the system which queue to go to: if you specify the "-l time=<hh:mm:ss>" parameter, the queuing system will automatically find the queues you can fit in. As an example, there are presently eight queues (plus special queues for interactive jobs): pow12.q, pow13.q, pow15.q, pow18.q, pow19.q, pow20.q, pow22.q and infinity.q. A job with a 6-hour limit will automatically be placed in the largest queue available to it, pow15.q (the shortest bin whose roughly 9.1-hour limit can accommodate it).
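
For example (the script name is a placeholder), a job expected to finish within 6 hours could be submitted as:

$ qsub -l time=6:: workhard.sh

and the scheduler will automatically place it in the appropriate time bin (pow15.q, per the table below).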

This table illustrates the current bin sizes (this will change over time):

bin       max runtime          CPUs available
pow12     2^12 s ~= 1.1 hr     3680
pow13     2^13 s ~= 2.3 hr     3264
pow15     2^15 s ~= 9.1 hr     2913
pow18     2^18 s ~= 3 d        2229
pow19     2^19 s ~= 6 d        1319
pow20     2^20 s ~= 12 d       859
pow22     2^22 s ~= 49 d       410
infinity  unlimited            35

New scheduling priority algorithm (sharetree + functional)

As you use the system, you will begin to notice that your waiting jobs have a "priority" value assigned to them. As the name suggests, this value helps the queuing system decide which jobs will run in which order. There are multiple policies used to determine the priority of any particular job. They can be broken down as follows:

  • Functional scheduling policy: The functional scheduling policy tries to enforce an equal distribution of resources among users at any given time. Each user is weighted equally, and as users enter the queue, priorities are adjusted to distribute jobs fairly.
  • Sharetree scheduling policy: While the functional scheduling policy enforces a fair distribution of jobs at any instant, sharetree queuing ensures that usage remains fair over a longer period of time. This means that a user who has been running a lot of jobs over, say, the last week will be given lower priority than a user who is submitting jobs for the first time in a while.
  • Urgency: Urgency scheduling adjusts priority based on the resources requested and the length of time the job has been waiting. In general, jobs that request more resources will be given higher priority, to compensate for the difficulty of scheduling larger resource requests. Additionally, the urgency policy makes a job's priority go up over time, in an attempt to ensure that no job has to wait too long compared to other jobs.
  • User priority: Using the "-p" option, users can set a priority that will only affect their own jobs. The user priority helps control which of a particular user's jobs will run in which order.
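
For example (the script names are placeholders; note that in SGE ordinary users can typically only lower their priority below the default of 0, not raise it), you might deprioritize a bulk run relative to your other work:

# Submitted with reduced user priority; will tend to run after your other jobs
$ qsub -p -100 bulk_reprocess.sh
# Submitted at the default user priority (0)
$ qsub urgent_analysis.sh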

Putting it together: an example session

We are aware that users may not know exactly how long their program is going to run or how much memory it will need. When this is the case, we recommend that you run the program a small number of times, asking for more resources than you expect to use, and then use the `qacct -j <job_id>` command to determine the resources the program typically requires. For example, if you have a program called `workhard` that you would like to use on the cluster, the following might be a reasonable workflow for getting it running successfully:

1. qlogin to an interactive node to verify that the program works on the cluster:

$ ssh user@login.c2b2.columbia.edu
$ qlogin -l mem=2G,time=:20:
$ cd /ifs/scratch/c2b2/my_lab/user
$ ./workhard

Program returned successfully.

$ exit

2. submit a qsub job with larger resources than expected:

-- BEGIN: workhard.sh --
#!/bin/bash
#$ -l mem=4G,time=1:: -S /bin/bash -N workhard -j y -cwd
./workhard
-- END: workhard.sh --

And run:

$ qsub workhard.sh
Your job 144 ("workhard.sh") has been submitted
<wait for job to complete> 
$ qacct -j 144
start_time   Wed May  6 13:34:34 2009
end_time     Wed May  6 13:56:04 2009
maxvmem      1.05G

3. Now I know that my program uses a little more than 1G of memory and runs for about 25 minutes. I can now adjust and submit multiple jobs:

-- BEGIN: workhard.sh --
#!/bin/bash
#$ -l mem=1.2G,time=:30: -S /bin/bash -N workhard -j y -cwd
./workhard 
-- END: workhard.sh --

And run:

$ qsub -t 1-40 workhard.sh
NOTE: If you are submitting array jobs (with -t), the new version of SGE has changed the environment variable SGE_TASK_ID to TASK_ID.
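
If you want a script that runs correctly under either variable name, a small defensive sketch like this can bridge the difference:

# Use whichever task-ID variable this version of SGE provides
TASK=${SGE_TASK_ID:-$TASK_ID}
echo "Running task $TASK"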

Final notes

Please offer feedback on the cluster and we will work to make it as reliable, powerful and usable as possible.

To request access to Titan, please send a ticket to rt <at> c2b2.columbia.edu, or through the web interface at https://support.c2b2.columbia.edu/rt/

Additional notes & examples

Running an array job

Many analysis tasks involve running the same program repeatedly with slightly different parameters. In these cases, it is much easier, both for the user and for the queuing system, to submit all of these jobs as one array job. Array jobs submit the same script a specified number of times and increment the environment variable SGE_TASK_ID for each instance. Array jobs appear as a single job with many different tasks.

In order to submit an array job you need to supply the array option ('-t') to your qsub, e.g.

$ qsub -t 1-<num_tasks> job.sh

where <num_tasks> is the number of instances of this job that should be run. Each task will be run with one of the integer values between 1 and <num_tasks> (note: you can be more specific about how SGE iterates over this range, e.g. using only odd task IDs; see `man qsub` for details). It is up to your script to decide how to handle that variable.

There are many ways to use the SGE_TASK_ID variable, but one very common task is simply to run the same program with multiple different sets of arguments. Here is an example (which can easily be generalized) of how you might run an array job.

Suppose you have several different sequences, seq-A.fa, seq-B.fa, seq-C.fa, etc. You want to run blastp against a database "some_db" for each of these sequences. Rather than submitting them all individually, here's a quick and easy way to submit an array job.

Assume that everything lives in the same directory for convenience.

1. Create a file that contains one sequence filename per line, e.g.:

ls seq-* > sequence-files

2. Create the following script called "blast.sh":

#!/bin/bash
# Setup qsub options
#$ -l mem=1G,time=1:: -cwd -S /bin/bash -N blast
# If SGE_TASK_ID is N, set SEQ_FILE to the Nth line of "sequence-files"
SEQ_FILE=`awk 'NR=='$SGE_TASK_ID'{print}' sequence-files`
# Run blast with the $SEQ_FILE as input, and $SEQ_FILE.out as output
/nfs/apps/blast/current/bin/blastall -p blastp -d some_db -i $SEQ_FILE -o $SEQ_FILE.out

3. Submit the script with -t spanning 1 through the number of lines in "sequence-files":

qsub -t 1-`wc -l < sequence-files` blast.sh

The job will now run a task for each sequence file, outputting results to seq-A.fa.out, seq-B.fa.out, etc.

Running compiled Matlab jobs

Compiled Matlab jobs present a couple of challenges when running on any large cluster. Matlab uses a cache directory, called the MCR cache, to extract and manage the tools and libraries that each executable needs to run. By default this would be created in your home directory. Since the home directory is read-only on the compute nodes, the location of the cache needs to be set to a writeable area.

Additionally, running multiple concurrent jobs against the same cache directory corrupts the cache and can cause erratic behavior and failures in the Matlab jobs. To ensure cache coherency, the cache directory should be set to a different location for each job. Because these cache directories can accumulate, it is important to clean up old ones. You can either direct the cache to, e.g., the /ifs/scratch filesystem, or you can use the $TMPDIR environment variable, which points to a special directory on the local hard disk of the node that is automatically deleted once the job completes.

NOTE: The MCR cache is very small (< 32M), so it is fine to use $TMPDIR to store the cache; however, Titan is not designed to support heavier use of $TMPDIR, since the local disks are small and slow.

The appropriate environment variables to set are:

MCR_CACHE_SIZE - The maximum size of the cache (this should probably be left at the default, 32M)

MCR_CACHE_ROOT - The location of the MCR cache

The following is an example qsub script for running a compiled matlab job on Titan:

#!/bin/bash
#$ -cwd -S /bin/bash -N matlab_job
# Make the Matlab runtime libraries available to the compiled executable
export LD_LIBRARY_PATH=/nfs/apps/matlab/current/bin/glnxa64:/nfs/apps/matlab/current/sys/os/glnxa64:$LD_LIBRARY_PATH
# Keep the MCR cache small and point it at a per-job directory on the node's local disk
export MCR_CACHE_SIZE=32M
export MCR_CACHE_ROOT=$TMPDIR/mcr_cache
mkdir -p $TMPDIR/mcr_cache

/path/to/your/executable <arguments>
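
Assuming the script above is saved as matlab_job.sh (the memory and time values below are placeholders to tune for your own program), it can then be submitted with limits given on the command line:

$ qsub -l mem=2G,time=1:: matlab_job.sh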

Running with resource reservations

Some cluster jobs require a very large amount of concurrent resources (e.g. CPUs or RAM) to run, and it may be difficult for the scheduler to find an open resource to run the job. To help these jobs get scheduled in a timely way, SGE uses "resource reservations." A resource reservation puts placeholders in the queueing system to make sure that the necessary resources free up as soon as possible.

It works like this: suppose you want to submit a 32-CPU MPI job (see below for MPI examples). If the cluster is busy, there may not be 32 applicable slots available at any given moment to start the job. Without reservations, the queuing system will just keep trying each cycle to see if 32 suitable CPUs happen to become available. In practice, for large jobs in a busy system, this will result in almost indefinite wait times.

With the resource reservations enabled, the queuing system will pre-decide which slots the job is going to run in (based on an internal optimization algorithm). Some of these slots may have currently running jobs in them, and some may be free. The queuing system will make sure that nothing occupies the reserved slots in such a way as to delay the start time of the waiting job.

So, say, your 32-CPU job is waiting on 16 currently occupied slots. The queuing system (because of our time request policy) knows about how long each of the running jobs will last. If it knows that the last of the 16 currently running jobs will free up in 30 minutes, it will make sure no new jobs will be scheduled in the reserved slots that will finish later than 30 minutes from now (Though it may schedule short running jobs in reserved slots that will finish in 30 minutes or less. This is a process known as "backfilling.").

Because computing resource reservations is a computationally intensive task for the scheduler, reservations are not turned on by default. In general, it is only necessary for jobs that require exceptional amounts of CPUs or RAM. To enable resource reservations, simply add the "-R y" option to your qsub, e.g.

$ qsub -R y job.sh

Running an SMP (multi-threaded) job

When you request a slot on Titan you are allocated a single CPU core. SMP jobs run on multiple CPU cores and therefore require more resources than a single slot provides. In order to correctly run a multi-threaded job on Titan you need to use a parallel environment. The parallel environment for SMP jobs is called "smp." You can request an SMP job using the '-pe' option when you qsub your job. The syntax for the '-pe' option is:

-pe smp <number_of_threads>

You can pass this option either through the command line like this:

qsub -pe smp 4

which would allocate enough cores for a 4-threaded job. Alternatively, you can specify the same option in your qsub file like:

#$ -pe smp 4

The maximum number of threads you can request on Titan is 8, since the Titan nodes have 8 cores each. Because all of the cores must be allocated on the same node, requesting 8 cores requires an entire node to be free, which may take longer to queue. In some cases, it may be more efficient to use fewer threads.

Finally, it's important to note that your memory request (-l mem=<mem>) is per slot, i.e. per thread. If you are running a 4-threaded job that requires a total of 8GB of RAM, you should use:

-pe smp 4 -l mem=2G
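
Putting the pieces together, a minimal sketch of a 4-thread SMP job script (the program name threaded_prog and its --threads flag are hypothetical; the time limit is a placeholder):

#!/bin/bash
#$ -cwd -S /bin/bash -N smp_job
#$ -pe smp 4
#$ -l mem=2G,time=4::
# 4 slots x 2G per slot = 8GB total, all on one node
# $NSLOTS is set by SGE to the number of slots granted by the parallel environment
./threaded_prog --threads $NSLOTS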

Running an OpenMPI job

MPI (Message Passing Interface) is a method for running parallel jobs across multiple nodes. This gets past the problem mentioned in the SMP section of having to allocate all of the resources on a single node, and as such can scale in some cases to hundreds or thousands of cores for a single job. However, your program must be specifically written to use MPI.

The most common MPI implementation these days is OpenMPI, and current versions of OpenMPI are supported on Titan. In order to use OpenMPI you will need to use a parallel environment. The parallel environment for OpenMPI is called "orte." As in the SMP case, the syntax is:

-pe orte <number_of_slaves>

where <number_of_slaves> is the number of cores requested for the MPI job.

In addition to running the job in the appropriate parallel environment, you need to launch your program with mpirun, which means setting up your library path for OpenMPI and calling the appropriate mpirun command. The MPI libraries are located in:

/opt/OFED/current 

(current is a symbolic link to the latest OFED version).

To set up your library path you need to add

export LD_LIBRARY_PATH=/opt/OFED/current/lib64:$LD_LIBRARY_PATH

to your script.

The mpirun executable can be found in:

/opt/OFED/current/bin

SGE has "tight integration" with OpenMPI, so you shouldn't need to pass any parameters to mpirun. It will automatically know how many slaves it can run and where to run them.

As an example, say you want to run your program "mpiprog" on 32 cores. You might use the following script:

#!/bin/bash
#$ -cwd -S /bin/bash -N mpiprog
#$ -l mem=1G,time=6::
#$ -pe orte 32
# Make the OpenMPI libraries available to mpirun and the MPI processes it launches
export LD_LIBRARY_PATH=/opt/OFED/current/lib64:$LD_LIBRARY_PATH
/opt/OFED/current/mpi/gcc/openmpi-1.4.2/bin/mpirun mpiprog

This will dispatch a 32 core OpenMPI job on the cluster.

Note: your memory resource request is the memory per slave, not the total memory of your job. The above example will allocate a total of 32GB of memory across the cluster.

If your MPI job is "communications bound" (i.e. it passes many large messages or many very rapid messages), it may be to your advantage to use the InfiniBand-enabled section of the cluster. InfiniBand is a very high-speed, low-latency interconnect that OpenMPI is optimized to use. There are a total of 256 cores available that are linked with InfiniBand.

In order to use the InfiniBand portion of the cluster you will want to specify that your job needs the infiniband resource. To do this, simply add the following line to your file:

#$ -l infiniband=TRUE

That's it. OpenMPI should automatically figure out that all of your slaves have access to InfiniBand and will automatically select it as the default communication fabric.

Note: "ib" is an alias for the "infiniband" resource, also (0,1) map to (FALSE,TRUE), so the following also request the infiniband resource:

$# -l ib=1
$# -l infiniband=1
$# -l ib=TRUE

A few MPI programs are sensitive to fragmentation (i.e. being split across too many different nodes). If your job is sensitive to fragmentation, you can use the orte_8 parallel environment, which ensures you get 8 CPUs per node (i.e. whole nodes). This should be avoided if possible, because it may result in long wait times before your job is scheduled.
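
For example (a sketch only; the slave count is a placeholder), a job that must occupy whole nodes could request:

# 32 slaves allocated as whole nodes (4 nodes x 8 cores each)
#$ -pe orte_8 32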

A final note: running large parallel jobs can be a very particular task and often requires extra testing and care. For instance, you cannot assume that your program will simply run faster if you run it across more slaves. In most cases, MPI-enabled programs perform better and better as you add slaves, but only up to a point; beyond that point, the jobs will actually start performing worse. It's good practice, the first time, to run several instances of your MPI job with different numbers of slaves and compare performance, e.g. run jobs with 2, 4, 8, 16, 32, 64, 128 and 256 slaves and see how the performance scales.
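
A minimal sketch of such a scaling sweep (assuming a job script mpiprog.sh like the one above, but without a hard-coded -pe line so the slave count can be set from the command line):

# Submit the same MPI job at several slave counts; compare runtimes afterwards with qacct
for n in 2 4 8 16 32 64 128 256; do
    qsub -N mpiprog_$n -pe orte $n mpiprog.sh
done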

Interested in enabling MPI in your program? The OpenMPI project website (https://www.open-mpi.org/) has documentation and tutorials that may be helpful.
