Documentation/Where can I find Sun Grid Engine Info

NOTE: This page is out of date. 

Rocks Cluster:


Gaia is our only Rocks Cluster (see http://www.rocksclusters.org).

A short primer on using Sun Grid Engine (within the Rocks cluster context) is available at:

http://www.rocksclusters.org/roll-documentation/sge/4.2/using-sge.html

Here is a guide on how to create a script for qsub:

http://www.it.uu.se/datordrift/maskinpark/albireo/gridengine.html
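
For reference, here is a minimal sketch of a serial submission script. The job name and the program it runs are placeholders, and the -cwd/-j directives are just common conveniences; adjust them to your needs.

#!/bin/sh
# Tell SGE which shell to run the script with (see note (1) below)
#$ -S /bin/sh
# Run the job from the directory it was submitted from
#$ -cwd
# Give the job a name and merge stderr into stdout
#$ -N myjob
#$ -j y

hostname
date
./my_program

Save it as, for example, myjob.scr and submit it with "qsub myjob.scr".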



Some of the relevant SGE commands are listed below (for further information, see the man pages on either cluster, Gaia or Titan):

To launch the graphical interface for Sun Grid Engine (which I rarely use):

qmon

To submit a job to SGE batch queues, use:

qsub submit.scr

To check the status of the jobs submitted to SGE, use:

qstat -f

To kill a job submitted to SGE, use the job ID reported by qsub and qstat:

qdel job-ID

To kill all jobs submitted by user imelvin, run:

qdel -u imelvin

A couple of things to note:

(1) You have to tell SGE which shell to use. For instance, the line #$ -S /bin/sh must be present in the script file submitted to SGE. It is not enough to just have the standard #!/bin/sh at the top of the file.

(2) An error of the form:

rm_4664: p4_error: semget failed for setnum: 1

means that the maximum number of allowed semaphores on the master node has been reached and the program you are trying to run cannot allocate a new semaphore for inter-process communication. This often happens if you have been running code that did not exit correctly, leaving semaphores (and probably shared memory segments) allocated.

You can run "ipcs" to see what inter-process communication (IPC) objects and data structures (shared memory, message queues, and semaphores) you have allocated on the head node. To see the same for all the cluster nodes, you can run "cluster-fork ipcs".

If leftover semaphores or shared memory segments are owned by you, you can remove them by running:

For the master node: /opt/mpich/gnu/sbin/cleanipcs
For the compute nodes: cluster-fork /opt/mpich/gnu/sbin/cleanipcs

BTW: The "ipcs" command will only list IPC objects and data structures owned by you. You have to run the command as root to get a listing for all users.
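
If the cleanipcs helper is not available, leftover IPC objects can also be removed by hand with ipcrm. A rough sketch is below; the IDs shown are hypothetical and should be taken from the ipcs output, and older versions of ipcrm use the syntax "ipcrm sem <id>" rather than the -s flag.

# List your semaphore sets and shared memory segments on the head node
ipcs -s
ipcs -m

# Remove one leftover semaphore set and one shared memory segment by ID
ipcrm -s 123456
ipcrm -m 123457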

Example of submitting a job on the cluster:

[hans@frontend-0 hans]$ qsub test.scr
your job 338449 ("test.scr") has been submitted

[hans@frontend-0 hans]$ qstat -u hans
job-ID  prior name       user         state submit/start at     queue      master ja-task-ID
---------------------------------------------------------------------------------------------
 338449     0 test.scr   hans         qw    02/10/2006 18:03:12

[hans@frontend-0 hans]$ qstat -u hans
job-ID  prior name       user         state submit/start at     queue      master ja-task-ID
---------------------------------------------------------------------------------------------
 338449     0 test.scr   hans         t     02/10/2006 18:03:15 compute-0- MASTER

[hans@frontend-0 hans]$ qstat -u hans
job-ID  prior name       user         state submit/start at     queue      master ja-task-ID
---------------------------------------------------------------------------------------------
 338449     0 test.scr   hans         r     02/10/2006 18:03:16 compute-0- MASTER

[hans@frontend-0 hans]$ cat test.scr.out
compute-0-7.local
Fri Feb 10 18:03:15 EST 2006
end of qsub script
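
The test.scr script itself is not shown above; judging from the output in test.scr.out, it could have looked roughly like this (the -o and -j directives are assumptions made to match the output file name):

#!/bin/sh
#$ -S /bin/sh
#$ -cwd
# Write stdout (and stderr) to test.scr.out
#$ -o test.scr.out
#$ -j y

hostname
date
echo "end of qsub script"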

Message Passing:

The Rocks Cluster installation supports LAM-MPI and MPICH; a sketch of an MPICH submission script follows the links below.

The public domain version of MPI, MPICH, is maintained by Argonne National Laboratory, which provides a great deal of MPI information and examples on its web site:

MPI: http://www-unix.mcs.anl.gov/mpi/

http://www-unix.mcs.anl.gov/mpi/tutorial/index.html

MPICH: http://www-unix.mcs.anl.gov/mpi/mpich/

LAM-MPI: http://www.lam-mpi.org
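
As a rough sketch, an MPICH job can be submitted through SGE with a script along these lines. The parallel environment name "mpich", the mpirun path, and my_mpi_program are assumptions; check the locally configured parallel environments with "qconf -spl".

#!/bin/sh
#$ -S /bin/sh
#$ -cwd
# Request 4 slots from the MPICH parallel environment
#$ -pe mpich 4

# SGE fills in $NSLOTS; the Rocks mpich PE normally writes the
# allocated hosts to $TMPDIR/machines
/opt/mpich/gnu/bin/mpirun -np $NSLOTS -machinefile $TMPDIR/machines ./my_mpi_program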

Two nice tutorials, "Introduction to MPI" and "Intermediate MPI", are available from WebCT-HPC, the web-based training site for High Performance Computing sponsored by the National Center for Supercomputing Applications (NCSA). To access the courses (which can be downloaded as PDF files), you first need to register at:

http://foxtrot.ncsa.uiuc.edu:8900/webct/public/home.pl

Available courses:

http://foxtrot.ncsa.uiuc.edu:8900/webct/public/show_courses.pl


In the event that you have any questions or concerns, please open a ticket with C2B2 Systems Management by sending email to rt@c2b2.columbia.edu.