The Queueing System

 

The MCC Batch Queueing System

The Monash Campus Cluster (MCC) uses the Sun Grid Engine (SGE) for its batch queueing system. Although the MCC is made up of over 200 servers, the queueing system presents this cluster as a single system, thus hiding the unnecessary details from the user's view. Furthermore, the queueing system keeps track of the available CPU cores and other resources (RAM, etc.), ensuring that calculations from multiple users will run with minimal interference from each other, thus increasing their chances of successful completion. For example, if a server is already packed with running calculations, the queueing system will not assign further calculations to run on it, preventing oversubscription that can lead to poor performance.

What is a job?

On any of the login nodes (e.g., msgln6.its.monash.edu.au), users prepare their calculations as "jobs" and submit these to the queue for execution on the cluster. A job is specified as an ASCII text file that contains

  • descriptive job information to enable SGE to select one or more appropriate servers on which to execute the job; and
  • the sequence of commands to run as required by the calculation.

The essential job information is:

  • job duration - the maximum time (rounded to the nearest hour) this job will take to finish
  • job memory resource requirements - the maximum memory (expressed in GB) for your calculation to run successfully
  • job CPU resource requirements - the number of CPU cores (integer value) needed by the calculations

It is important to provide accurate job information; otherwise, the job may fail to complete. For example, if the job duration is set too short, the queueing system will terminate the calculation before it completes.

A job may also contain the following optional information:

  • job name - useful when running multiple jobs, so users can tell them apart (must be a string with no white spaces)
  • email notification - users may choose to receive an email when the job starts or finishes
  • job output filename - specifies the file where the stdout will be redirected to
  • job error filename - specifies the file where the stderr will be redirected to

The best practice on the MCC is to submit all calculations to the queueing system. The login nodes are provided as staging areas for job preparation: copying data files, compiling your own programs, preparing job files, etc. It is also acceptable to test your programs there before submitting to the queue, but please be aware that many other users are usually logged into the login nodes (e.g., msgln4 and msgln6).

Job Control Commands

Key to effective use of an HPC cluster is understanding its command-line interface (CLI) for job submission and management. Essentially, you need to know how to (a) submit jobs; (b) monitor their progress; and (c) delete jobs as necessary.

qsub - submit jobs for execution

The qsub command is used as follows to submit a short simple job to the first available execute node (note that CTRL-D means using the CTRL and D keys together to terminate the input):

jsmith@msgln6> mkdir example
jsmith@msgln6> cd example
jsmith@msgln6 ~/example> cat > simple.sh
#!/bin/sh
#$ -S /bin/sh
#$ -l h_vmem=1G
#$ -l h_rt=10:00:00
#$ -cwd

/bin/hostname
CTRL-D

jsmith@msgln6 ~/example> qsub simple.sh

The standard output and standard error files will be written inside the "example" directory. Unless you are running a large number of jobs, it is a good practice to place the job file along with the associated data files and executable files (if applicable), within a separate folder. This makes it easier to manage the results from different jobs.
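
If you prefer not to type the file interactively, the same job file can be created with a here-document, which needs no CTRL-D. This is only a sketch of an alternative, not a required step; the qsub line is commented out so that the snippet can be tried on any machine.

```shell
# Create the working directory and move into it.
mkdir -p example
cd example || exit 1

# Write the job file; the quoted 'EOF' stops the shell from expanding anything.
cat > simple.sh <<'EOF'
#!/bin/sh
#$ -S /bin/sh
#$ -l h_vmem=1G
#$ -l h_rt=10:00:00
#$ -cwd

/bin/hostname
EOF

# On a login node, submit the job with:
# qsub simple.sh
```

The resulting simple.sh is identical to the one typed in with cat above.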

The lines that start with #$ specify the job information that SGE uses to determine which server is best for this calculation.

Command template and meaning:

  • #$ -S shellfullpath - specifies the shell used; the typical setting is #$ -S /bin/sh
  • #$ -l h_vmem=_G - specifies the maximum memory for each process of the job, for example, #$ -l h_vmem=2G. This means that each process created by your code is guaranteed 2 GB of virtual memory; however, if your calculation exceeds this amount, the scheduler may kill the process. You cannot request more memory than exists on the compute server, i.e., you cannot exceed the physical memory (RAM) of a machine, as swapping to disk is very inefficient.
  • #$ -l h_rt=hh:mm:ss - specifies the total elapsed time for the entire job, not the combined CPU time of all processes in the job. For example, a 24-hour job is specified with #$ -l h_rt=24:00:00; if the calculations do not complete in 24 hours, the job will be terminated by SGE.

The following setting is also useful for keeping files in the folder from which you submitted your job.

  • #$ -cwd - specifies that the current working directory where the job was submitted is to be used for input and output files. In the example above, the cwd is "example" and hence the job outputs will be generated inside this folder.

Output of your jobs can be directed to specific output files:

  • #$ -o stdout_filename - specifies the filename for the standard output, for example, #$ -o outfile.txt
  • #$ -e stderr_filename - specifies the filename for the standard error, for example, #$ -e outfile.err
  • #$ -j y - specifies that standard error will also be directed to the file for the standard output

Email notification is also possible:

  • #$ -M emailaddr - specifies the email address for the notification, for example, #$ -M philip.chan@monash.edu
  • #$ -m options - specifies when an email notification is sent for the job (a - abort, b - begin, e - end), for example, #$ -m abe
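
Putting the settings above together, a job file might look like the following minimal sketch. The job name (set with -N, which is listed in the comparison of SGE and PBS options further down this page), the output file names and the email address are placeholders chosen for illustration, and "uname -n" simply stands in for a real calculation.

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -N demo_job
#$ -l h_vmem=2G
#$ -l h_rt=01:00:00
#$ -cwd
#$ -o demo_job.out
#$ -e demo_job.err
#$ -M jsmith@example.com
#$ -m abe

# The commands of the calculation go below the settings.
# Here we only record which machine the job ran on.
NODE=$(uname -n)
echo "Job running on $NODE"
```

Because the #$ lines are ordinary comments to the shell, the file can also be run directly on a login node as a quick syntax check before it is submitted with qsub.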

Parallel Jobs

If your job requires multiple CPU cores, you will need to specify a parallel environment. If you do not specify one, the system assumes your job uses one CPU core. For parallel jobs, it is essential to specify the "#$ -pe" setting so that the scheduler assigns sufficient CPU resources to your job.

 

The following terms help in understanding how parallel jobs run.

  • Core - an independent processing unit in a microprocessor. A modern processor contains several cores, e.g., 4 or 6. See http://en.wikipedia.org/wiki/Multi-core_processor for more information.
  • Processor/Socket - a discrete chip on a computer motherboard.
  • Computer/Machine/Node - a computer running an operating system. Many computers in the MCC have multiple processors, each with multiple cores, so that a single computer can have many cores. The largest machines on the MCC have 64 cores per node, but these are partner shares and not available for general use. The computers in the MCC are connected via an Ethernet network, although some specialised machines have a fast optical interconnect.

 

There are two parallel environments, depending upon whether you want your job to run within a single machine or across several computers:

  • #$ -pe smp # - specifies CPU cores on a single machine. For example, #$ -pe smp 4 requests that four CPU cores be assigned to the job. The majority of the compute servers on MCC have eight CPU cores, so choosing #$ -pe smp 8 will usually guarantee job execution. The SGE scheduler sets the environment variable NSLOTS to indicate how many cores you have been allocated. See openMP for an example.
  • #$ -pe orte_adv # - specifies CPU cores across multiple machines. For example, #$ -pe orte_adv 16 assigns one or more compute nodes to provide the job with a total of 16 CPU cores. The scheduler will automatically find the best available configuration to run your job: if a server with 16 CPU cores is available, the job will be assigned to run on that server; if two 8-core servers are available instead, these will be used. See MPI for an example of its use.
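
As a rough illustration of how NSLOTS can be used in a single-machine (smp) job, the sketch below passes the allocated core count to an OpenMP program through the standard OMP_NUM_THREADS variable. The program name is hypothetical, and the fallback to one core is an assumption added so that the script also runs outside the scheduler.

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -pe smp 4
#$ -l h_vmem=2G
#$ -l h_rt=01:00:00
#$ -cwd

# SGE sets NSLOTS to the number of allocated cores.
# Fall back to 1 when the script is run by hand, outside the scheduler.
NSLOTS=${NSLOTS:-1}

# Tell the OpenMP runtime to use exactly the allocated cores.
OMP_NUM_THREADS=$NSLOTS
export OMP_NUM_THREADS

echo "Running with $OMP_NUM_THREADS thread(s)"
# ./my_openmp_program   # hypothetical executable
```

Reading the count from NSLOTS rather than hard-coding it keeps the thread count and the "#$ -pe smp" request in step when one of them is changed.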

qstat & qdel - the status of your submitted job/s & job deletion

The snippet of interaction below depicts a user "jsmith" using the qsub, qstat and qdel commands. On submission of a job, the system assigns a unique job ID, in this case 4077633. Use this job ID as the argument to the "qdel" command when deleting the job. Use the qdel command with care, as deletion cannot be undone. If a running job is deleted before it completes its sequence of commands, you may need to resubmit the job and let it restart from the beginning.

Example of qdel
jsmith@msgln6 ~/example> qsub simple.sh 
Your job 4077633 ("simple.sh") has been submitted

jsmith@msgln6 ~/example> qstat
job-ID  prior   name       user      state submit/start at     queue         slots ja-task-ID 
---------------------------------------------------------------------------------------------
4077633 0.00000 simple.sh  jsmith     qw   10/31/2013 11:04:20                 1        

jsmith@msgln6 ~/example> qstat
job-ID  prior   name       user      state submit/start at     queue         slots ja-task-ID 
---------------------------------------------------------------------------------------------
4077633 0.00000 simple.sh  jsmith      r   10/31/2013 11:15:20 aqu8@gn101...   1   

jsmith@msgln6 ~/example> qdel 4077633
jsmith has deleted job 4077633

jsmith@msgln6 ~/example> qstat
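
In scripts it can be handy to capture the job ID for later use with qstat or qdel. The following is a minimal sketch, assuming the submission message has the form shown above, where the ID is the third word. The message is hard-coded here so that the snippet runs anywhere; on a login node it would come from qsub itself.

```shell
# On the cluster this would be: msg=$(qsub simple.sh)
msg='Your job 4077633 ("simple.sh") has been submitted'

# The job ID is the third whitespace-separated word of the message.
JOB_ID=$(printf '%s\n' "$msg" | awk '{print $3}')

echo "Submitted job $JOB_ID"

# Later, if the job must be removed:
# qdel "$JOB_ID"
```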

Advanced Settings

AMD or Intel CPU

The MCC is a heterogeneous cluster with both AMD Opteron and Intel Xeon compute servers. If you have optimised your job to run on a specific CPU, you can use the "proc" option to select the CPU type:

  • #$ -l proc=amd - specifies that this job is to run on an AMD Opteron server
  • #$ -l proc=intel - specifies that this job is to run on an Intel Xeon server

Stack Limit

The default stack limit for jobs is 128 MB. If your jobs need a higher stack size limit, you will need to specify a maximum stack limit as follows:

  • #$ -l h_stack=256M - requests a 256 MB limit for the stack. You will also need to set the limit via "ulimit -s" within your job script.
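
As a sketch of how the two settings fit together, the job below requests the limit from SGE and then raises it inside the script. It assumes that "ulimit -s" takes its value in kilobytes, as it does in bash and dash; the program name is hypothetical.

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -l h_stack=256M
#$ -cwd

# Convert the requested 256 MB into kilobytes for ulimit.
STACK_MB=256
STACK_KB=$((STACK_MB * 1024))

# Raise the stack limit; warn rather than abort if the system refuses.
ulimit -s "$STACK_KB" 2>/dev/null || echo "warning: could not set stack limit" >&2

echo "Stack limit is now: $(ulimit -s)"
# ./my_program   # hypothetical executable
```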

GPU

The MCC has ten compute nodes with GPU capability. Specifically, each of these nodes has two Tesla C1060 GPGPU cards installed. To use these, you will need to provide an extra resource setting in your job:

  • #$ -l gpu=1 - specifies that this job needs a single GPU card
  • #$ -l gpu=2 - specifies that this job needs two GPU cards

Manual Queue Selection for Partner Share

Many compute servers on the MCC were procured through a co-funding partnership with research groups, which have prioritised access to these servers. Since these servers were deployed, users from these research groups have submitted their jobs to a specific queue. This is specified as follows:

  • #$ -q queuename - submits the job to the named queue. There are specific queues for our partners; please contact us for further information.

General users are advised not to set a queue for their jobs, as this can delay the start of their jobs.

Array Jobs - parameterisation 

Often users need to run the same algorithm using a number of different input parameters. Monash has a tool, Nimrod, that automates many of the steps involved in running large parametric jobs. See http://messagelab.monash.edu.au/Nimrod.

Another approach is to use a built-in feature of the SGE system called array jobs. The idea is that a single qsub command generates a list of jobs from a given range of integers. This is done via the -t parameter of qsub. For example, to submit 10 instances of a job (assuming your job script is called job.sh), use this command:

 

Example of an array job
 qsub -t 1-10 job.sh

Alternatively, you can set this inside the job.sh file:

 

Array job via a SGE command
 #$ -t 1-10

This will effectively create 10 sub-jobs, each with a different value of $SGE_TASK_ID, automatically generated from 1 to 10. Further information is available at: http://wiki.gridengine.info/wiki/index.php/Simple-Job-Array-Howto

The only parameter passed to each sub-job is the integer value of $SGE_TASK_ID. To use this for multiple-valued parameters, you will need some head and tail operations on a text file containing the parameters. Say you have two parameters for your job, x and y, stored in a file called LIST:

 

Example of a parameter file LIST
 1.0 110
 2.0 210
 3.0 430
   :      

So you need to simply extract the line using the $SGE_TASK_ID from LIST.

 

How to create multiple parameter variables
PARAMS=$(head -n "$SGE_TASK_ID" LIST | tail -n 1)

then run the program with:

 

Run command
program.exe $PARAMS
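
The steps above can be combined into one job file, sketched below under a few assumptions: the LIST file is written by the script only to keep the example self-contained (in practice it would already exist in the submission folder), the task ID defaults to 2 when the script is run outside SGE, and the program name is hypothetical. The sketch uses sed to pick the line, which is equivalent to the head and tail combination.

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -t 1-3
#$ -l h_vmem=1G
#$ -l h_rt=01:00:00

# Sample parameter file, created here only so the sketch is self-contained.
cat > LIST <<'EOF'
1.0 110
2.0 210
3.0 430
EOF

# SGE sets SGE_TASK_ID for each sub-job; default to 2 when run by hand.
SGE_TASK_ID=${SGE_TASK_ID:-2}

# Pick the line whose number matches the task ID.
PARAMS=$(sed -n "${SGE_TASK_ID}p" LIST)

# Split the line into the two parameters x and y.
set -- $PARAMS
x=$1
y=$2

echo "task $SGE_TASK_ID: x=$x y=$y"
# ./program.exe "$x" "$y"   # hypothetical program
```

Splitting the line into named variables makes it easier to pass the values as separate options or to use them in output file names, so that sub-jobs do not overwrite each other's results.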

Note that the setting:

 

SGE Options
 #$ -l h_rt=hh:mm:ss

applies to each sub-job and not to the array as a whole. The maximum number of sub-jobs per job is 75,000.

Comparison of SGE and PBS options

Monash users may have access to other compute clusters that use a different queueing system, e.g. MASSIVE, VLSCI and the NCI. One such system is the Portable Batch System (PBS). Here is a brief comparison of SGE and PBS options. Remember to look at the documentation of the system you are using, as the commands may vary depending upon local configuration.

Each entry gives the SGE option (#$), its description, and the corresponding PBS option (#PBS).

  • -q queue_name - identifies the hard queue for the job, NOT recommended (PBS: -q queue_name)
  • -N name - specifies a name for the job, which must start with a letter and contain letters and numbers (no spaces) (PBS: -N name)
  • -j y - joins the stdout and stderr to the same output file (PBS: -j oe)
  • -o output - specifies a filename for the stdout (PBS: -o output)
  • -e error - specifies a filename for the stderr (PBS: -e error)
  • -cwd - uses the current working directory for inputs and outputs (PBS: -l wd)
  • -pe parenv # - specifies the parallel environment (PBS: -l ncpus= or -l ppn=)
  • -l resource list - specifies resources, as in the next three entries (PBS: -l resource list)
  • -l h_vmem=nG - specifies memory per process per node (PBS: -l mem=)
  • -l h_rt=hh:mm:ss - specifies wall time limit (time from start to finish of the job) (PBS: -l walltime=)
  • -l proc=(amd|intel) - specifies the CPU type (AMD or Intel) for the job

Scheduling and Resource Allocation

Watch this space: a new scheduler will come online soon.
