The MCC Batch Queueing System
The Monash Campus Cluster (MCC) uses the Sun Grid Engine (SGE) for its batch queueing system. Although the MCC is made up of over 200 servers, the queueing system presents this cluster as a single system, thus hiding the unnecessary details from the user's view. Furthermore, the queueing system keeps track of the available CPU cores and other resources (RAM, etc.), ensuring that calculations from multiple users will run with minimal interference from each other, thus increasing their chances of successful completion. For example, if a server is already packed with running calculations, the queueing system will not assign further calculations to run on it, preventing oversubscription that can lead to poor performance.
What is a job?
On any of the login nodes (e.g.,
msgln6.its.monash.edu.au), users prepare their calculations as "jobs" and submit these to the queue for execution on the cluster. A job is specified as an ASCII text file that contains
- descriptive job information to enable SGE to select one or more appropriate servers to execute; and
- the sequence of commands to run as required by the calculation.
The essential job information are:
- job duration - the maximum time (rounded to the nearest hour) this job will take to finish
- job memory resource requirements - the maximum memory (expressed in GB) for your calculation to run successfully
- job CPU resource requirements - the number of CPU cores (integer value) needed by the calculations
It is important that accurate job information is provided, otherwise, the job may fail to complete. For example, if the job duration is set too short, the calculation will be terminated by the queueing system before the calculation completes.
A job may also contain these optional information:
- job name - useful when running multiple jobs, so users can tell them apart (must be a string with no white spaces)
- email notification - users may choose to receive an email when the job starts or finishes
- job output filename - specifies the file where the
stdoutwill be redirected to
- job error filename - specifies the file where the
stderrwill be redirected to
The best practice on the MCC is to submit all calculations to the queueing system. The login nodes are provided as staging areas for job preparation: copying data files, compiling your own programs, preparing job files, etc. It is also acceptable to test your programs before submitting to the queue, but please be aware that there are usually many other users logged into the front nodes (e.g.,
Job Control Commands
Key to effective use of an HPC cluster is understanding its command-line interface (CLI) for job submission and management. Essentially, you need to know how to (a) submit jobs; (b) monitor their progress; and (c) delete jobs as necessary.
qsub - submit jobs for execution
The qsub command is used as follows to submit a short simple job to the first available execute node (note that CTRL-D means using the CTRL and D keys together to terminate the input):
The standard output and standard error files will be written inside the "
example" directory. Unless you are running a large number of jobs, it is a good practice to place the job file along with the associated data files and executable files (if applicable), within a separate folder. This makes it easier to manage the results from different jobs.
The lines that starts with #$ specifies the job information that will be used by SGE to determine which server is best for this calculation.
#$ -S shellfullpath
specifies the shell used, the typical setting is:
#$ -S /bin/sh
#$ -l h_vmem=_G
specifies the maximum memory for each process of the job:
#$ -l h_vmem=2G
This means that each process created by your code will be guaranteed 2G of Virtual Memory. However if your calculation exceeds this amount, the scheduler may kill the process. You can not request more memory than exists on the compute server, i.e. you can not exceed the physical memory (RAM) of a machine, as swapping to disk is very inefficient.
#S -l h_rt=hh:mm:ss
specifies the total time for the entire job, not the combined CPU time of all processes in the job
for example, a 24-hour job is specified with
#$ -l h_rt=24:00:00
if the calculations do not complete in 24 hours, the job will be terminated by SGE
This command is also useful to keep files in the same folder where you submitted your job.
specifies that the current working directory where the job was submitted is to be used for input and output files
in the example above, the cwd is "
Output of your jobs can be directed to specific output files:
#$ -o stdout_filename
specifies the filename for the standard output, for example,
#$ -e stderr_filename
specifies the filename for the standard error, for example,
#$ -j y
specifies that standard error will also be directed to the file for the standard output
In addition, email notification is also possible.
#$ -M emailaddr
specifies the email address for the notification:
#$ -M email@example.com
#$ -m options
specifies email notification for the job (a - abort, b - begin, e - end):
#$ -m abe
If your job requires multiple CPU cores, you will need to specify a parallel environment. If you did not specify any parallel environment, the system assumes your job uses one CPU core. For parallel jobs, it is essential to specify the "
#$ -pe" setting so that the scheduler will assign sufficient CPU resources to your job.
We will define some terms to assist in the understanding of running of parallel jobs.
|Core||A core is an independent processing unit in a microprocessor. A modern processor will contain many cores, i.e. 4 core, 6, core etc. See http://en.wikipedia.org/wiki/Multi-core_processor for more information.|
|Processor/Socket||A 'processor' or 'socket' refers to a discrete chip on a computer motherboard.|
|Computer/Machine/Node||A computer running an Operating System. Many computers in the MCC have multiple processors, each with multiple cores, so that a single computer can have many cores. The largest machines on the MCC have 64 cores/node, but these are partner-shares and not available for general use. The computers in the MCC are connected via an Ethernet network, although some specialized machines will have fast optical interconnect.|
There are two parallel environments,depending upon whether you want your job to run within a single machine, or across several computers:
#$ -pe smp #
specifies CPU cores on a single machine
#$ -pe smp 4
The line above requests that four CPU cores be assigned to the job. The majority of the compute servers on MCC have eight CPU cores, so choosing the following will usually guarantee job execution.
#$ -pe smp 8
In the SGE scheduler, an environment variable NSLOTS is set to indicate how many cores you have been allocated.
See openMP for an example.
#$ -pe orte_adv #
specifies CPU cores across multiple machines
#$ -pe orte_adv 16
This assigns one or more compute nodes to provide the job with a total of 16 CPU cores. The scheduler will automatically find the best available configuration to run your jobs. For example, if there is an available server with 16 CPU cores, then this job will be assigned to run on that server. If what is available are two 8-core servers, then these will be used to run your job.
See MPI for an example of its use.
qstat & qdel - the status of your submitted job/s & job deletion
The snippet of interaction below depicts a user "
jsmith" using the
qdel commands. On submission of a job, the system assigns a unique Job ID, in this case,
4077633. Use this job ID as argument to the "
qdel" command when deleting the job. Use the
qdel command with care, as deletion cannot be undone. If a running job is deleted before it completes its sequence of commands, you may need to resubmit the job and let it restart from the beginning.
AMD or Intel CPU
proc" option to select the CPUs.
#$ -l proc=amd
specifies that this job is to run on an AMD Opteron server
#$ -l proc=intel
specifies that this job is to run on an Intel Xeon server
The default stack limit for jobs is 128MB. If your jobs need a higher stack size limit, you will need to specify a maximum stack limit as follows:
#$ -l h_stack=256M
Request for 256MB limit for stack. You will need to specify via "ulimit -s" within your job script.
#$ -l gpu=1
specifies that this job needs a single GPU card
#$ -l gpu=2
specifies that this job needs two GPU cards
Manual Queue Selection for Partner Share
There are many compute servers on the MCC which were procured through a co-funding partnership with research groups. These research groups have prioritised access to these servers. From the time we deployed these servers, users from these research groups submit their jobs to a specific queue. This is specified as follows:
#$ -q queuename
For our partners, there are specific queues and please contact us for further information
General users are not advised to set a queue for their jobs, as this can delay when their jobs start.
Array Jobs - parameterisation
Alternatively, you can set this inside the job.sh file:
This will effectively create 10 sub-jobs, each job will have a different value for
$SGE_TASK_ID automatically generated from 1 to 10. Further information is available at: http://wiki.gridengine.info/wiki/index.php/Simple-Job-Array-Howto
$SGE_TASK_ID. To use this for multiple-valued parameters, you will need some head and tail operation on a text file containing parameters. Say you have two parameters for job, x and y, and they are stored in a file called
So you need to simply extract the line using the $SGE_TASK_ID from LIST.
then run the program with:
Note that the setting:
applies for each job and not for the entire lot. The maximum number of sub-jobs per job is 75,000.
Comparison of SGE and PBS options
Monash users may have access to other compute clusters than use a different queueing system, e.g. MASSIVE, VLSCI and the NCI. One such system is the Portable Batch System (PBS). Here is a brief comparison of SGE and PBS options. Remember to look at the documentation of the system you are using, as the commands may vary depending upon local configuration.
|identifies the hard queue for the job, NOT recommended|
specifies a name for the job, must start with a letter and
contain letters and numbers (no spaces)
|join the stdout and stderr to the same output file|
|specifies a filename for the stdout|
|specifies a filename for the stderr|
|use the current working directory for inputs and outputs|
-pe parenv #
|specifies the parallel environment|
-l resource list
specifies memory per process per node
specifies wall time limit (time from start to finish of the job)
specifies the CPU type (AMD or Intel) for the job
-l resource list
Scheduling and Resource Allocation
Watch this space as a new scheduler is now going to come online soon.
- No labels