CoSMoS Cluster

The CoSMoS cluster consists of 12 nodes, each with two four-core Xeon E5520 processors (16 hardware threads in total) and 12 GiB of RAM. The machines run Ubuntu 12.04 LTS with a minimal set of packages installed. Home directories are shared between machines via NFS and mounted at /pi/home. The NFS mount is not designed for high I/O throughput, so I/O-intensive jobs should use local storage instead. Temporary local storage is available in /tmp.
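For example, an I/O-intensive job might stage its data through local storage rather than working directly on the NFS mount (a sketch; the program and file names are placeholders):

# Stage data through fast local storage instead of the NFS home directory.
WORKDIR=$(mktemp -d /tmp/myjob.XXXXXX)
cp ~/data/input.dat "$WORKDIR"
cd "$WORKDIR"
~/my-program input.dat > output.dat
cp output.dat ~/results/
rm -rf "$WORKDIR"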

To gain access to the cluster contact Carl Ritson (c.g.ritson@kent.ac.uk) or Fred Barnes (f.r.m.barnes@kent.ac.uk).

Login

Log in over SSH to one of the publicly accessible Unix hosts: raptor, myrtle or swallow. From one of the above hosts, log in over SSH to one of the cluster nodes:
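For example (replace <node> with the name of an actual cluster node):

$ ssh raptor
$ ssh <node>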

Submitting Jobs

To manage cluster resources between multiple users (and to prevent nodes from becoming overloaded), all jobs must be submitted via the Grid Engine.

The simplest way to submit a job is to create a shell script that encapsulates it and submit that script with the qsub command. For example, assuming your job is in a script called job.sh:

$ qsub job.sh 

Grid Engine will queue the job and execute it on the first node with sufficient free resources.
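A minimal job.sh might look like the following (a sketch; the #$ lines embed qsub options in the script, and the program name is a placeholder):

#!/bin/bash
#$ -S /bin/bash   # interpret the job script with bash
#$ -cwd           # run in the directory the job was submitted from
#$ -o job.out     # file for standard output
#$ -e job.err     # file for standard error

~/my-simulation   # the actual work; I/O-heavy jobs should work in /tmp (see above)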

Memory Allocation

By default each job has a (soft) memory limit of 1 GiB. If you know your job requires more or less memory, you can pass the amount as a parameter to qsub. This allows the optimal number of jobs to be scheduled on a node without overloading its memory; overloading memory leads to swap thrashing and significantly reduced performance for every job on the node. For example, requesting 2 GiB of memory (virtual free) can be done as follows:

$ qsub -l vf=2G job.sh
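The same request can also be embedded in the job script itself as a #$ directive, so it does not have to be given on the command line each time:

#!/bin/bash
#$ -l vf=2G   # request 2 GiB of (virtual free) memory
~/my-simulation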

Long Running Jobs

On the standard queue, jobs are limited to 3 days of wall-clock time before they are terminated. Long-running jobs should be avoided as they are more likely to be terminated unexpectedly by hardware failure or maintenance; however, if a long-running job is genuinely required (e.g. a control task which submits other, shorter jobs), it should be submitted to the cosmos.hv.q queue. For example:

$ qsub -q cosmos.hv.q job.sh
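For example, such a control task might itself repeatedly submit shorter jobs (a sketch, assuming qsub is available on the compute nodes; the parameter values are placeholders):

#!/bin/bash
# control.sh: submit one short job per parameter value.
for p in $(seq 1 10); do
    qsub -l vf=2G job.sh "$p"   # arguments after the script name are passed to job.sh
done

The control task itself would then be submitted with qsub -q cosmos.hv.q control.sh.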

File Sizes and Heavy Jobs

To reduce the risk of runaway jobs filling the NFS filesystem, the standard job queue has a hard file size limit of 10 GiB. If your job needs to create files bigger than 10 GiB, please run it in the cosmos.hv.q queue. For example:

$ qsub -q cosmos.hv.q job.sh

The cosmos.hv.q queue should also be used for other heavyweight jobs (many threads, lots of memory or high I/O) which might otherwise interfere with other jobs running on the same node. This allows heavy jobs to be scheduled differently and gives everyone better performance.
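Queue selection and resource requests can be combined in a single submission; for example, a heavy job that also needs 8 GiB of memory (the figure is illustrative):

$ qsub -q cosmos.hv.q -l vf=8G job.sh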

Array Jobs

To simplify running many iterations of a job with different parameters, the Grid Engine provides array jobs. For example, to run job.sh 100 times:

$ qsub -t 1-100 job.sh

For each instance the environment variable SGE_TASK_ID is set to the task number, i.e. 1 through 100 in this example; the job script can use this to select its parameters, as in the sketch below. For more information on array jobs see: http://wiki.gridengine.info/wiki/index.php/Simple-Job-Array-Howto

Alternatively consult the qsub manual page: http://gridscheduler.sourceforge.net/htmlman/htmlman1/qsub.html
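A common pattern is for the job script to read SGE_TASK_ID and use it to pick one line of a parameter file (a sketch; params.txt and the program name are placeholders):

#!/bin/bash
# Read the parameter line corresponding to this task's number (1-100).
PARAMS=$(sed -n "${SGE_TASK_ID}p" params.txt)
~/my-simulation $PARAMS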

Viewing Jobs

Use the qstat command to check the current status of your jobs.

A list of all jobs (for all users) can be viewed with:

$ qstat -u '*'
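To see the full details of a particular job, pass its number to qstat:

$ qstat -j <job-number>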

The current load of the cluster can be viewed with:

$ qhost

For more information see the qstat manual page: http://gridscheduler.sourceforge.net/htmlman/htmlman1/qstat.html

Modifying Jobs

Once submitted, jobs can be removed with the qdel command:

$ qdel <job-number>

Jobs can also be held (qhold), released (qrls) or altered (qalter). Please see the relevant manual pages for more information.
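For example, to hold a queued job, release it again, and raise its memory request while it is still waiting (the value is illustrative):

$ qhold <job-number>
$ qrls <job-number>
$ qalter -l vf=4G <job-number>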

Support

For help with the cluster or using grid engine please contact Carl Ritson (c.g.ritson@kent.ac.uk) or Fred Barnes (f.r.m.barnes@kent.ac.uk).

Acknowledgements

If you are a third party using the cluster, please consider putting an appropriate acknowledgement in papers or other outputs to which use of the cluster contributed. A suitable stock sentence would be: "We acknowledge the support of concurrency researchers at Kent for access to the `CoSMoS' cluster, funded by EPSRC grants EP/E049419/1 and EP/E053505/1."
