School of Computing Hydra Cluster

Introduction

Hydra is the School of Computing's High Performance Computing cluster. It consists of a number of CPU resources and a small number of GPU resources. We use the Slurm Workload Manager to manage jobs within the cluster.

The resources are split into two partitions (think of these as groups of machines): one for CPU-only jobs, and one for jobs that require a GPU. The GPU partition can also be used for CPU jobs with permission.

The GPU partition contains four servers. Two of them each have 2x Intel Gold 6136 CPUs running at 3.00GHz, giving a total of 24 cores (48 threads) per server, 256GB of RAM, and 2x Nvidia Tesla P100 GPUs. There's also a server with an Nvidia TITAN V, and another with an Nvidia Tesla K40 GPU.

The CPU partition contains 17 servers each with 2x Intel Xeon E5520 CPUs running at 2.27GHz. Each server has 8 cores (16 threads), and between 12GB and 24GB of RAM. These servers are much older but can still provide a decent amount of processing power by virtue of having many cores in total. The partition also contains a few servers running within our cloud; we may increase or decrease the number of these during the year to make better use of spare resources within the cloud.

Requesting access

Access is currently open to staff and research postgraduate students within the School of Computing. You will need to contact us to get access and to discuss your requirements.

Getting started

The Slurm Quick Start User Guide provides a good tutorial for getting started with Slurm. We won't replicate its contents here, but rather we'll detail the specifics of our setup. It's well worth reading through it, and possibly the rest of the Slurm user documentation, along with the notes on this page.

To perform these steps you'll need to be able to log in to myrtle or raptor using SSH. If you've not done this before then you'll first need to set a password. Then you can log in using an SSH client, for example PuTTY on Windows. If you're using a Mac, Linux, or another Unix-like system with a command-line ssh client, you can simply type the following, replacing login with your own username, and host with either myrtle or raptor as appropriate.

ssh login@host.kent.ac.uk

Submitting jobs

Jobs can be submitted directly from myrtle or raptor. You don't need to log in to any of the servers running the jobs. As detailed in the quickstart guide above, you can use srun to submit a job and receive immediate output, or you can use sbatch to queue a job and have the output stored in a file. For example:

tdb@myrtle:/cluster/home/cur/tdb$ srun hostname
cloud01
tdb@myrtle:/cluster/home/cur/tdb$
tdb@myrtle:/cluster/home/cur/tdb$ cat hostname.sh
#!/bin/sh
#SBATCH --mail-type=END
hostname
tdb@myrtle:/cluster/home/cur/tdb$ sbatch hostname.sh
Submitted batch job 160
tdb@myrtle:/cluster/home/cur/tdb$ # wait for the job to complete or email to arrive
tdb@myrtle:/cluster/home/cur/tdb$ cat slurm-160.out
cloud01
tdb@myrtle:/cluster/home/cur/tdb$

Partitions and requesting resources

The default partition is named test. This partition has a maximum run time of 1 hour per job and is intended for testing out commands to make sure they behave as expected. It should always have capacity available, so you shouldn't have to wait for long-running jobs to complete. When you run a command like srun without any arguments, as above, it'll run in this partition. To request an alternative partition use the -p flag:

tdb@myrtle:/cluster/home/cur/tdb$ srun -p cpu hostname
cloud02
tdb@myrtle:/cluster/home/cur/tdb$ srun -p gpu hostname
kepler01
tdb@myrtle:/cluster/home/cur/tdb$

By default your job will have 1GB of memory allocated. If you attempt to use more, your job will be killed. To request more, use the --mem flag to specify a larger amount, for example --mem=2G.
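
As a rough sketch, a memory request might look like this on the command line (the program name my_program is hypothetical):

srun -p cpu --mem=2G ./my_program

or, equivalently, as #SBATCH directives in a batch script:

#!/bin/sh
#SBATCH --partition=cpu
#SBATCH --mem=2G
./my_program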

If your job requires a GPU, in addition to specifying the gpu partition you will also need to request the required GPU resources. You can request either 1 or 2 GPUs, and specify whether you want the Volta generation (TITAN V), Pascal generation (P100), or Kepler generation (K40) cards. For example:

tdb@myrtle:/cluster/home/cur/tdb$ srun -p gpu --gres gpu:1 hostname
kepler01
tdb@myrtle:/cluster/home/cur/tdb$ srun -p gpu --gres gpu:pascal:1 hostname
pascal01
tdb@myrtle:/cluster/home/cur/tdb$

Please note that the GPUs will only be available if you request them with --gres. It is not enough to just select the GPU partition.
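
The same applies to batch jobs. As a rough sketch of a batch script requesting a single Pascal GPU (the script contents and the program name my_gpu_program are hypothetical; adjust the GPU type, count, and memory to suit your job):

#!/bin/sh
#SBATCH --partition=gpu
#SBATCH --gres=gpu:pascal:1
#SBATCH --mem=8G
./my_gpu_program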

Storage

Shared home directories are available on all the nodes in the cluster, and on myrtle and raptor. To access your cluster home directory change to it as follows:

tdb@myrtle:~$ pwd
/home/cur/tdb
tdb@myrtle:~$ cd /cluster/home/cur/tdb
tdb@myrtle:/cluster/home/cur/tdb$

Your directory is also available as \\csresws\exports\cluster or \\raptor\exports\cluster.

Any jobs you submit will run with the same working directory and environment as you have when you launch them. So if you change to your cluster home directory, or a sub-directory, before running jobs, you will get more predictable results.
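
For example, assuming a (hypothetical) project directory and job script, and replacing login with your own username:

cd /cluster/home/cur/login/myproject
sbatch job.sh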

Each machine also has a temporary local area mounted on /scratch. You're free to use this space if you need to create temporary files or extract data sets as your job starts running. This storage will be faster than the shared home directories. Please make sure to clean up your files when your job completes. If you require a larger amount of storage in /scratch, make sure to add the --tmp flag, e.g. --tmp=1G, to request a machine with enough space.
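
Here's a rough sketch of how a batch script might use /scratch; the data set, paths, and program name are hypothetical, and the per-job directory named after $SLURM_JOB_ID is just one way to avoid clashing with other jobs on the same node:

#!/bin/sh
#SBATCH --tmp=1G
# create a per-job directory on the fast local storage
SCRATCH=/scratch/$USER-$SLURM_JOB_ID
mkdir -p "$SCRATCH"
# extract the (hypothetical) data set into local storage
tar -xzf data.tar.gz -C "$SCRATCH"
# run the (hypothetical) program against the local copy
./my_program "$SCRATCH"
# clean up when the job completes
rm -rf "$SCRATCH"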

Checking the queue

You can check the queue by using the squeue command. For example:

tdb@myrtle:/cluster/home/cur/tdb$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               166      test hostname      tdb  R       0:04      1 cloud01
               167       cpu hostname      tdb  R       0:01      1 cloud02
tdb@myrtle:/cluster/home/cur/tdb$

Available software

The Hydra nodes contain a number of software packages including Java, Julia, R, Python, OCaml, and the usual Linux C compilers. We can likely install anything that's available in the Ubuntu 16.04 repositories.

The GPU machines additionally contain the Nvidia CUDA tools.
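
As a quick check that a GPU has actually been allocated to your job, you could run something like the following; nvidia-smi is part of the standard Nvidia driver tools, so this assumes it's on the PATH of the GPU nodes:

srun -p gpu --gres gpu:1 nvidia-smi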

As a general rule, to keep things consistent, we will try to maintain the same versions on these machines as on myrtle and raptor.

TensorFlow example

TensorFlow is a popular framework, so we've put together a short example of how to use it on the Hydra cluster. It may also serve as a useful starting point for your own programs. The example can be found here.

Recommendations

Here are some general recommendations to consider when creating jobs:

  • Break jobs down into smaller units when possible. This allows them to be spread over more nodes and do more in parallel; see the job array sketch after this list.
  • Try not to request more resources than you need, as doing so will limit the number of jobs that can run on a node.
  • Think about the amount of available resources before launching a huge number of jobs. Maybe run a batch, wait for results, then run another.
  • Keep an eye on the queue and what colleagues are doing. Communicate with us and your colleagues if your jobs are impacting each other.
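
One convenient way to break work into smaller units is a Slurm job array, where each task handles one chunk of the input. A minimal sketch (the script process_chunk.sh and the input file naming are hypothetical):

#!/bin/sh
#SBATCH --array=1-10
#SBATCH --mem=1G
# each task in the array gets its own SLURM_ARRAY_TASK_ID (1 to 10 here)
./process_chunk.sh input-$SLURM_ARRAY_TASK_ID.dat

Submitting this once with sbatch queues ten independent tasks, which the scheduler can spread across the available nodes.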

Questions?

Slurm has an extensive set of options and features and we're still learning how best to make use of them. If you have questions or ideas on how things could be done better, please let us know. Also, please get in touch if you think we could have mentioned something else in this guide, or if anything could have been clearer.

As usual, please contact us via cs-syshelp for any queries.