Hydra Cluster: Overview
Hydra is the School of Computing's High Performance Computing cluster. It consists of a number of CPU resources and a small number of GPU resources. We use the Slurm Workload Manager to manage jobs within the cluster.
The resources are split into two partitions (think of these as groups of machines): one for CPU-only jobs, and one for jobs that require a GPU. The GPU partition can also be used for CPU jobs with permission.
The GPU partition contains four servers. Two of them each have 2x Intel Gold 6136 CPUs running at 3.00GHz, giving a total of 24 cores (48 threads) per server, 256GB of RAM, and 2x Nvidia Tesla P100 GPUs. There's also a server with an Nvidia TITAN V, and another with an Nvidia Tesla K40 GPU.
The CPU partition contains 14 servers each with 2x Intel Xeon E5520 CPUs running at 2.27GHz. Each server has 8 cores (16 threads), and between 12GB and 24GB of RAM. These servers are much older but can still provide a decent amount of processing power by virtue of having many cores in total. The partition also contains a few servers running within our cloud; we may increase or decrease the number of these during the year to make better use of spare resources within the cloud.
Access is currently open to staff and research postgraduate students within the School. You will need to contact us to get access and to discuss your requirements.
The Slurm Quick Start User Guide provides a good tutorial for getting started with Slurm. We won't replicate its contents here, but rather we'll detail the specifics of our setup. It's well worth reading through it, and possibly the rest of the Slurm user documentation, along with the notes on this page.
To perform these steps you'll need to be able to log in to myrtle or raptor using SSH. If you've not done this before then you'll first need to set a password. Then you can log in using an SSH client, for example PuTTY on Windows. If you're using a Mac, Linux, or another Unix-like system with a command line ssh, you can simply type a command of the form ssh login@host, replacing login with your own username, and host with either myrtle or raptor as appropriate.
Jobs can be submitted directly from myrtle or raptor. You don't need to log in to any of the servers running the jobs. As detailed in the quickstart guide above you can use srun to submit a job and receive immediate output, or you can use sbatch to queue a job and have the output stored in a file. For example:
```
tdb@myrtle:/cluster/home/cur/tdb$ srun hostname
cloud01
tdb@myrtle:/cluster/home/cur/tdb$
tdb@myrtle:/cluster/home/cur/tdb$ cat hostname.sh
#!/bin/sh
#SBATCH --mail-type=END
hostname
tdb@myrtle:/cluster/home/cur/tdb$ sbatch hostname.sh
Submitted batch job 160
tdb@myrtle:/cluster/home/cur/tdb$ # wait for the job to complete or email to arrive
tdb@myrtle:/cluster/home/cur/tdb$ cat slurm-160.out
cloud01
tdb@myrtle:/cluster/home/cur/tdb$
```
Partitions and requesting resources
The default partition is named test. This partition has a maximum run time of
1 hour per job and is intended for testing out commands to make sure they
behave as expected. This partition should always have capacity available, so
you shouldn't have to wait for long-running jobs to complete. When you run a
command like srun without any arguments, as above, it'll run in this partition.
To request an alternative partition use the -p flag. For example:
```
tdb@myrtle:/cluster/home/cur/tdb$ srun -p cpu hostname
cloud02
tdb@myrtle:/cluster/home/cur/tdb$ srun -p gpu hostname
kepler01
tdb@myrtle:/cluster/home/cur/tdb$
```
By default your job will have 1GB of memory allocated. If you attempt to use
more, your job will be killed. To request more, use the --mem flag to specify a
larger amount, for example --mem=4G.
If your job requires a GPU, in addition to specifying the gpu partition you will also need to request the required GPU resources. You can request either 1 or 2 GPUs, and optionally specify whether you want the Volta generation (TITAN V), Pascal generation (P100) or Kepler generation (K40) cards. For example:
```
tdb@myrtle:/cluster/home/cur/tdb$ srun -p gpu --gres gpu:1 hostname
kepler01
tdb@myrtle:/cluster/home/cur/tdb$ srun -p gpu --gres gpu:pascal:1 hostname
pascal01
tdb@myrtle:/cluster/home/cur/tdb$
```
Please note that the GPUs will only be available if you request them with
--gres. It is not enough to just select the GPU partition.
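The partition, memory and GPU options described above can also be written into a batch script as #SBATCH directives, in the same way as the --mail-type line in the earlier hostname.sh example. The following is a minimal sketch, assuming a job that wants one Pascal card and 4GB of memory; the 4G figure is purely illustrative. Slurm normally sets CUDA_VISIBLE_DEVICES for jobs that are allocated GPUs through --gres, but treat that as an assumption to check on our nodes.

```sh
#!/bin/sh
#SBATCH --partition=gpu          # the GPU partition
#SBATCH --gres=gpu:pascal:1      # one Pascal (P100) card
#SBATCH --mem=4G                 # illustrative value; the default is 1GB
#SBATCH --mail-type=END

# Report where the job landed and which GPUs it can see.
# CUDA_VISIBLE_DEVICES is assumed to be set by Slurm when --gres is used;
# it falls back to "none" if the variable is missing.
gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "Running on $(hostname) with GPUs: $gpus"
```

Saved under a hypothetical name such as gpu-job.sh, this would be queued with sbatch gpu-job.sh, and the output would appear in the usual slurm-<jobid>.out file.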
Shared home directories are available on all the nodes in the cluster, and on myrtle and raptor. To access your cluster home directory change to it as follows:
```
tdb@myrtle:~% clusterdir
Your cluster home directory is: /cluster/home/cur/tdb
tdb@myrtle:~$ cd /cluster/home/cur/tdb
tdb@myrtle:/cluster/home/cur/tdb$
```
Your directory is also available as
Any jobs you submit will run with the same working directory and environment as you have when you launch them. So if you change to your cluster home directory, or a sub-directory, before running jobs you will get more predictable results.
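If you want a batch script to be explicit about where it runs, it can record the directory it was submitted from. This is a small sketch rather than a required pattern: it assumes Slurm's standard SLURM_SUBMIT_DIR variable is available inside the job, and falls back to the current directory when the script is run by hand.

```sh
#!/bin/sh
#SBATCH --partition=test

# SLURM_SUBMIT_DIR is set by Slurm to the directory sbatch was run from.
# Outside Slurm the variable is unset, so fall back to the current directory.
submit_dir="${SLURM_SUBMIT_DIR:-$PWD}"
cd "$submit_dir" || exit 1

echo "Job working directory: $(pwd)"
```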
Each machine also has a temporary local area mounted at /scratch. You're free
to use this space if you need to create temporary files or extract data sets as
your job starts running. This storage will be faster than the shared home
directories. Please make sure to clean up your files when your job
completes. If you require a larger amount of storage in /scratch, make sure to
add the --tmp flag, e.g. --tmp=1G, to request a machine with enough space.
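One way to make the clean-up automatic is to create a per-job directory under /scratch and remove it with a shell trap when the script exits. The sketch below is only an illustration under stated assumptions: the directory naming scheme and the data.txt file are hypothetical, and the fallback to TMPDIR exists only so the script can be tried on a machine without /scratch.

```sh
#!/bin/sh
#SBATCH --partition=cpu
#SBATCH --tmp=1G                 # ask for a node with at least 1G of temporary space

# Use /scratch on the cluster nodes; fall back to TMPDIR or /tmp elsewhere.
scratch_base="/scratch"
if [ ! -d "$scratch_base" ] || [ ! -w "$scratch_base" ]; then
    scratch_base="${TMPDIR:-/tmp}"
fi

# Create a private directory for this job (SLURM_JOB_ID is set by Slurm).
workdir=$(mktemp -d "$scratch_base/job-${SLURM_JOB_ID:-manual}-XXXXXX") || exit 1

# Remove the directory when the script exits, whether it succeeds or fails.
trap 'rm -rf "$workdir"' EXIT

# Hypothetical workload: write and then read back some intermediate data.
echo "intermediate data" > "$workdir/data.txt"
bytes=$(wc -c < "$workdir/data.txt")
echo "Wrote $bytes bytes to $workdir"
```

The trap runs on a normal exit; if a job is killed outright, for example for exceeding its memory allocation, the directory may be left behind, so it is still worth checking /scratch occasionally.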
Checking the queue
You can check the queue by using the squeue command. For example:
```
tdb@myrtle:/cluster/home/cur/tdb$ squeue
JOBID PARTITION     NAME USER ST TIME NODES NODELIST(REASON)
  166      test hostname  tdb  R 0:04     1 cloud01
  167       cpu hostname  tdb  R 0:01     1 cloud02
tdb@myrtle:/cluster/home/cur/tdb$
```
The Hydra nodes contain a number of software packages including Java, Julia, R, Python, OCaml, and the usual Linux C compilers. We can likely install anything that's available in the Ubuntu 20.04 repositories.
The GPU machines additionally contain the Nvidia CUDA tools.
As a general rule, to keep things consistent, we will try to maintain the same versions on these machines as on myrtle and raptor.
TensorFlow is a popular framework so we've put together a short example of how to use it on the Hydra cluster. It may also serve as a useful starting point for your own programs. The example can be found here.
Here are some general recommendations to consider when creating jobs:
- Break jobs down into smaller units when possible. This allows them to be spread over more nodes and do more in parallel.
- Try not to request more resources than you need as it will limit the number of jobs that can run on a node.
- Think about the amount of available resources before launching a huge number of jobs. Maybe run a batch, wait for results, then run another.
- Keep an eye on the queue and what colleagues are doing. Communicate with us and your colleagues if your jobs are impacting each other.
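Slurm job arrays are one way to follow the first and third recommendations together: a single script is run once per array index, and a throttle limits how many tasks run at the same time. The following is a sketch under assumptions, not a prescribed setup: the input-N.txt naming is hypothetical, and the array range and throttle are arbitrary.

```sh
#!/bin/sh
#SBATCH --partition=cpu
#SBATCH --array=0-9%4            # ten tasks, at most four running at once
#SBATCH --mem=1G

# SLURM_ARRAY_TASK_ID is set by Slurm for each array task.
# It defaults to 0 here so the script can be tried outside the cluster.
task="${SLURM_ARRAY_TASK_ID:-0}"

# Hypothetical naming scheme: each task handles its own input file.
input="input-${task}.txt"

echo "Task $task on $(hostname) would process $input"
```

Each task appears as its own entry in squeue, so colleagues can see how much of the cluster the array is occupying.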
Slurm has an extensive set of options and features and we're still learning how best to make use of them. If you have questions or ideas on how things could be done better please let us know. Also please get in touch if you think we could have mentioned something else in this guide, or if anything could have been clearer.
As usual, please contact us with any queries.