Hydra Cluster: Overview¶
Hydra is the School of Computing's High Performance Computing cluster. It consists of a number of CPU resources and a small number of GPU resources. We use the Slurm Workload Manager to manage jobs within the cluster.
The resources are split into two partitions (think of these as groups of machines): one for CPU-only jobs, and one for jobs that require a GPU. The GPU partition can also be used for CPU jobs with permission.
The GPU partition contains five servers with 10 GPUs of a mixture of architectures:
- One server with 4x NVIDIA A100 80GB GPUs. It also has 2x Intel Gold 5317 CPUs running at 3.00GHz, with a total of 24 cores (48 threads), and 384GB of RAM.
- One server with 1x NVIDIA TITAN V GPU. It has 2x Intel Xeon E5-2620 and 144GB of RAM.
- Two servers with 2x NVIDIA Tesla P100 GPUs in each. They also each have 2x Intel Gold 6136 CPUs running at 3.00GHz, giving a total of 24 cores (48 threads) per server, and 256GB of RAM.
- There's also an older server with an NVIDIA Tesla K40 GPU. This is slower but potentially useful for testing.
The CPU partition contains 14 servers each with 2x Intel Xeon E5520 CPUs running at 2.27GHz. Each server has 8 cores (16 threads), and between 12GB and 24GB of RAM. These servers are much older but can still provide a decent amount of processing power by virtue of having many cores in total. The partition also contains a few servers running within our cloud; we may increase or decrease the number of these during the year to make better use of spare resources within the cloud.
Access is currently open to staff and research postgraduate students within the School. You will need to contact us to get access and to discuss your requirements.
The Slurm Quick Start User Guide provides a good tutorial for getting started with Slurm. We won't replicate its contents here, but rather we'll detail the specifics of our setup. It's well worth reading through it, and possibly the rest of the Slurm user documentation, along with the notes on this page.
To perform these steps you'll need to be able to log in to a server named hydra using SSH. You can also use myrtle or raptor if you prefer. If you've not logged in to any of these machines before then you'll first need to set a password. Then you can log in using an SSH client, for example PuTTY on Windows. If you're using a Mac, Linux, or another Unix-like system with a command-line ssh, you can simply type the following, replacing login with your own username.
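As a sketch (the short name hydra is assumed to resolve from your network; you may need the fully qualified hostname instead):

```
ssh login@hydra
```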
Jobs can be submitted directly from hydra, or from myrtle or raptor. You don't need to log into any of the servers running the jobs. As detailed in the quickstart guide above you can use srun to submit a job and receive immediate output, or you can use sbatch to queue a job and have the output stored in a file. For example:
tdb@hydra:~$ srun hostname
cloud01
tdb@hydra:~$
tdb@hydra:~$ cat hostname.sh
#!/bin/sh
#SBATCH --mail-type=END
hostname
tdb@hydra:~$ sbatch hostname.sh
Submitted batch job 160
tdb@hydra:~$ # wait for the job to complete or email to arrive
tdb@hydra:~$ cat slurm-160.out
cloud01
tdb@hydra:~$
Partitions and requesting resources¶
The default partition is named test. This partition has a maximum run time of 1 hour per job and is intended for testing out commands to make sure they behave as expected. It should always have capacity available, so you shouldn't have to wait for long-running jobs to complete. When you run a command like srun without any arguments, as above, it'll run in this partition. To request an alternative partition, use the -p flag. For example:
tdb@hydra:~$ srun -p cpu hostname
cloud02
tdb@hydra:~$ srun -p gpu hostname
pascal01
tdb@hydra:~$
By default your job will have 1GB of memory allocated. If you attempt to use more, your job will be killed. To request more, use the --mem flag to specify a larger amount.
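An illustrative sketch, requesting 4GB (the node name in the output is hypothetical):

```
tdb@hydra:~$ srun --mem=4G hostname
cloud01
tdb@hydra:~$
```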
If your job requires a GPU, in addition to specifying the gpu partition you will also need to request the GPU resources themselves. You can request either 1 or 2 GPUs, and specify whether you want the Volta generation (TITAN V), Pascal generation (P100), or Kepler generation (K40) cards. For example:
tdb@hydra:~$ srun -p gpu --gres gpu:1 hostname
pascal01
tdb@hydra:~$ srun -p gpu --gres gpu:volta:1 hostname
volta01
tdb@hydra:~$
Please note that the GPUs will only be available if you request them with
--gres. It is not enough to just select the GPU partition.
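One way to check that a GPU is actually visible to your job is to list devices with nvidia-smi inside the allocation. A sketch, with the output abridged:

```
tdb@hydra:~$ srun -p gpu --gres gpu:1 nvidia-smi -L
GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-...)
tdb@hydra:~$
```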
Shared home directories are available on all the nodes in the cluster, and on myrtle and raptor. On myrtle and raptor your shared home directory is different to your normal home directory, so to access it you need to do the following:
tdb@myrtle:~$ clusterdir
Your cluster home directory is: /cluster/home/cur/tdb
tdb@myrtle:~$ cd /cluster/home/cur/tdb
tdb@myrtle:/cluster/home/cur/tdb$
On the hydra server, and on all the cluster nodes, this is your default home directory. You may find it more straightforward to use the hydra server to avoid needing to change directories.
Your directory is also available as
Any jobs you submit will run with the same working directory and environment as you have when you launch them. So if you change to your cluster home directory, or a sub-directory, before running jobs you will get more predictable results.
Each machine also has a temporary local area mounted on /scratch. You're free to use this space if you need to create temporary files or extract data sets as your job starts running. This storage will be faster than the shared home directories. Please make sure to clean up your files when your job completes. If you require a larger amount of storage in /scratch, make sure to add the --tmp flag, e.g. --tmp=1G, to request a machine with enough space.
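As a sketch, a job that needs roughly 10GB of local scratch space might be submitted like this (the script name is hypothetical):

```
tdb@hydra:~$ srun -p cpu --tmp=10G ./extract-dataset.sh
```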
We additionally have large shared storage mounted at /data on all nodes. If you need to store large amounts of data then this might be more appropriate to use. It may also have better performance for I/O-heavy tasks. Please get in touch with us if you'd like to use it.
Checking the queue¶
You can check the queue by using the squeue command. For example:
tdb@hydra:~$ squeue
 JOBID PARTITION     NAME  USER ST  TIME NODES NODELIST(REASON)
   166      test hostname   tdb  R  0:04     1 cloud01
   167       cpu hostname   tdb  R  0:01     1 cloud02
tdb@hydra:~$
The Hydra nodes contain a number of software packages including Java, Julia, R, Python, OCaml, and the usual Linux C compilers. We can likely install anything that's available in the Ubuntu 20.04 repositories.
The GPU machines additionally contain the Nvidia CUDA tools.
As a general rule, to keep things consistent, we will try to maintain the same versions on these machines as on myrtle and raptor. However, during upgrades they may diverge, in which case we'd recommend using the hydra server instead.
TensorFlow is a popular framework so we've put together a short example of how to use it on the Hydra cluster. It may also serve as a useful starting point for your own programs. The example can be found here.
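Independently of that example, a GPU batch script typically combines the flags covered above. A rough sketch (the script and program names are hypothetical, and the exact Python environment setup will depend on how TensorFlow is installed):

```
#!/bin/sh
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=8G
#SBATCH --mail-type=END

# Activate your own virtualenv or environment here if required,
# then run the (hypothetical) training program.
python3 train.py
```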
Here are some general recommendations to consider when creating jobs:
- Break jobs down into smaller units when possible. This allows them to be spread over more nodes and do more in parallel.
- Try not to request more resources than you need as it will limit the number of jobs that can run on a node.
- Think about the amount of available resources before launching a huge number of jobs. Maybe run a batch, wait for results, then run another.
- Keep an eye on the queue and what colleagues are doing. Communicate with us and your colleagues if your jobs are impacting each other.
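For the first point, Slurm's job arrays are a convenient way to split work into many small, independent tasks. A minimal sketch (the array range and script name are illustrative):

```
#!/bin/sh
#SBATCH --array=0-9
#SBATCH --mem=1G

# Each task in the array receives a distinct SLURM_ARRAY_TASK_ID,
# so a single script can process a different chunk of the input.
./process-chunk.sh "$SLURM_ARRAY_TASK_ID"
```

Submitting this once with sbatch queues ten tasks that the scheduler can spread across the cluster.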
Slurm has an extensive set of options and features and we're still learning how best to make use of them. If you have questions or ideas on how things could be done better please let us know. Also please get in touch if you think we could have mentioned something else in this guide, or if anything could have been clearer.
As usual, please contact us with any queries.