Hydra Cluster: Overview

Introduction

Hydra is the School of Computing's High Performance Computing cluster. It consists of a number of CPU resources and GPU resources. We use the Slurm Workload Manager to manage jobs within the cluster.

The resources are split into two partitions (think of these as groups of machines); one for CPU-only jobs, and one for jobs that require a GPU. The GPU partition can also be used for CPU jobs with permission.

The GPU partition contains seven servers with a total of 18 GPUs across a mixture of architectures:

  • Three servers with 4x NVIDIA A100 80GB GPUs in each. They also each have 2x Intel Xeon Gold 5317 CPUs running at 3.00GHz, giving a total of 24 cores (48 threads) per server, and 384GB of RAM.
  • One server with 1x NVIDIA TITAN V GPU. It has 2x Intel Xeon E5-2620 CPUs and 144GB of RAM.
  • Two servers with 2x NVIDIA Tesla P100 GPUs in each. They also each have 2x Intel Xeon Gold 6136 CPUs running at 3.00GHz, giving a total of 24 cores (48 threads) per server, and 256GB of RAM.
  • One older server with 1x NVIDIA Tesla K40 GPU. This is slower but potentially useful for testing.

The CPU partition contains 18 servers each with 2x Intel Xeon E5520 CPUs running at 2.27GHz. Each server has 8 cores (16 threads), and between 12GB and 24GB of RAM. These servers are much older but can still provide a decent amount of processing power by virtue of having many cores in total. The partition also contains a few servers running within our cloud; we may increase or decrease the number of these during the year to make better use of spare resources within the cloud.

Requesting access

Access is currently open to staff and research postgraduate students within the School. You will need to contact us to get access and to discuss your requirements.

Getting started

The Slurm Quick Start User Guide provides a good tutorial for getting started with Slurm. We won't replicate its contents here, but rather we'll detail the specifics of our setup. It's well worth reading through, along with the rest of the Slurm user documentation and the notes on this page.

To perform these steps you'll need to be able to log in to a server named hydra using SSH. You can also use myrtle or raptor if you prefer. If you've not logged in to any of these machines before then you'll first need to set a password. Then you can log in using an SSH client, for example PuTTY on Windows. If you're using a Mac, Linux, or another Unix-like system with a command-line ssh client, you can simply type the following, replacing login with your own username.

ssh login@hydra.kent.ac.uk

Submitting jobs

Jobs can be submitted directly from hydra, or from myrtle or raptor. You don't need to log in to any of the servers running the jobs. As detailed in the quick start guide above, you can use srun to submit a job and receive immediate output, or you can use sbatch to queue a job and have the output stored in a file. For example:

tdb@hydra:~$ srun hostname
cloud01
tdb@hydra:~$

tdb@hydra:~$ cat hostname.sh
#!/bin/sh
#SBATCH --mail-type=END
hostname
tdb@hydra:~$ sbatch hostname.sh
Submitted batch job 160
tdb@hydra:~$ # wait for the job to complete or email to arrive
tdb@hydra:~$ cat slurm-160.out
cloud01
tdb@hydra:~$

Partitions and requesting resources

The default partition is named test. This partition has a maximum run time of 1 hour per job and is intended for testing out commands to make sure they behave as expected. It should always have capacity available, so you shouldn't have to wait for long-running jobs to complete. When you run a command like srun without specifying a partition, as above, it'll run in this partition. To request an alternative partition use the -p flag:

tdb@hydra:~$ srun -p cpu hostname
cloud02
tdb@hydra:~$ srun -p gpu hostname
pascal01
tdb@hydra:~$

By default your job will have 1GB of memory allocated. If you attempt to use more, your job will be killed. To request more, use the --mem flag to specify a larger amount, for example --mem=2G.
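
The same request can go in a batch script via an #SBATCH line. A rough sketch, where my_program is just a placeholder for your own executable and 2G is only an illustration:

#!/bin/sh
#SBATCH --mem=2G
# Output written to standard output ends up in the usual slurm-<jobid>.out file
./my_program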

If your job requires a GPU, in addition to specifying the gpu partition you will also need to request the required GPU resources. You can request either 1 or 2 GPUs, and optionally specify whether you want the Ampere generation (A100), Volta generation (TITAN V), Pascal generation (P100) or Kepler generation (K40) cards. For example:

tdb@hydra:~$ srun -p gpu --gres gpu:1 hostname
pascal01
tdb@hydra:~$ srun -p gpu --gres gpu:ampere:1 hostname
ampere01
tdb@hydra:~$

Please note that the GPUs will only be available if you request them with --gres. It is not enough to just select the GPU partition.
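
The same flags work in a batch script. A minimal sketch, assuming one Ampere card and 8GB of memory are enough, and with train.py standing in as a placeholder for your own program:

#!/bin/sh
#SBATCH -p gpu
#SBATCH --gres=gpu:ampere:1
#SBATCH --mem=8G
# train.py is a placeholder; replace it with your own GPU program
python3 train.py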

Storage

Shared home directories are available on all the nodes in the cluster, and on myrtle and raptor. On myrtle and raptor your shared home directory is different to your normal home directory, so to access it you need to do the following:

tdb@myrtle:~$ clusterdir
Your cluster home directory is:
/cluster/home/cur/tdb
tdb@myrtle:~$ cd /cluster/home/cur/tdb
tdb@myrtle:/cluster/home/cur/tdb$

On the hydra server, and on all the cluster nodes, this is your default home directory. You may find it more straightforward to use the hydra server to avoid needing to change directories.

Your directory is also available as \\hydra.kent.ac.uk\exports\cluster.

Any jobs you submit will run with the same working directory and environment as you have when you launch them, so if you change to your cluster home directory, or a sub-directory of it, before submitting jobs you will get more predictable results.
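
If you'd rather not depend on the directory you happened to launch from, sbatch can also set the working directory explicitly with its --chdir option. A minimal sketch, assuming a sub-directory named myproject (a placeholder) under the example cluster home directory shown above:

#!/bin/sh
#SBATCH --chdir=/cluster/home/cur/tdb/myproject
# The job starts in myproject regardless of where sbatch was run from
pwd
hostname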

Each machine also has a temporary local area mounted on /scratch. You're free to use this space if you need to create temporary files or extract data sets as your job starts running. This storage will be faster than the shared home directories. Please make sure to clean up your files when your job completes. If you require a larger amount of storage in /scratch, make sure to add the --tmp flag, e.g. --tmp=1G, to request a machine with enough space.
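
A rough sketch of that pattern, where dataset.tar and my_program are placeholders for your own data and code:

#!/bin/sh
#SBATCH --tmp=10G
# Create a job-specific directory in the local scratch space
SCRATCH=/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$SCRATCH"
# Extract the data set locally for faster access during the run
tar -xf "$HOME/dataset.tar" -C "$SCRATCH"
# Run against the local copy, then clean up
./my_program "$SCRATCH"
rm -rf "$SCRATCH"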

Large storage

We additionally have large shared storage mounted at /data on all nodes. If you need to store large amounts of data then this might be more appropriate to use. It may also have better performance for I/O heavy tasks. Please get in touch with us if you'd like to use it.

Checking the queue

You can check the queue by using the squeue command. For example:

tdb@hydra:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               166      test hostname      tdb  R       0:04      1 cloud01
               167       cpu hostname      tdb  R       0:01      1 cloud02
tdb@hydra:~$
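
When the queue is busy you can narrow the output down with squeue's filter flags, for example by user or by partition:

# Show only your own jobs (replace tdb with your username)
squeue -u tdb
# Show only jobs in the gpu partition
squeue -p gpu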

Checking GPU usage

Although you can't get a shell prompt on the HPC nodes, you can use SSH to log in and check the output of nvidia-smi. This can be a useful way to confirm that your code is indeed using a GPU.

First check the node that your job is running on:

tdb@hydra:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            254208       gpu  python3      tdb  R       0:04      1 ampere01
tdb@hydra:~$

Then log in to the node, in this case ampere01, to check the nvidia-smi output:

tdb@hydra:~$ ssh ampere01.hydra.kent.ac.uk
tdb@ampere01.hydra.kent.ac.uk's password:
Sat Dec 10 22:06:23 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.48.07    Driver Version: 515.48.07    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:17:00.0 Off |                    0 |
| N/A   36C    P0    62W / 300W |  79306MiB / 81920MiB |      0%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:65:00.0 Off |                    0 |
| N/A   35C    P0    43W / 300W |      0MiB / 81920MiB |      0%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  On   | 00000000:CA:00.0 Off |                    0 |
| N/A   36C    P0    46W / 300W |      0MiB / 81920MiB |      0%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  On   | 00000000:E3:00.0 Off |                    0 |
| N/A   36C    P0    45W / 300W |      0MiB / 81920MiB |      0%   E. Process |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1539804      C   python3                         79304MiB |
+-----------------------------------------------------------------------------+
Connection to ampere01.hydra.kent.ac.uk closed.
tdb@hydra:~$

Here you can see that there is one python3 process using GPU 0. The other three GPUs are unused.

Available software

The Hydra nodes contain a number of software packages including Java, Julia, R, Python, OCaml, and the usual Linux C compilers. We can likely install anything that's available in the Ubuntu 20.04 repositories.

The GPU machines additionally contain the NVIDIA CUDA tools.

As a general rule, to keep things consistent, we will try to maintain the same versions on these machines as on myrtle and raptor. However, during upgrades they may diverge, in which case we'd recommend using the hydra server instead.
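
If you want to check which version of a tool is installed on the cluster nodes themselves, one option is to run the version query through srun, assuming the tool is on the default PATH:

srun -p cpu python3 --version
srun -p gpu nvcc --version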

TensorFlow example

TensorFlow is a popular framework so we've put together a short example of how to use it on the Hydra cluster. It may also serve as a useful starting point for your own programs. The example can be found here.
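
As a minimal sketch only (not the linked example), a batch script that asks TensorFlow to list the GPUs it can see might look like the following; it assumes TensorFlow is already installed in the Python environment the job uses:

#!/bin/sh
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=8G
# Prints the GPUs visible to TensorFlow; an empty list means no GPU was allocated
python3 -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'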

Conda example

Anaconda is a great way to manage separate environments containing different versions of Python and related libraries. We have some example documentation here.
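
As a rough sketch, assuming you've already created an environment called myenv and that conda is installed under ~/miniconda3 (both are placeholders; see the linked documentation for the full setup), a batch job could activate it like this:

#!/bin/bash
#SBATCH -p cpu
#SBATCH --mem=4G
# Make the conda command available in this non-interactive shell, then activate the environment
source "$HOME/miniconda3/etc/profile.d/conda.sh"
conda activate myenv
# my_script.py is a placeholder for your own program
python3 my_script.py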

Recommendations

Here are some general recommendations to consider when creating jobs:

  • Break jobs down into smaller units when possible. This allows them to be spread over more nodes and do more in parallel (one approach, job arrays, is sketched after this list).
  • Try not to request more resources than you need as it will limit the number of jobs that can run on a node.
  • Think about the amount of available resources before launching a huge number of jobs. Maybe run a batch, wait for results, then run another.
  • Keep an eye on the queue and what colleagues are running, and communicate with us and with them if your jobs are impacting each other.
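
On the first point, one convenient way to break work into smaller units is a Slurm job array, which submits many near-identical tasks in one go. A minimal sketch, where process.sh is a placeholder for your own per-task script:

#!/bin/sh
#SBATCH -p cpu
#SBATCH --array=1-10
#SBATCH --mem=1G
# Each of the 10 array tasks gets its own value of SLURM_ARRAY_TASK_ID
./process.sh "$SLURM_ARRAY_TASK_ID"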

Questions

Slurm has an extensive set of options and features and we're still learning how best to make use of them. If you have questions or ideas on how things could be done better please let us know. Also please get in touch if you think we could have mentioned something else in this guide, or if anything could have been clearer.

As usual, please contact us with any queries.