Hydra Cluster: Ollama
Introduction
This document walks through the basics of running Ollama on the Hydra cluster.
Downloading the container image for use in Apptainer
The following downloads the Ollama container image from Docker Hub and saves it locally in a format that Apptainer can use.
tdb@hydra:~$ apptainer pull ollama.sif docker://ollama/ollama
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
Getting image source signatures
Copying blob f6b71baa717c done
Copying blob 13b7e930469f done
Copying blob 97ca0261c313 done
Copying blob f6a9ed9582e4 done
Copying config 350459d6b6 done
Writing manifest to image destination
Storing signatures
2025/06/11 14:21:37 info unpack layer: sha256:13b7e930469f6d3575a320709035c6acf6f5485a76abcf03d1b92a64c09c2476
2025/06/11 14:21:39 info unpack layer: sha256:97ca0261c3138237b4262306382193974505ab6967eec51bbfeb7908fb12b034
2025/06/11 14:21:39 info unpack layer: sha256:f6a9ed9582e4fe45b37963606e0f3b1d2741580e399f984dceae124d3e570b2b
2025/06/11 14:21:40 info unpack layer: sha256:f6b71baa717ca7f46f6643a7a515acf8886ef09495b1419cf3a74948c2f1daca
INFO: Creating SIF file...
tdb@hydra:~$
This will take a few minutes. The same command can be repeated later if a newer version of the container is needed.
Starting the server process
By default the container runs the serve command, so it can be run through Slurm as follows:
tdb@hydra:~$ srun -p gpu --gres gpu:1 apptainer run --nv ollama.sif
Couldn't find '/home/cur/tdb/.ollama/id_ed25519'. Generating new private key.
Your new public key is:
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIBaGs8YJ6Lz23fcd4AqLz75v0uTD0KVrFR8onkyRWjbs
time=2025-06-11T14:29:23.871+01:00 level=INFO source=routes.go:1234 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL:0 HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/home/cut/tdb1/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:0 http_proxy: https_proxy: no_proxy:]"
time=2025-06-11T14:29:23.875+01:00 level=INFO source=images.go:479 msg="total blobs: 0"
time=2025-06-11T14:29:23.875+01:00 level=INFO source=images.go:486 msg="total unused blobs removed: 0"
time=2025-06-11T14:29:23.876+01:00 level=INFO source=routes.go:1287 msg="Listening on [::]:11434 (version 0.9.0)"
time=2025-06-11T14:29:23.881+01:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-06-11T14:29:24.215+01:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-c28149a0-334e-3a75-f47f-896406db7a36 library=cuda variant=v12 compute=8.0 driver=12.9 name="NVIDIA A100 80GB PCIe" total="79.3 GiB" available="78.8 GiB"
This requests a single GPU, but more can be requested if needed. You may also need to adjust the memory request.
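For example, a request for two GPUs and more memory might look like this (the values are illustrative; pick what your model actually needs):

srun -p gpu --gres gpu:2 --mem 64G apptainer run --nv ollama.sif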
If you want to leave this running then you could use sbatch instead of srun to launch a background job, or you could look at screen or tmux to keep your terminal session alive when you disconnect. Take care to shut the server down when you're done, though.
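As a sketch, a batch script along the following lines could be submitted with sbatch (the job name, memory and time limit here are illustrative; adjust them to suit your needs):

#!/bin/bash
#SBATCH --job-name=ollama
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --time=08:00:00

apptainer run --nv ollama.sif

Remember to cancel the job with scancel when you have finished with the server.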
Running the client process
These steps need to be performed in a new terminal window. First you need to determine where the server is running:
tdb@hydra:~$ squeue --me
SUBMIT_TIME JOBID USER ACCOUNT PARTITION QOS NAME ST TIME NODES CPUS MIN_CPUS MIN_MEMORY NODELIST(REASON) PRIORITY TRES_PER_NODE
2025-06-11T14:29:22 424960 tdb staff gpu normal apptainer R 4:22 1 1 1 1G ampere03 23085 gres/gpu:1
tdb@hydra:~$
So in this case the server is running on ampere03, under the NODELIST column. You may have multiple jobs running, so check the submit time and name to find the right one.
Now we can connect to it by specifying the server name (the APPTAINERENV_ prefix passes the OLLAMA_HOST variable into the container) and running the container again, this time also specifying which model to use, e.g. DeepSeek-R1.
tdb@hydra:~$ APPTAINERENV_OLLAMA_HOST=ampere03 apptainer run ollama.sif run deepseek-r1:8b
pulling manifest
pulling e6a7edc1a4d7: 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 5.2 GB
pulling c5ad996bda6e: 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 556 B
pulling 6e4c38e1172f: 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 1.1 KB
pulling ed8474dc73db: 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 179 B
pulling f64cd5418e4b: 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 487 B
verifying sha256 digest
writing manifest
success
>>> Send a message (/? for help)
It will now respond to queries:
>>> Hello, how are you?
Thinking...
Okay, the user greeted me with "Hello, how are you?" in English. This seems like a casual and friendly opening, probably looking for a warm interaction rather than immediate information.
I'm responding in kind to build rapport - matching their greeting tone helps create comfort. The phrasing keeps it simple yet personable since humans typically don't ask AI about our state
technically.
The user might be testing my conversational abilities or genuinely interested in social engagement before asking actual questions. No deep needs are apparent here as this is likely just an
initial contact attempt.
...done thinking.
Hello! I'm doing well, thank you for asking. How can I assist you today?
>>> Send a message (/? for help)
The Ollama Python Library may also be useful for interacting with the server.
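As a minimal sketch, assuming the ollama package is installed (e.g. with pip install ollama) and the server is still running on ampere03 as above:

import ollama

# Point the client at the node where the Ollama server is running.
client = ollama.Client(host="http://ampere03:11434")

# Send a chat message to the model and print the reply.
response = client.chat(
    model="deepseek-r1:8b",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(response["message"]["content"])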
Finishing up
When you're done, you can use /bye to exit the client:
>>> /bye
tdb@hydra:~$
To shut down the server, simply press CTRL+c twice in the terminal where the server is running. You should see something like:
[GIN] 2025/06/11 - 14:39:47 | 200 | 33.942735247s | 129.12.4.124 | POST "/api/generate"
[GIN] 2025/06/11 - 14:41:14 | 200 | 17.670504218s | 129.12.4.124 | POST "/api/chat"
[GIN] 2025/06/11 - 14:41:26 | 200 | 4.062454326s | 129.12.4.124 | POST "/api/chat"
[GIN] 2025/06/11 - 14:41:29 | 200 | 108.145µs | 129.12.4.124 | HEAD "/"
[GIN] 2025/06/11 - 14:41:29 | 200 | 41.0985ms | 129.12.4.124 | POST "/api/show"
[GIN] 2025/06/11 - 14:41:29 | 200 | 19.809376ms | 129.12.4.124 | POST "/api/generate"
[GIN] 2025/06/11 - 14:41:36 | 200 | 1.598879374s | 129.12.4.124 | POST "/api/chat"
^Csrun: interrupt (one more within 1 sec to abort)
srun: StepId=424960.0 task 0: running
^Csrun: sending Ctrl-C to StepId=424960.0
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 424960.0 ON ampere03 CANCELLED AT 2025-06-11T14:45:17 ***
tdb@hydra:~$
And verify with squeue:
tdb@hydra:~$ squeue --me
SUBMIT_TIME JOBID USER ACCOUNT PARTITION QOS NAME ST TIME NODES CPUS MIN_CPUS MIN_MEMORY NODELIST(REASON) PRIORITY TRES_PER_NODE
tdb@hydra:~$
Next time you launch the server the process is the same, but the models that you downloaded will already be available.
As usual, please contact us with any queries.