Hydra Cluster: TensorFlow

Introduction

This document walks through a basic example of using TensorFlow on the Hydra cluster. If you have not already done so, read the main documentation first. The steps are based on the virtualenv example given in the TensorFlow documentation.

Installing TensorFlow

If you need TensorFlow 1, see the alternative installation method below.

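Start on the hydra server, or in your cluster home directory on myrtle or raptor, and create a new virtual environment. In this case, we'll call the directory tensorflow and use Python 3. The --system-site-packages option lets the new environment also see Python packages that are installed system-wide.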

tdb@hydra:~$ python3 -m venv --system-site-packages tensorflow
tdb@hydra:~$

Now activate the environment. Notice the prompt changes to indicate you're inside the new virtual environment.

tdb@hydra:~$ source tensorflow/bin/activate
(tensorflow) tdb@hydra:~$

Now upgrade pip and then install TensorFlow. Output is trimmed here for brevity. Please be patient - this step can take a while.

(tensorflow) tdb@hydra:~$ pip install --upgrade pip
Collecting pip
  Using cached pip-21.2.4-py3-none-any.whl (1.6 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 20.0.2
    Uninstalling pip-20.0.2:
      Successfully uninstalled pip-20.0.2
Successfully installed pip-21.2.4
(tensorflow) tdb@hydra:~$

(tensorflow) tdb@hydra:~$ pip install --ignore-installed --upgrade tensorflow
Collecting tensorflow
  Using cached tensorflow-2.6.0-cp38-cp38-manylinux2010_x86_64.whl (458.4 MB)
...
Successfully installed absl-py-0.14.1 astunparse-1.6.3 cachetools-4.2.4 certifi-2021.10.8 charset-normalizer-2.0.6 clang-5.0 flatbuffers-1.12 gast-0.4.0 google-auth-1.35.0 google-auth-oauthlib-0.4.6 google-pasta-0.2.0 grpcio-1.41.0 h5py-3.1.0 idna-3.2 keras-2.6.0 keras-preprocessing-1.1.2 markdown-3.3.4 numpy-1.19.5 oauthlib-3.1.1 opt-einsum-3.3.0 protobuf-3.18.1 pyasn1-0.4.8 pyasn1-modules-0.2.8 requests-2.26.0 requests-oauthlib-1.3.0 rsa-4.7.2 setuptools-58.2.0 six-1.15.0 tensorboard-2.6.0 tensorboard-data-server-0.6.1 tensorboard-plugin-wit-1.8.0 tensorflow-2.6.0 tensorflow-estimator-2.6.0 termcolor-1.1.0 typing-extensions-3.7.4.3 urllib3-1.26.7 werkzeug-2.0.2 wheel-0.37.0 wrapt-1.12.1
(tensorflow) tdb@hydra:~$

If you're using this example as a starting point for your own code, you can install additional Python packages within this virtual environment as required.
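Before going further, it can be useful to confirm that Python is really running from the new virtual environment and that TensorFlow can be found. The following is a minimal sketch using only the standard library; the function names are illustrative and not part of any Hydra tooling. Because the environment was created with --system-site-packages, a system-wide copy of a package would also be reported as importable.

```python
import importlib.util
import sys


def in_virtualenv():
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the system Python installation.
    return sys.prefix != sys.base_prefix


def is_installed(package):
    """Return True when `package` can be located, without importing it."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
    print("tensorflow importable:", is_installed("tensorflow"))
```

Run it with the environment activated; the interpreter prefix should then end in the tensorflow directory created earlier.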

Testing TensorFlow

To test TensorFlow we'll create a short program taken from the TensorFlow documentation. We'll also create a shell script to configure the environment and run it. Use a text editor to create the two files shown below, substituting your own home directory in the second file.

(tensorflow) tdb@hydra:~$ cat tftest.py
import tensorflow as tf
print(tf.reduce_sum(tf.random.normal([1000, 1000])))

(tensorflow) tdb@hydra:~$ cat tftest.sh
#!/bin/sh

. /cluster/home/cur/tdb/tensorflow/bin/activate
python /cluster/home/cur/tdb/tftest.py

(tensorflow) tdb@hydra:~$

Now we can try running it on the Hydra cluster. The -p gpu option selects the gpu partition and --gres gpu:1 requests one GPU.

(tensorflow) tdb@hydra:~$ srun -p gpu --gres gpu:1 ./tftest.sh
2021-10-09 11:51:55.982457: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-10-09 11:52:02.023876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15405 MB memory:  -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:3b:00.0, compute capability: 6.0
tf.Tensor(-872.1127, shape=(), dtype=float32)
(tensorflow) tdb@hydra:~$

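Or we can submit it as a batch job. By default, Slurm writes the job's output to a file named slurm-<jobid>.out in the directory the job was submitted from.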

(tensorflow) tdb@hydra:~$ sbatch -p gpu --gres gpu:1 ./tftest.sh
Submitted batch job 141658
(tensorflow) tdb@hydra:~$

(tensorflow) tdb@hydra:~$ cat slurm-141658.out
2021-10-09 11:52:56.659022: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-10-09 11:52:57.191003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15405 MB memory:  -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:3b:00.0, compute capability: 6.0
tf.Tensor(-760.46423, shape=(), dtype=float32)
(tensorflow) tdb@hydra:~$
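The "Created device ... GPU:0" line in the logs above shows that TensorFlow found the allocated GPU. If you would rather check this from your own code, something like the following sketch may help. It assumes TensorFlow 2, where tf.config.list_physical_devices is available; the function name visible_gpus is just an illustration.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None without TensorFlow."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in the active environment.
        return None
    # TensorFlow 2 API: lists the devices without allocating memory on them.
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not importable in this environment")
    elif gpus:
        print("GPUs visible to TensorFlow:", ", ".join(gpus))
    else:
        print("TensorFlow imported, but no GPU is visible")
```

Like tftest.py, this needs to be run through srun or sbatch with --gres gpu:1 to see a GPU; run directly on the hydra server it would be expected to report none.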

Alternative installation method for TensorFlow 1

Some users need TensorFlow version 1 rather than version 2, which is what the instructions above install. The problem is that the normal TensorFlow 1 releases are built against older CUDA libraries, and those libraries don't support the newer GPUs that we have.

Fortunately, NVIDIA package their own build of TensorFlow version 1 using the newer CUDA libraries. To install it, use the following pip commands instead of the pip install command for TensorFlow shown above. Output is trimmed here for brevity.

(tensorflow) tdb@hydra:~$ pip install nvidia-pyindex
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
/usr/share/python-wheels/urllib3-1.25.8-py2.py3-none-any.whl/urllib3/connectionpool.py:999: InsecureRequestWarning: Unverified HTTPS request is being made to host 'pypi.ngc.nvidia.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
Collecting nvidia-pyindex
  Downloading nvidia-pyindex-1.0.9.tar.gz (10 kB)
Building wheels for collected packages: nvidia-pyindex
  Building wheel for nvidia-pyindex (setup.py) ... done
  Created wheel for nvidia-pyindex: filename=nvidia_pyindex-1.0.9-py3-none-any.whl size=8398 sha256=4b7373efedcb073ece3fd299f1c440a5b299959aacc15d9419aa0f6ba2b339f1
  Stored in directory: /tmp/pip-ephem-wheel-cache-7yfuqdmm/wheels/e0/c2/fb/5cf4e1cfaf28007238362cb746fb38fc2dd76348331a748d54
Successfully built nvidia-pyindex
Installing collected packages: nvidia-pyindex
Successfully installed nvidia-pyindex-1.0.9
(tensorflow) tdb@hydra:~$

(tensorflow) tdb@hydra:~$ pip install nvidia-tensorflow[horovod]
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
/usr/share/python-wheels/urllib3-1.25.8-py2.py3-none-any.whl/urllib3/connectionpool.py:999: InsecureRequestWarning: Unverified HTTPS request is being made to host 'pypi.ngc.nvidia.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
Collecting nvidia-tensorflow[horovod]
  Downloading https://developer.download.nvidia.com/compute/redist/nvidia-tensorflow/nvidia_tensorflow-1.15.5%2Bnv22.02-3927706-cp38-cp38-linux_x86_64.whl (818.8 MB)
...
Successfully installed absl-py-1.0.0 astor-0.8.1 astunparse-1.6.3 cloudpickle-2.0.0 gast-0.3.3 google-pasta-0.2.0 grpcio-1.44.0 h5py-2.10.0 keras-applications-1.0.8 keras-preprocessing-1.1.2 numpy-1.18.5 nvidia-cublas-cu116-11.8.1.74 nvidia-cuda-cupti-cu116-11.6.55 nvidia-cuda-nvcc-cu116-11.6.55 nvidia-cuda-runtime-cu116-11.6.55 nvidia-cudnn-cu115-8.3.2.44 nvidia-cufft-cu116-10.7.0.55 nvidia-curand-cu116-10.2.9.55 nvidia-cusolver-cu116-11.3.2.55 nvidia-cusparse-cu116-11.7.1.55 nvidia-dali-cuda110-1.10.0 nvidia-dali-nvtf-plugin-1.10.0+nv22.2 nvidia-horovod-0.23.0+nv22.2 nvidia-nccl-cu116-2.11.4 nvidia-tensorflow-1.15.5+nv22.2 opt-einsum-3.3.0 psutil-5.9.0 tensorboard-1.15.0 tensorflow-estimator-1.15.1 termcolor-1.1.0 werkzeug-2.0.3 wrapt-1.13.3
(tensorflow) tdb@hydra:~$

The test program shown earlier won't work with this build because it's specific to TensorFlow version 2. The principle is the same, but tftest.py should contain something like this instead:

(tensorflow) tdb@hydra:~$ cat tftest.py
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))

(tensorflow) tdb@hydra:~$
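If you switch between the two installations, a single test script that picks the right code path may be convenient. The sketch below is one possible way to do that, not an official recipe: it assumes tf.__version__ begins with the major version number (the NVIDIA package above is versioned 1.15.5+nv22.2), and the helper names are hypothetical.

```python
def major_version(version):
    """Return the leading integer of a version string such as '1.15.5+nv22.2'."""
    return int(version.split(".", 1)[0])


def run_smoke_test():
    """Run the TensorFlow 2 or TensorFlow 1 test from this page, as appropriate."""
    import tensorflow as tf

    if major_version(tf.__version__) >= 2:
        # TensorFlow 2 executes eagerly, so the result can be printed directly.
        print(tf.reduce_sum(tf.random.normal([1000, 1000])))
    else:
        # TensorFlow 1 builds a graph that has to be run inside a session.
        hello = tf.constant("Hello, TensorFlow!")
        with tf.Session() as sess:
            print(sess.run(hello))


if __name__ == "__main__":
    try:
        run_smoke_test()
    except ImportError:
        print("TensorFlow is not importable in this environment")
```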

Using containers

We've installed Apptainer on Hydra to allow the use of containers. Please see this documentation for further information, specifically the section on using NVIDIA's prebuilt TensorFlow container.

Using Anaconda

An alternative to Python virtual environments is Anaconda. We don't currently have documentation on this, but it may be an option to explore if virtual environments don't meet your needs.

Take note when running on CPUs, rather than GPUs

We've had issues reported when running TensorFlow on older CPUs without the AVX instruction set. If you're using the gpu partition then you're fine, but if you are using the cpu partition you should pass the -C avx flag to srun or sbatch so that your job only runs on machines with the newer CPUs.
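If you are unsure whether a given node has AVX, you can check from within a job. The following is a rough sketch that reads the CPU feature flags from /proc/cpuinfo; it assumes a Linux x86 node, where that file contains a "flags" line, and the function names are illustrative only.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Every core repeats the same flags line, so the first one is enough.
            return set(value.split())
    return set()


def has_avx(cpuinfo_text):
    """Return True when the 'avx' flag is listed."""
    return "avx" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print("AVX supported:", has_avx(cpuinfo.read_text()))
    else:
        print("/proc/cpuinfo not found; this check only works on Linux")
```

Running this through srun on the cpu partition, with and without -C avx, should show whether the constraint is selecting the machines you expect.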

As usual, please contact us with any queries.