Today a user of our GPU cluster ran into a problem: executing python -c 'import torch; torch.tensor(1).cuda()' would hang forever and could not be killed. The problem occurred on a rather old Docker image (with torch == 0.4.0) and disappeared when newer images were used. It was caused by a few little-known coincidences, which surprised me and which I want to share in this post.

The Problem

The hanging program was spawned by the following command:

/usr/bin/docker run --rm -u 1457:1457 \
--gpus '"device='0,1,2,3'"' \
-v /ghome/username:/ghome/username -v /gdata/username:/gdata/username \
-it --ipc=host --shm-size 64G \
-v /gdata1/username:/gdata1/username -v /gdata2/username:/gdata2/username \
-e HOME=/ghome/username \
-m 48G --memory-swap 48G --cpus 5 \
--name username2 \
bit:5000/deepo_9 \
python3 -c 'import torch; torch.tensor(1).cuda()'

The Docker image bit:5000/deepo_9 he used was built with CUDA 9, while the host has multiple 1080 Ti GPUs and CUDA upgraded to 11.4. It looked like some kind of binary incompatibility, considering that the problem went away with newer images.

Step 1: Find out the incorrect arguments

But first, I had to confirm that there was no misconfiguration. Programs get stuck from time to time on our cluster, and some of those hangs are caused by user misconfiguration, such as incorrect docker run arguments. So I decided to try a stripped-down version of the command:

/usr/bin/docker run --rm -u 1457:1457 \
--gpus '"device='0,1,2,3'"' \
--name username2 \
bit:5000/deepo_9 \
python3 -c 'import torch; torch.tensor(1).cuda()'

This command worked just fine, which suggested that the problem came from some combination of the other arguments.

To find that combination, I added the arguments back one at a time. It turned out that -v /ghome/username:/ghome/username and -e HOME=/ghome/username together caused the hang.
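The search for the guilty arguments can also be automated. Below is a minimal sketch of a closely related technique: instead of adding arguments back, it starts from the full hanging command and greedily drops every argument group that is not needed to reproduce the hang. The function name, the grouping of arguments, and the hangs predicate are all hypothetical; here the predicate is a stand-in that simply models the finding above, whereas a real one would launch the container with a timeout.

```python
from typing import Callable, List, Sequence

Group = List[str]  # one docker option together with its value, e.g. ["-m", "48G"]


def minimal_hanging_groups(
    groups: Sequence[Group],
    hangs: Callable[[List[Group]], bool],
) -> List[Group]:
    """Greedily drop every group whose removal still reproduces the hang."""
    kept = list(groups)
    for group in groups:
        trial = [g for g in kept if g != group]
        if hangs(trial):
            # Still hangs without this group, so it is not part of the cause.
            kept = trial
    return kept


# Hypothetical argument groups, taken from the command in this post.
HOME_MOUNT = ["-v", "/ghome/username:/ghome/username"]
HOME_ENV = ["-e", "HOME=/ghome/username"]
IPC = ["--ipc=host"]
MEMORY = ["-m", "48G"]


def fake_hangs(trial: List[Group]) -> bool:
    # Stand-in predicate (an assumption): the hang needs both groups at once.
    return HOME_MOUNT in trial and HOME_ENV in trial


culprits = minimal_hanging_groups([HOME_MOUNT, IPC, HOME_ENV, MEMORY], fake_hangs)
print(culprits)
```

The result is only minimal in the sense that removing any single remaining group makes the hang disappear, which was enough for a case like this one.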

Step 2: Find out the trapped IO

The above finding offers two hints:

  1. Something performs IO under $HOME;
  2. This IO works with the default setting HOME=/, but hangs with HOME=/ghome/username.

So what’s different between / and /ghome/username? Writes under the former land in the container’s own file system layer, while writes under the latter go to an external volume backed by NFS. Some particular IO operation probably behaves differently on the two file systems.
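One way to check which file system a path lives on is to look it up in the mount table. The sketch below parses text in the /proc/mounts format (device, mount point, file system type, options) and picks the longest mount point containing the path. The function name and the sample mount table are made up for illustration, including the NFS server name; inside a real container you would pass in the contents of /proc/mounts.

```python
import os
from typing import Tuple


def fs_type_of(path: str, mounts_text: str) -> Tuple[str, str]:
    """Return (mount_point, fs_type) of the deepest mount that contains path."""
    path = os.path.normpath(path)
    best = ("", "")
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # ">=" lets a later entry win when two mounts share a mount point.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best[0]):
            best = (mount_point, fs_type)
    return best


# Hypothetical mount table; the device names are assumptions.
SAMPLE_MOUNTS = """\
overlay / overlay rw,relatime 0 0
proc /proc proc rw,nosuid 0 0
nfs-server:/export/ghome /ghome/username nfs4 rw,relatime 0 0
"""

print(fs_type_of("/.nv/ComputeCache", SAMPLE_MOUNTS))
print(fs_type_of("/ghome/username/.nv/ComputeCache", SAMPLE_MOUNTS))

# On a real Linux system (not run here):
#   with open("/proc/mounts") as f:
#       print(fs_type_of(os.path.expanduser("~"), f.read()))
```

The sketch ignores the octal escapes that /proc/mounts uses for mount points containing spaces.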

To find out which operation, I attached strace to the python process; it prints every syscall the program makes. The log rolled by and eventually stopped at the following lines:

open("/ghome/username/.nv/ComputeCache/index", O_RDWR) = 30
clock_gettime(CLOCK_MONOTONIC_RAW, {tv_sec=551059, tv_nsec=515612618}) = 0
fcntl(30, F_SETLK, {l_type=F_RDLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = ?
+++ killed by SIGKILL +++

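Now we know the culprit is a file lock attempt on /ghome/username/.nv/ComputeCache/index, which resides on NFS. Strictly speaking it is not flock but a POSIX advisory lock taken through fcntl with F_SETLK, which is the non-blocking variant, yet the call never returned. File locking over NFS is fragile, and when it misbehaves it typically shows up as a hung program.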
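The lock call from the trace can be reproduced without CUDA. The sketch below uses Python's fcntl.lockf, which issues fcntl(fd, F_SETLK, ...) with l_type=F_RDLCK when given LOCK_SH | LOCK_NB, covering the whole file just like the traced call. The helper name is hypothetical, and the demo runs against a temporary local file rather than the real cache index.

```python
import fcntl
import os
import tempfile


def try_read_lock(path: str) -> bool:
    """Try a non-blocking shared POSIX lock on the whole file, then release it."""
    fd = os.open(path, os.O_RDWR)
    try:
        # LOCK_SH -> F_RDLCK, LOCK_NB -> F_SETLK (instead of F_SETLKW);
        # the default start=0, len=0 means "the whole file".
        fcntl.lockf(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        return True
    except OSError:
        # Typically: another process holds a conflicting lock, or the
        # file system reports that locks are unavailable.
        return False
    finally:
        os.close(fd)


# Demo on a local temporary file (an assumption: the temp dir is not on NFS).
with tempfile.NamedTemporaryFile() as tmp:
    lock_ok = try_read_lock(tmp.name)
    print(lock_ok)
```

On a healthy file system the call returns immediately. Pointing it at a file on the problematic NFS mount should be done from a throwaway process, since in the case described here the same syscall never came back.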

Step 3: Who’s performing the IO?

Now it gets a bit weird. The Python process does almost nothing, and I could not figure out where the IO came from. So I typed .nv/ComputeCache into Google. The first result was CUDA Pro Tip: Understand Fat Binaries and JIT Caching, where I read the following:

nvcc, the CUDA compiler driver, uses a two-stage compilation model. The first stage compiles source device code to PTX virtual assembly, and the second stage compiles the PTX to binary code for the target architecture. The CUDA driver can execute the second stage compilation at run time, compiling the PTX virtual assembly “Just In Time” to run it. This JIT compilation can cause delay at application start-up time (or more accurately, CUDA context creation time). CUDA uses two approaches to mitigate start-up overhead on JIT compilation: fat binaries and JIT caching.

[…]

CUDA_CACHE_PATH specifies the directory location of compute cache files; the default values are:

  • […]
  • on Linux, ~/.nv/ComputeCache

It turns out that CUDA itself is doing the IO. Either the old PyTorch build shipped no fat binaries or the ones it shipped were incompatible with the GPU, so CUDA fell back to JIT compilation and, with it, the JIT cache. The article also mentions a potential problem (similar to what we hit) and the corresponding solution:

Cache stored on a Slow Network Share

On Linux, the default location of the CUDA JIT cache is in your home directory. On clusters, it is not uncommon to mount home directories with relatively poor performance to the compute nodes (by using the Lustre file system for scratch space, but only NFS for the home directory, for example). We have seen cases where this relatively slow connection to the home directory (and thus the JIT cache) resulted in very long application start-up times when the application was not built with code for the right SM version. Even more confusing, start-up time can vary from node to node due to intricacies of the NFS set up.

In this situation, it is best to build the application to avoid JIT entirely, and alternatively, to set CUDA_CACHE_PATH to point to a location on a fast file system.
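The second suggestion is easy to try from Python. Here is a minimal sketch, assuming the system temporary directory sits on a local file system (inside a container it usually lives in the container layer). The helper name and the directory naming scheme are hypothetical; CUDA_CACHE_PATH is the variable named in the quoted article. To be safe, the variable is set before torch is imported, so it is in place before any CUDA context is created.

```python
import os
import tempfile
from typing import Optional


def use_local_cuda_cache(cache_dir: Optional[str] = None) -> str:
    """Point CUDA_CACHE_PATH at a directory outside the (NFS) home directory."""
    if cache_dir is None:
        # Assumption: the temp dir is local; one cache directory per user id.
        cache_dir = os.path.join(
            tempfile.gettempdir(), "cuda-compute-cache-%d" % os.getuid()
        )
    os.makedirs(cache_dir, exist_ok=True)
    os.environ["CUDA_CACHE_PATH"] = cache_dir
    return cache_dir


local_cache = use_local_cuda_cache()
print(local_cache)

# Only now import and use CUDA (not run here):
#   import torch
#   torch.tensor(1).cuda()
```

The same effect can be had without touching any code by passing the variable to the container with docker run's -e option, as the original command already does for HOME.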

Epilogue

So it was a lesser-known (at least to me) feature of the CUDA runtime coinciding with an uncommon setup (NFS as the home directory). The result was surprising, but it also taught me some interesting facts beyond the user space boundary.

I tried to figure out why newer images work fine. They don’t trigger JIT caching, which suggests that their pre-compiled fat binaries are compatible with the current architecture. I compared the binaries by dumping information with cuobjdump, but noticed nothing (or maybe I am just not familiar enough with this stuff). Perhaps I should have started by comparing the compilation flags of the two PyTorch versions. I don’t have the time, so I gave up.

Oh, and lastly, strace is always our friend.