So far so good, provided that the compiler in the toolkit agrees on APIs with the driver it targets.
Sadly, that is not the norm. In the real world, developers compile programs on one machine and dispatch them to run on others, expecting that programs compiled by a specific version of the CUDA toolkit can run on a wide variety of hardware; otherwise users would complain about broken binaries.
To provide this guarantee, several compatibility policies were introduced.
Before we introduce the policies, we should know how the components are versioned. The CUDA toolkit and the drivers adopt different versioning schemes: the toolkit is versioned like 11.2, while drivers are versioned like 460.65. Therefore, “driver 460.65” refers to the version of libcuda.so and nvidia.ko; similarly, when somebody says “CUDA 11.2”, it is the toolkit version being mentioned.
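If you are not sure which versions you are dealing with, the following commands are one way to check them on a typical Linux host; the library path below matches the Ubuntu layout used later in this post and may differ on other distributions:
$ cat /proc/driver/nvidia/version                              # kernel-mode driver (nvidia.ko)
$ nvidia-smi --query-gpu=driver_version --format=csv,noheader  # driver version reported by the driver stack
$ ls /usr/lib/x86_64-linux-gnu/libcuda.so.*                    # user-mode driver (libcuda.so), version encoded in the file name
$ nvcc --version                                               # CUDA toolkit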
NVIDIA devises multiple rules to ensure that user binaries work on a wide range of driver-hardware combinations. They can be grouped into two categories: toolkit-driver compatibility and UMD-KMD compatibility.
The toolkit-driver compatibility policies specify on which driver versions a binary compiled by a specific CUDA toolkit can run.
The most basic one is “Backward Compatibility”. Each CUDA toolkit has a so-called toolkit driver version 3. Binaries compiled by that toolkit are guaranteed to run on drivers no older than the toolkit driver version. For example, the toolkit driver version of CUDA 11.2 is 460.27.03, which means binaries compiled by CUDA 11.2 should work on any driver >= 460.27.03. This is the most fundamental and long-standing policy.
From CUDA 11 onwards, another policy named “Minor Version Compatibility” 4 was introduced. This policy allows binaries compiled by toolkits with the same major version to share the same driver version requirement. For example, binaries compiled by CUDA 11.0 work on drivers >= 450.36.06. Since CUDA 11.2 has the same major version as CUDA 11.0, binaries compiled by CUDA 11.2 also work on drivers >= 450.36.06 5.
Backward compatibility ensures that compiled binaries keep working on machines shipped with drivers of future versions, while minor version compatibility reduces the need to upgrade drivers in order to run newly compiled binaries. Generally, a binary compiled by CUDA toolkit $X.Y$ should work with a driver of version $M$ if either of the following holds:
- (Backward Compatibility) $M$ is no older than the toolkit driver version of CUDA $X.Y$; or
- (Minor Version Compatibility) $X \ge 11$ and $M$ is no older than the toolkit driver version of CUDA $X.0$.
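As a rough illustration of these two rules, here is a minimal shell sketch (not an official tool) that checks whether the host driver can run a binary built with CUDA 11.2, using the 460.27.03 and 450.36.06 thresholds quoted above; adjust the numbers for other toolkit versions:
#!/usr/bin/env bash
# Minimal sketch: toolkit-driver compatibility check for a CUDA 11.2 binary.
driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"

version_ge() {  # succeeds if version $1 >= version $2
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

if version_ge "$driver" "460.27.03"; then
    echo "driver $driver: fine via Backward Compatibility (>= 460.27.03)"
elif version_ge "$driver" "450.36.06"; then
    echo "driver $driver: fine via Minor Version Compatibility (same major version 11, >= 450.36.06)"
else
    echo "driver $driver: too old for a CUDA 11.2 binary"
fi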
However, the above policies only consider the relationship between the CUDA toolkit and the driver as a whole. What if the user-mode and kernel-mode drivers have diverging versions? This is where UMD-KMD compatibility comes in.
In the ideal case, a kernel-mode driver always works with a user-mode driver of the same version. But upgrading kernel-mode drivers is sometimes tricky and troublesome, a risk that some users, such as data center admins, cannot afford. With this in mind, NVIDIA devised “Forward Compatibility” to allow an older KMD to cooperate with a newer UMD under certain circumstances.
Specifically, a kernel-mode driver supports all user-mode drivers released during its lifetime. For instance, driver 418.x reached end of life (EOL) in March 2022, before which driver 460.x was released, so KMD 418.x works with UMD 460.x. This compatibility does not involve anything at a higher level such as the CUDA toolkit.
It is worth noting that this policy does not apply to all GPUs but only to a fraction of them. NVIDIA limits forward compatibility to systems with NVIDIA Data Center GPUs (formerly the Tesla brand) or NGC Server Ready SKUs of RTX cards 6. If you own a GeForce RTX 3090, as in my scenario, you won't enjoy this feature.
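To see whether your GPU could even be considered for forward compatibility, ask the driver for the product name; the output below is roughly what a GeForce card like mine reports, and anything outside the Data Center / NGC Server Ready families is out of scope:
$ nvidia-smi --query-gpu=name --format=csv,noheader
NVIDIA GeForce RTX 3090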
Let's make a quick review of the various compatibility policies. If you have a binary compiled by CUDA $X.Y$, and a host with UMD (libcuda.so) versioned $M$ and KMD (nvidia.ko) versioned $M'$, then they work fine if both of the following conditions hold:
1. The UMD and the KMD are compatible, i.e., either of:
  1.1. (Forward Compatibility) the KMD supports the UMD, which was released during the KMD's lifetime, and the GPU is eligible for forward compatibility; or
  1.2. $M$ equals $M'$.
2. The binary and the UMD are compatible, i.e., either of:
  2.1. (Backward Compatibility) $M$ is no older than the toolkit driver version of CUDA $X.Y$; or
  2.2. (Minor Version Compatibility) $X \ge 11$ and $M$ is no older than the toolkit driver version of CUDA $X.0$.
Generally, validating the above conditions should help whenever you run into any compatibility problem.
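For condition 1, a quick sanity check is to compare the UMD and KMD versions on the host. The sketch below assumes the usual Ubuntu paths (/usr/lib/x86_64-linux-gnu/libcuda.so.1 and /proc/driver/nvidia/version) and a simple version-number format, so treat it as a starting point rather than a definitive check:
#!/usr/bin/env bash
# Minimal sketch: check whether the user-mode and kernel-mode driver versions match.
umd="$(basename "$(readlink -f /usr/lib/x86_64-linux-gnu/libcuda.so.1)")"  # e.g. libcuda.so.460.67
umd="${umd#libcuda.so.}"
kmd="$(grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' /proc/driver/nvidia/version | head -n1)"
if [ "$umd" = "$kmd" ]; then
    echo "UMD $umd matches KMD $kmd: condition 1.2 holds"
else
    echo "UMD $umd differs from KMD $kmd: forward compatibility (condition 1.1) would be required"
fi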
So, what's wrong with the docker image bad? With the above rules in hand, we can perform a simple analysis.
Could it be a toolkit-driver incompatibility? Probably not. According to Table 1 here, minor version compatibility applies with CUDA 11.x and driver >= 450.80.02, which our driver version 460 satisfies; besides, a binary compiled by CUDA 11.7 works like a charm in the case of the docker image good.
It should be a KMD-UMD incompatibility, namely, the versions of libcuda.so and nvidia.ko are incompatible. Since forward compatibility is not applicable to an RTX 3090, we expect condition 1.2 to hold, i.e., libcuda.so and nvidia.ko should have the same version. This obviously was not the case.
A process in a container is technically just a special process on the host, and it interacts with the GPU drivers through the same model as any other process. Since the KMD runs in the kernel and is not affected by user space, all programs, whether on the host or in containers, communicate with the same KMD.
By contrast, a program can flexibly choose which user-mode driver to link against. It can either link to the UMD installed along with the KMD on the host, or bring its own UMD bundled at packaging and distribution time.
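Two quick checks make this concrete, assuming the good image and device 0 from earlier. The first pair of commands should print the same KMD version from the host and from inside a container; the last one asks the dynamic linker which libcuda.so a program in the container would resolve:
$ cat /proc/driver/nvidia/version                 # KMD version, seen from the host
$ docker run --gpus='"device=0"' --rm --entrypoint= good \
    cat /proc/driver/nvidia/version               # the very same KMD, seen from a container
$ docker run --gpus='"device=0"' --rm --entrypoint= good \
    sh -c 'ldconfig -p | grep libcuda'            # the UMD the linker would pick inside the container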
We can list all the UMDs in a running good container with the command:
$ docker run --gpus='"device=0"' --rm -it --entrypoint= good bash
root@3a19f802a459:/workspace# find / -name 'libcuda.so*' -exec bash -c "echo {} -\> \`readlink {}\`" \; 2>/dev/null
/usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.460.67
/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
/usr/lib/x86_64-linux-gnu/libcuda.so.460.67 ->
Looks like there is only one copy of libcuda.so, which lies in /usr/lib/x86_64-linux-gnu/ with version 460.67. However, this libcuda.so was not packed with the docker image in the first place. The library disappears if you omit the --gpus argument:
$ docker run --rm -it --entrypoint= good bash
root@3a19f802a459:/workspace# find / -name 'libcuda.so*' -exec bash -c "echo {} -\> \`readlink {}\`" \; 2>/dev/null
root@3a19f802a459:/workspace#
In fact, the library exists on the host and is injected into the container by the docker runtime during startup. This post demonstrates the injection process by inspecting docker's log. Mounting libcuda.so from the host best ensures that the UMD stays aligned with the KMD.
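If the libnvidia-container CLI is installed on the host, you can also peek at which driver files the runtime intends to inject; libcuda.so should appear among them (the exact list depends on your driver installation):
$ nvidia-container-cli list | grep libcuda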
Now that the docker runtime picks the native UMD from the host, why did the image bad fail? We can likewise check the UMDs in a running bad container:
$ docker run --gpus='"device=0"' --rm -it --entrypoint= bad bash
root@15f9b3c915b8:/# find / -name 'libcuda.so*' -exec bash -c "echo {} -\> \`readlink {}\`" \; 2>/dev/null
/usr/lib/x86_64-linux-gnu/libcuda.so.465.19.01 ->
/usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.465.19.01
/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
/usr/lib/x86_64-linux-gnu/libcuda.so.460.67 ->
/usr/local/cuda-11.3/compat/libcuda.so.465.19.01 ->
/usr/local/cuda-11.3/compat/libcuda.so.1 -> libcuda.so.465.19.01
/usr/local/cuda-11.3/compat/libcuda.so -> libcuda.so.1
/usr/local/cuda-11.3/targets/x86_64-linux/lib/stubs/libcuda.so ->
OOPS!!! Looks like there is a big difference here. We can derive two observations from the result:
1. There is an extra libcuda.so bundled inside the image at /usr/local/cuda-11.3/compat/libcuda.so.465.19.01, with a higher version, 465.19.01.
2. Both the injected libcuda.so.460.67 and the bundled libcuda.so.465.19.01 are symlinked under /usr/lib/x86_64-linux-gnu/, and most importantly, it is the bundled one that gets linked as libcuda.so and chosen by the program.
And that is the reason why the docker image bad violates KMD-UMD compatibility!
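You can confirm which UMD a process in the bad container would actually load with a one-liner like the following; based on the listing above, it should resolve to the bundled 465.19.01 library, while our host KMD stays at 460.67:
$ docker run --gpus='"device=0"' --rm --entrypoint= bad \
    readlink -f /usr/lib/x86_64-linux-gnu/libcuda.so.1
# expected: /usr/lib/x86_64-linux-gnu/libcuda.so.465.19.01, whereas nvidia.ko on the host is 460.67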
Such misbehavior is a consequence of a bug in libnvidia-container. But before we talk about it, let's take a step back and see what the directory /usr/local/cuda-X/compat does and why it should exist.
The compat directory is part of the CUDA compat package, which, according to the official docs, exists to support forward compatibility 7. The official base image nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04 has this package built in; it contains a higher-versioned UMD, libcuda.so.465.19.01, in case an older KMD is running on the host. As mentioned before, forward compatibility imposes requirements on the underlying hardware. When the requirements are not satisfied, as with our RTX 3090 GPUs, the libcuda.so from the compat package should hopefully not be linked against.
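On a Debian/Ubuntu based image you can check whether the compat package is present and what it ships; the package name cuda-compat-11-3 is the one removed in the workaround later in this post, and since it is baked into the image itself, no --gpus flag is needed to see it:
$ docker run --rm --entrypoint= bad dpkg -l | grep cuda-compat          # is the compat package installed?
$ docker run --rm --entrypoint= bad ls /usr/local/cuda-11.3/compat/     # the bundled higher-version UMD lives here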
Unfortunately, the current release of nvidia-docker bluntly attempts to apply forward compatibility, regardless of whether the GPU meets the hardware requirement.
The problem was encountered and studied by Gemfield, who posted the article PyTorch 的 CUDA 错误:Error 804: forward compatibility was attempted on non supported HW as an explanation. Gemfield observed that nvidia-docker symlinks both the native UMD from the host and the compat UMD from the docker image under /usr/lib/x86_64-linux-gnu/, and bluntly chooses the one with the higher version as libcuda.so.1, against which user programs link.
Obviously this behavior is in line with neither forward compatibility nor minor version compatibility. Gemfield opened the issue NVIDIA/nvidia-docker#1515 for discussion, where the author guessed it was a bug in libnvidia-container, and another issue, NVIDIA/libnvidia-container#138, was referenced. Neither issue has been resolved so far.
The workaround is simple: if there is no compat package, the compat UMD won't be applied. We can either remove the compat package or simply delete the /usr/local/cuda-X/compat directory to make it work:
FROM nvidia/cuda:11.3.0-cudnn8-devel-ubuntu20.04
RUN apt update -y && apt install -y python3 python3-pip
RUN pip install torch==1.12.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
RUN apt purge cuda-compat-11-3 -y
# OR
RUN rm -rfv /usr/local/cuda-11.3/compat/
ENTRYPOINT ["python", "-c", "import torch; print(torch.rand(2, 3).cuda())"]
$ make bad
tensor([[0.0059, 0.6425, 0.2299],
[0.2306, 0.5954, 0.0226]], device='cuda:0')
This article elaborates the cause of and workaround for CUDA Error 804 when NVIDIA GPUs work with docker. As background, I introduced the constitution of CUDA, the various categories of CUDA compatibility policies, and how the docker runtime deals with the GPU driver. The culprit turned out to be a bug, or deficiency, of libnvidia-container, which mishandles forward compatibility and minor version compatibility and has not yet been fixed. As a workaround, one can remove the CUDA compat package inside the image to prevent forward compatibility from being applied and let minor version compatibility take effect.
Author: hsfzxjy.
Link: .
License: CC BY-NC-ND 4.0.
All rights reserved by the author.
Commercial use of this post in any form is NOT permitted.
Non-commercial use of this post should be attributed with this block of text.