Several users reported encountering "Error 804: forward compatibility was attempted on non supported HW" while using some customized PyTorch Docker images on our GPU cluster.
At first glance I took the culprit to be a version mismatch between the driver installed on the host and the driver required by the image. The failing images, as the users described them, were built targeting CUDA == 11.3 with a corresponding driver version == 465, while some of our hosts ship with driver version 460. As a fix I told them to target a lower CUDA version by choosing a base image such as nvidia/cuda:11.2.0-devel-ubuntu18.04, which did solve the problem.
Later on, however, I began to doubt that this hypothesis was the real cause. An observed counterexample was that another line of Docker images targeting an even higher CUDA version ran normally on those same hosts, for example the latest ghcr.io/pytorch/pytorch:2.0.0-devel, built for CUDA == 11.7. That would not be the case if the CUDA version mismatch truly mattered.
Afterwards I did a bit of research on the problem and learnt some interesting things, which this post shares. In short, the recently released minor version compatibility allows applications built for a newer CUDA to run on machines with somewhat older drivers, but libnvidia-container doesn’t handle it correctly due to a bug, which eventually leads to this error.
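To make the difference between my first hypothesis and minor version compatibility concrete, here is a minimal Python sketch. The helper names are hypothetical and not part of any NVIDIA tool. The driver branches 460 and 465 come from the cases above, while 515 for CUDA 11.7 and 450 as the minimum branch for the CUDA 11.x family are my reading of NVIDIA's compatibility tables, so treat them as assumptions to verify against the official documentation.

```python
# Hypothetical helpers for illustration only; not an NVIDIA API.

# Driver branch (major number) associated with each toolkit release.
# 460 and 465 are the values discussed in this post; 515 is an assumption.
BUNDLED_DRIVER = {"11.2": 460, "11.3": 465, "11.7": 515}

# Assumed minimum driver branch for minor version compatibility,
# keyed by CUDA major version.
MIN_DRIVER_FOR_MAJOR = {11: 450}


def strict_match(cuda: str, host_driver: int) -> bool:
    """First hypothesis: the host driver must be at least the toolkit's own."""
    return host_driver >= BUNDLED_DRIVER[cuda]


def minor_version_compatible(cuda: str, host_driver: int) -> bool:
    """Minor version compatibility: only the CUDA major version matters."""
    major = int(cuda.split(".")[0])
    return host_driver >= MIN_DRIVER_FOR_MAJOR[major]


if __name__ == "__main__":
    host_driver = 460  # the older hosts in our cluster
    for cuda in ("11.2", "11.3", "11.7"):
        print(cuda,
              strict_match(cuda, host_driver),
              minor_version_compatible(cuda, host_driver))
```

Under the strict rule, both the CUDA 11.3 and the CUDA 11.7 images should fail on a 460 host. Under minor version compatibility, both should work. Neither rule alone explains why one image fails while the other runs, which is what points to a bug elsewhere in the stack.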
For a thorough understanding, this post first introduces the composition of the CUDA components, then the compatibility policy between those components, and finally unravels the bug and devises a workaround for it. But before diving deep, I’ll give two Dockerfile samples to illustrate the problem.