Today a user of our GPU cluster ran into a problem where executing python -c 'import torch; torch.tensor(1).cuda()' would hang forever and could not be killed. The problem occurred on a rather old Docker image (with torch == 0.4.0) and would disappear if newer images were used. It was caused by a coincidence of far lesser-known factors, which surprised me and which I want to share in this post.
The Problem
The hanging program was spawned by the following command:
/usr/bin/docker run --rm -u 1457:1457 \
--gpus '"device='0,1,2,3'"' \
-v /ghome/username:/ghome/username -v /gdata/username:/gdata/username \
-it --ipc=host --shm-size 64G \
-v /gdata1/username:/gdata1/username -v /gdata2/username:/gdata2/username \
-e HOME=/ghome/username \
-m 48G --memory-swap 48G --cpus 5 \
--name username2 \
bit:5000/deepo_9 \
python3 -c 'import torch; torch.tensor(1).cuda()'
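(Side note: when a process refuses to die like this, I find it useful to check its state and kernel stack from the host before guessing at causes. The PID below is made up for illustration; a process blocked inside the kernel shows up in state D, which is why SIGKILL has no effect until the stuck kernel call returns.)

# Hypothetical PID of the hanging python3 process.
ps -o pid,stat,wchan:32,cmd -p 12345
# If the state is "D", the kernel stack (root only) hints at which call it is blocked in:
sudo cat /proc/12345/stack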
The Docker image bit:5000/deepo_9 he used was built with CUDA 9, while the host has multiple GTX 1080 Ti cards and had its CUDA version upgraded to 11.4. This looks like some binary incompatibility, considering that the problem goes away with newer images.
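To make the mismatch concrete, one can compare the CUDA toolkit the container's PyTorch was built against with the driver on the host. The checks below are only a sketch of what I would run, not output captured from the machine (torch.version.cuda should exist even in releases as old as 0.4, if I remember correctly):

# Inside the container: PyTorch version and the CUDA toolkit it was compiled with.
python3 -c 'import torch; print(torch.__version__, torch.version.cuda)'
# On the host: the installed driver version and GPU model.
nvidia-smi --query-gpu=driver_version,name --format=csv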