Container NVIDIA GPU
This document analyzes techniques for allocating NVIDIA GPUs to containers.
1. Container NVIDIA GPU
1.1. NVIDIA GPU Container Architecture
![[Figure 1] NVIDIA GPU Container Architecture](/blog-software/docs/theory-analysis/container-nvidia-gpu/images/gpu-container-legacy-mode-architecture.png)
[Figure 1] NVIDIA GPU Container Architecture
NVIDIA GPUs can be allocated to containers so that containerized applications can use them. Docker version 19.03 introduced the GPU option (`--gpus`), which allocates NVIDIA GPUs to containers. [Figure 1] shows containers configured with NVIDIA GPUs through this option. Each container can use not just one GPU but multiple GPUs, and a container can use not only a dedicated GPU allocated exclusively to it but also a shared GPU shared with other containers.
Container A uses NVIDIA GPUs 0 and 1, Container B uses NVIDIA GPUs 0, 2, and 3, and Container C uses NVIDIA GPU 3. GPUs 0, 1, and 3 are used as shared GPUs, while GPU 2 is used as a dedicated GPU. Each container can be created with the following Docker commands; the `--gpus` option accepts a single GPU index or a comma-separated list of indices via `device=`.
- Container A : `docker run --gpus '"device=0,1"' --name a nvidia/cuda:12.4-base-ubuntu22.04`
- Container B : `docker run --gpus '"device=0,2,3"' --name b nvidia/cuda:12.4-base-ubuntu22.04`
- Container C : `docker run --gpus device=3 --name c nvidia/cuda:12.4-base-ubuntu22.04`
When `--gpus all` is specified, the container can use all NVIDIA GPUs. Shared GPUs rely on the Time-Slicing or MPS (Multi-Process Service) features provided by NVIDIA GPUs, which allow multiple containers to share a single NVIDIA GPU.
To allocate GPUs to containers, the NVIDIA Container Toolkit must be used. It provides the `nvidia-container-runtime`, `nvidia-container-runtime-hook`, and `nvidia-container-cli` CLIs, through which the container runtime allocates GPUs to containers. The NVIDIA Container Toolkit performs the following roles:
- Injects device files (`/dev/nvidiaX`, `/dev/nvidia-uvm`, `/dev/nvidia-uvm-tools`, `/dev/nvidiactl`) into the container through bind mounts so that applications inside the container can access the GPUs allocated to it.
- Injects CUDA libraries/tools into the container through bind mounts so that applications inside the container can utilize the allocated GPUs.
- Configures cgroups so that the GPUs allocated to the container can be accessed from inside the container.
- Sets the `NVIDIA_VISIBLE_DEVICES` environment variable so that applications and CUDA libraries/tools inside the container can recognize the allocated GPUs.
1.2. NVIDIA GPU Allocation Process
To allocate GPUs to containers, containerd must be configured to execute the `nvidia-container-runtime` CLI instead of the `runc` CLI. The `nvidia-container-runtime` CLI injects additional GPU-allocation settings into the OCI Runtime Spec created by containerd and then executes the `runc` CLI; that is, it sits between containerd and `runc` and modifies the OCI Runtime Spec for GPU allocation. After modifying the spec and executing `runc`, the `nvidia-container-runtime` CLI terminates.
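Conceptually, the spec modification described above boils down to appending a hook entry to the OCI Runtime Spec. The following is a minimal Python sketch of that idea (the real implementation is written in Go, and the paths shown are typical defaults, not guaranteed):

```python
import json

def inject_prestart_hook(spec: dict) -> dict:
    """Append a nvidia-container-runtime-hook entry to the OCI spec's
    prestart hooks -- a simplified sketch of what nvidia-container-runtime
    does in Legacy Mode."""
    prestart = spec.setdefault("hooks", {}).setdefault("prestart", [])
    prestart.append({
        "path": "/usr/bin/nvidia-container-runtime-hook",
        "args": ["nvidia-container-runtime-hook", "prestart"],
    })
    return spec

# A minimal stand-in for the OCI Runtime Spec (config.json)
spec = {"ociVersion": "1.0.2", "process": {"args": ["/bin/sh"]}}
inject_prestart_hook(spec)
print(json.dumps(spec["hooks"], indent=2))
```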
[File 1] shows how to configure containerd to execute the `nvidia-container-runtime` CLI. The `default_runtime_name` parameter is set to `nvidia`, and the spec file required for the nvidia runtime and the path to the `nvidia-container-runtime` CLI are configured.
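As a sketch of the kind of containerd configuration this describes (plugin names and the binary path follow common defaults and may differ per installation):

```toml
# /etc/containerd/config.toml (abbreviated sketch)
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```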
The `nvidia-container-runtime` CLI has two modes: Legacy Mode, which uses the OCI Runtime Spec's Prestart Hook feature, and CDI Mode, which uses CDI (Container Device Interface). [File 2] shows the nvidia-container-runtime configuration file that sets the mode. `mode` can be set to one of `legacy`, `cdi`, or `auto`: `legacy` is the older Prestart Hook based method, `cdi` is the newer CDI-based method, and `auto` selects between them based on the system configuration; if CDI spec files exist in `spec-dirs`, CDI Mode is used, otherwise Legacy Mode.
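A sketch of what such a mode configuration can look like (the file path and `spec-dirs` values shown are common defaults, not guaranteed):

```toml
# /etc/nvidia-container-runtime/config.toml (abbreviated sketch)
[nvidia-container-runtime]
mode = "auto"   # one of "auto", "legacy", "cdi"

  [nvidia-container-runtime.modes.cdi]
  spec-dirs = ["/etc/cdi", "/var/run/cdi"]
```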
1.2.1. GPU Allocation Process in Legacy Mode
![[Figure 3] NVIDIA GPU Container Init in Legacy Mode](/blog-software/docs/theory-analysis/container-nvidia-gpu/images/gpu-container-init-legacy.png)
[Figure 3] NVIDIA GPU Container Init in Legacy Mode
The architecture shown in [Figure 1] is in fact that of Legacy Mode. [Figure 3] shows the GPU allocation process in Legacy Mode, which makes heavy use of the OCI Runtime Spec's Prestart Hook feature. A Prestart Hook is a command that runs before the container's entrypoint command is executed. The process in [Figure 3] is as follows:
- The `ctr` CLI or `dockerd` passes a container creation request to `containerd`, with the GPUs to allocate to the container specified in the `NVIDIA_VISIBLE_DEVICES` environment variable. `containerd` creates an OCI Runtime Config file according to the request.
- Subsequently, `containerd` executes the `nvidia-container-runtime` CLI according to `containerd`'s config.
- The `nvidia-container-runtime` CLI adds settings to the Prestart Hook of the OCI Runtime Config file to execute the `nvidia-container-runtime-hook` CLI.
- Subsequently, the `nvidia-container-runtime` CLI executes the `runc` CLI to create the container.
- The `runc` CLI creates a Container Init Process through the `clone()` system call to set up namespaces for the container according to the OCI Runtime Spec, and prepares the root filesystem using OverlayFS.
- The Container Init Process created through the `clone()` system call requests that the `runc` CLI execute the Prestart Hook, via a FIFO named pipe.
- The `runc` CLI executes the `nvidia-container-runtime-hook` CLI according to the OCI Runtime Spec.
- The `nvidia-container-runtime-hook` CLI, based on the OCI Runtime Spec file, passes the container information and the NVIDIA GPU information to configure in the container (the `NVIDIA_VISIBLE_DEVICES` environment variable) as parameters to the `nvidia-container-cli` CLI and executes it. The `nvidia-container-cli` CLI is the entity that actually configures the container's NVIDIA GPUs; the `nvidia-container-runtime-hook` CLI only acts as an interface between the OCI Runtime Spec and the `nvidia-container-cli` CLI.
- The `nvidia-container-cli` CLI, via the internal `libnvidia-container` library, injects device files and CUDA libraries/tools into the container through bind mounts based on the received container and GPU information. It also configures cgroups to allow GPU access from inside the container and, if necessary, loads kernel modules required for NVIDIA GPU operation.
- When the Prestart Hook work is completed, the `runc` CLI sends a completion response to the Container Init Process through the FIFO named pipe.
- On receiving the Prestart Hook completion, the Container Init Process switches to the container's actual root filesystem through the `pivot_root()` system call and executes the container's entrypoint command through the `exec()` system call.
[File 3] shows an example of the OCI Runtime Spec. When containerd receives a container creation request from the `ctr` CLI or `dockerd`, the request carries the `NVIDIA_VISIBLE_DEVICES` environment variable, so containerd creates the OCI Runtime Spec with `NVIDIA_VISIBLE_DEVICES` added to its environment variables. Settings to execute the `nvidia-container-runtime-hook` CLI are then added to the Prestart Hook of the OCI Runtime Spec by `nvidia-container-runtime`.
The `nvidia-container-runtime` CLI adds settings to the Prestart Hook of the OCI Runtime Spec to execute the `nvidia-container-runtime-hook` CLI. The argument for the `nvidia-container-runtime-hook` CLI is fixed to `prestart`, and NVIDIA GPU and CUDA-related environment variables are added to the container's environment variables. For example, the `NVIDIA_VISIBLE_DEVICES` environment variable specifies which NVIDIA GPUs are exposed to the container. In [Figure 2], since Docker was configured to use all NVIDIA GPUs with `--gpus all`, the `NVIDIA_VISIBLE_DEVICES` environment variable is also set to `all`.
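The resulting spec fragment looks roughly like the following (an abbreviated sketch, not a complete OCI Runtime Spec):

```json
{
  "process": {
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "NVIDIA_VISIBLE_DEVICES=all"
    ]
  },
  "hooks": {
    "prestart": [
      {
        "path": "/usr/bin/nvidia-container-runtime-hook",
        "args": ["nvidia-container-runtime-hook", "prestart"]
      }
    ]
  }
}
```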
# Check bind mount
$ mount
...
tmpfs on /proc/driver/nvidia type tmpfs (rw,nosuid,nodev,noexec,relatime,seclabel,mode=555)
/dev/nvme0n1p1 on /usr/bin/nvidia-smi type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/bin/nvidia-debugdump type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/bin/nvidia-persistenced type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/bin/nv-fabricmanager type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/bin/nvidia-cuda-mps-control type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/bin/nvidia-cuda-mps-server type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-ml.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-cfg.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libcuda.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libcudadebugger.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-opencl.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-gpucomp.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-ptxjitcompiler.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-allocator.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-pkcs11.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-pkcs11-openssl3.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-nvvm.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /lib/firmware/nvidia/580.126.09/gsp_ga10x.bin type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /lib/firmware/nvidia/580.126.09/gsp_tu10x.bin type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
tmpfs on /run/nvidia-persistenced/socket type tmpfs (rw,nosuid,nodev,noexec,seclabel,size=38119072k,nr_inodes=819200,mode=755)
devtmpfs on /dev/nvidiactl type devtmpfs (ro,nosuid,noexec,seclabel,size=4096k,nr_inodes=23820204,mode=755)
devtmpfs on /dev/nvidia-uvm type devtmpfs (ro,nosuid,noexec,seclabel,size=4096k,nr_inodes=23820204,mode=755)
devtmpfs on /dev/nvidia-uvm-tools type devtmpfs (ro,nosuid,noexec,seclabel,size=4096k,nr_inodes=23820204,mode=755)
devtmpfs on /dev/nvidia2 type devtmpfs (ro,nosuid,noexec,seclabel,size=4096k,nr_inodes=23820204,mode=755)
proc on /proc/driver/nvidia/gpus/0000:3c:00.0 type proc (ro,nosuid,nodev,noexec,relatime)
devtmpfs on /dev/nvidia3 type devtmpfs (ro,nosuid,noexec,seclabel,size=4096k,nr_inodes=23820204,mode=755)
proc on /proc/driver/nvidia/gpus/0000:3e:00.0 type proc (ro,nosuid,nodev,noexec,relatime)
devtmpfs on /dev/nvidia1 type devtmpfs (ro,nosuid,noexec,seclabel,size=4096k,nr_inodes=23820204,mode=755)
proc on /proc/driver/nvidia/gpus/0000:3a:00.0 type proc (ro,nosuid,nodev,noexec,relatime)
devtmpfs on /dev/nvidia0 type devtmpfs (ro,nosuid,noexec,seclabel,size=4096k,nr_inodes=23820204,mode=755)
proc on /proc/driver/nvidia/gpus/0000:38:00.0 type proc (ro,nosuid,nodev,noexec,relatime)
...
# Check device files
$ ls -l /dev
...
crw-rw-rw- 1 root root 510, 0 Feb 11 15:41 nvidia-uvm
crw-rw-rw- 1 root root 510, 1 Feb 11 15:41 nvidia-uvm-tools
crw-rw-rw- 1 root root 195, 0 Feb 11 15:41 nvidia0
crw-rw-rw- 1 root root 195, 1 Feb 11 15:41 nvidia1
crw-rw-rw- 1 root root 195, 2 Feb 11 15:41 nvidia2
crw-rw-rw- 1 root root 195, 3 Feb 11 15:41 nvidia3
crw-rw-rw- 1 root root 195, 255 Feb 11 15:41 nvidiactl
...
# Check environment variable
$ printenv | grep NVIDIA
NVIDIA_VISIBLE_DEVICES=all
...

# Check cgroup device list
$ cat /sys/fs/cgroup/[container_id]/devices/devices.list
...
c 195:255 rw # /dev/nvidiactl
c 244:0 rw # /dev/nvidia-uvm
c 244:1 rw # /dev/nvidia-uvm-tools
c 195:3 rw # /dev/nvidia3
c 195:2 rw # /dev/nvidia2
c 195:1 rw # /dev/nvidia1
c 195:0 rw # /dev/nvidia0
...

[Shell 1] shows an example of checking bind mounts, device files, and environment variables inside a container. The `mount` command shows that GPU device files and CUDA libraries/tools are injected into the container through bind mounts, and the `printenv` command shows that the `NVIDIA_VISIBLE_DEVICES` environment variable is injected into the container application. [Shell 2] shows an example of the device list in the cgroup that allows GPU access from inside the container.
1.2.2. GPU Allocation Process in CDI Mode
![[Figure 4] NVIDIA GPU Container CDI Mode Architecture](/blog-software/docs/theory-analysis/container-nvidia-gpu/images/gpu-container-cdi-mode-architecture.png)
[Figure 4] NVIDIA GPU Container CDI Mode Architecture
$ mount
...
tmpfs on /run/nvidia-persistenced/socket type tmpfs (rw,nosuid,nodev,noexec,seclabel,size=3146936k,nr_inodes=819200,mode=755)
/dev/nvme0n1p1 on /usr/bin/nvidia-cuda-mps-control type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/bin/nvidia-cuda-mps-server type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/bin/nvidia-debugdump type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/bin/nvidia-imex type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/bin/nvidia-imex-ctl type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/bin/nvidia-persistenced type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/bin/nvidia-smi type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libEGL_nvidia.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libGLESv1_CM_nvidia.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libGLESv2_nvidia.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libGLX_nvidia.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libcuda.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libcudadebugger.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvcuvid.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-allocator.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-cfg.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-egl-gbm.so.1.1.3 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-egl-wayland.so.1.1.20 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-egl-wayland.so.1.1.21 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-eglcore.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-encode.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-fbc.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-glcore.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-glsi.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-glvkspirv.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-gpucomp.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-gtk2.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-gtk3.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-ml.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-ngx.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-nvvm.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-opencl.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-opticalflow.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-pkcs11-openssl3.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-pkcs11.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-present.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-ptxjitcompiler.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-rtcore.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-sandboxutils.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-tls.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-vksc-core.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvidia-wayland-client.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/libnvoptix.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /etc/X11/xorg.conf.d/10-nvidia.conf type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /etc/vulkan/icd.d/nvidia_icd.json type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /etc/vulkan/icd.d/nvidia_icd.x86_64.json type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /etc/vulkan/implicit_layer.d/nvidia_layers.json type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/vdpau/libvdpau_nvidia.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/share/nvidia/nvoptix.bin type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
tmpfs on /run/secrets/kubernetes.io/serviceaccount type tmpfs (ro,relatime,seclabel,size=14717844k,noswap)
/dev/nvme0n1p1 on /lib/firmware/nvidia/580.126.09/gsp_ga10x.bin type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /lib/firmware/nvidia/580.126.09/gsp_tu10x.bin type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/share/glvnd/egl_vendor.d/10_nvidia.json type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/xorg/modules/drivers/nvidia_drv.so type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
/dev/nvme0n1p1 on /usr/lib64/xorg/modules/extensions/libglxserver_nvidia.so.580.126.09 type xfs (ro,nosuid,nodev,noatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,sunit=1024,swidth=1024,noquota)
tmpfs on /run/nvidia-ctk-hook358549ab-23fa-497c-a9e4-9b9c0a2250ff type tmpfs (rw,relatime,seclabel,size=4k)
tmpfs on /proc/driver/nvidia/params type tmpfs (rw,relatime,seclabel,size=4k)
...
CDI (Container Device Interface) is an interface for spec files to allocate special devices such as GPUs and NPUs to containers. [File 4] shows an example of a CDI-compliant spec file for NVIDIA GPUs. CDI spec files are written in YAML format, and you can see that they include device files and bind mount information for CUDA libraries/tools to allocate GPUs to containers. NVIDIA CDI spec files can be generated using the `nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml` command.
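An abbreviated sketch of what such a generated spec can look like (device names, library paths, and the `cdiVersion` value vary by driver version and host):

```yaml
# /etc/cdi/nvidia.yaml (abbreviated sketch)
cdiVersion: "0.6.0"
kind: nvidia.com/gpu
devices:
  - name: "0"
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0
containerEdits:
  deviceNodes:
    - path: /dev/nvidiactl
    - path: /dev/nvidia-uvm
  mounts:
    - hostPath: /usr/lib64/libcuda.so.580.126.09
      containerPath: /usr/lib64/libcuda.so.580.126.09
      options: ["ro", "nosuid", "nodev", "bind"]
```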
[Figure 4] shows the architecture of CDI Mode. The biggest difference from Legacy Mode is that CDI Mode does not inject device files through bind mounts; instead, the `runc` CLI creates the device files directly. Accordingly, in the mount information of [Shell 3] you can see that only CUDA libraries/tools are bind mounted, and device files are not.
![[Figure 5] NVIDIA GPU Container Init in CDI Mode](/blog-software/docs/theory-analysis/container-nvidia-gpu/images/gpu-container-init-cdi.png)
[Figure 5] NVIDIA GPU Container Init in CDI Mode
[Figure 5] shows the GPU allocation process in CDI Mode. When the `nvidia-container-runtime` CLI operates in CDI Mode, it reads the CDI spec files to obtain GPU information and, according to the `NVIDIA_VISIBLE_DEVICES` environment variable, injects the GPU device files and the bind mount information for CUDA libraries/tools into the OCI Runtime Spec before the container is created. Since the `nvidia-container-runtime` CLI modifies the OCI Runtime Spec for GPUs directly, containers can be created without Prestart Hooks.
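The CDI-based spec editing can be sketched as merging a device's `containerEdits` into the OCI Runtime Spec. The following is a hypothetical simplification in Python (the real logic lives in the Go CDI library used by `nvidia-container-runtime`):

```python
def apply_container_edits(spec: dict, edits: dict) -> dict:
    """Merge CDI containerEdits (device nodes, mounts, env) into an OCI
    Runtime Spec dict -- a simplified sketch of CDI Mode."""
    for dev in edits.get("deviceNodes", []):
        spec.setdefault("linux", {}).setdefault("devices", []).append(
            {"path": dev["path"], "type": "c"})
    for m in edits.get("mounts", []):
        spec.setdefault("mounts", []).append({
            "source": m["hostPath"],
            "destination": m["containerPath"],
            "options": m.get("options", ["ro", "bind"]),
        })
    spec.setdefault("process", {}).setdefault("env", []).extend(
        edits.get("env", []))
    return spec

# Hypothetical containerEdits for one GPU plus a CUDA library mount
edits = {
    "deviceNodes": [{"path": "/dev/nvidia0"}],
    "mounts": [{"hostPath": "/usr/lib64/libcuda.so",
                "containerPath": "/usr/lib64/libcuda.so"}],
}
spec = apply_container_edits({"ociVersion": "1.0.2"}, edits)
print(spec["linux"]["devices"])
```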
[File 6] shows an example of the OCI Runtime Spec with the settings injected by the `nvidia-container-runtime` CLI in CDI Mode to allocate GPUs. You can see that GPU device files and bind mount information for CUDA libraries/tools have been injected.
2. References
- https://devblogs.nvidia.com/gpu-containers-runtime/
- https://gitlab.com/nvidia/container-toolkit/toolkit
- https://gitlab.com/nvidia/container-toolkit/libnvidia-container/
- https://github.com/opencontainers/runtime-spec/blob/master/config.md#posix-platform-hooks
- https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overview.pdf