I'm currently working with a K3s cluster and trying to set up GPU support using the NVIDIA device plugin. However, the device plugin pod's logs show:
E1004 13:11:10.866124 1 factory.go:88] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E1004 13:11:10.866139 1 main.go:346] No devices found. Waiting indefinitely.
Environment:
- K3s Version: (latest)
- NVIDIA Driver Version: 560.31.01
- NVIDIA Container Toolkit Version: (latest)
- GPU Model: NVIDIA GeForce GTX 1660
- Operating System: (WSL-Ubuntu 24.04 and Ubuntu 24.04)
- Nodes: Single Master/Worker Node
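For the components listed as "latest", the exact builds can be read off with the following (I can add the output if it helps):
k3s --version
nvidia-ctk --version
dpkg -l | grep -E 'nvidia-container-toolkit|libnvidia-container'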
Steps Taken:
Verified GPU detection using nvidia-smi, which correctly shows the GPU:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.31.01 Driver Version: 560.81 CUDA Version: 12.6 |
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce GTX 1660 ... On | 00000000:2B:00.0 On | N/A |
| 31% 50C P0 37W / 125W | 642MiB / 6144MiB | 0% Default |
+-----------------------------------------------------------------------------------------+
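As an additional sanity check, the device nodes can be listed too; on the bare-metal Ubuntu node these would be /dev/nvidia*, while under WSL2 the GPU is exposed through /dev/dxg instead:
ls -l /dev/nvidia* 2>/dev/null
ls -l /dev/dxg 2>/dev/null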
Confirmed the installation of the NVIDIA Container Toolkit:
dpkg -l | grep nvidia-container-runtime
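Beyond the package listing, the runtime binary and the underlying CLI can also be checked directly (the BinaryName in the containerd config below points at the same path):
which nvidia-container-runtime
nvidia-container-cli info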
Reviewed the containerd config.toml at /var/lib/rancher/k3s/agent/etc/containerd/config.toml, which includes:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
BinaryName = "/usr/bin/nvidia-container-runtime"
SystemdCgroup = true
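As far as I understand, K3s regenerates this config.toml on every start (manual changes would go into a config.toml.tmpl next to it), so to confirm the nvidia handler is actually registered in the running containerd, I assume something like this should show the entry:
sudo k3s crictl info | grep -A 5 '"nvidia"'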
Added a node label:
kubectl label node <node_name> nvidia.com/gpu.present=true
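The label can be double-checked with:
kubectl get nodes -L nvidia.com/gpu.present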
Restarted K3s service to apply changes:
sudo systemctl restart k3s
Checked logs for K3s:
sudo journalctl -u k3s -f
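Filtering those logs for NVIDIA-related lines should show whether K3s picked up the runtime at startup, e.g.:
sudo journalctl -u k3s --no-pager | grep -i nvidia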
Checked the node for GPU availability:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
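An equivalent check via describe, looking at the Capacity and Allocatable sections:
kubectl describe node <node_name> | grep -A 8 -E 'Capacity|Allocatable'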
Checked logs for the NVIDIA device plugin DaemonSet:
kubectl logs -l app=nvidia-device-plugin-daemonset -n kube-system
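One thing I'm not sure about: my understanding is that the device plugin pod itself must run under the nvidia runtime (via runtimeClassName: nvidia or by making nvidia the default handler), otherwise it cannot see the GPU. Whether that is set can be checked with something along these lines (label selector matching my deployment):
kubectl get pods -n kube-system -l app=nvidia-device-plugin-daemonset \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.runtimeClassName}{"\n"}{end}'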
Question: What steps should I take to resolve the issue of the NVIDIA device plugin not detecting GPUs in my K3s cluster? Are there any additional configurations I might be missing to ensure that the GPUs are recognized and usable by my pods?
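For context, the end state I'm after is a simple smoke-test pod like the sketch below scheduling and successfully running nvidia-smi (the RuntimeClass, pod name, and image tag here are just assumptions on my side, not something already in the cluster):
kubectl apply -f - <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia              # references the "nvidia" runtime handler from config.toml above
handler: nvidia
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # image tag is a guess; any CUDA base image matching the 560 driver should do
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF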