
I'm currently working with a K3s cluster and trying to set up GPU support using the NVIDIA device plugin. However, I'm encountering an issue where the plugin logs show:

E1004 13:11:10.866124       1 factory.go:88] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E1004 13:11:10.866139       1 main.go:346] No devices found. Waiting indefinitely.

Environment:

  • K3s Version: (latest)
  • NVIDIA Driver Version: 560.31.01
  • NVIDIA Container Toolkit Version: (latest)
  • GPU Model: NVIDIA GeForce GTX 1660
  • Operating System: (WSL-Ubuntu 24.04 and Ubuntu 24.04)
  • Nodes: Single Master/Worker Node

Steps Taken: Verified GPU detection using nvidia-smi, which correctly shows the GPU:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.31.01              Driver Version: 560.81         CUDA Version: 12.6     |
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX 1660 ...    On  |   00000000:2B:00.0  On |                  N/A |
| 31%   50C    P0             37W /  125W |     642MiB /   6144MiB |      0%      Default |
+-----------------------------------------------------------------------------------------+

Confirmed the installation of the NVIDIA Container Toolkit:

dpkg -l | grep nvidia-container-runtime
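
In case it matters, these are additional commands I could use to probe the toolkit directly (output omitted; I'm assuming nvidia-ctk and nvidia-container-cli are the right entry points here):

nvidia-ctk --version
nvidia-container-runtime --version
nvidia-container-cli info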

Reviewed the containerd config.toml at /var/lib/rancher/k3s/agent/etc/containerd/config.toml, which includes:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options] BinaryName = "/usr/bin/nvidia-container-runtime" SystemdCgroup = true

Added the node label:

kubectl label node <node_name> nvidia.com/gpu.present=true
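
The label itself can be double-checked with something like:

kubectl get nodes -L nvidia.com/gpu.present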

Restarted K3s service to apply changes:

sudo systemctl restart k3s

Checked logs for K3s:

sudo journalctl -u k3s -f
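
Mostly I was looking for whether K3s detects the NVIDIA container runtime at startup; filtering makes that easier:

sudo journalctl -u k3s | grep -i nvidia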

Checked the node for GPU availability:

kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu
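
The node's Capacity and Allocatable sections can also be cross-checked directly:

kubectl describe node <node_name>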

Checked logs for the NVIDIA device plugin DaemonSet:

kubectl logs -l app=nvidia-device-plugin-daemonset -n kube-system
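
The status of the plugin pods can be checked with:

kubectl get pods -n kube-system -l app=nvidia-device-plugin-daemonset -o wide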

Question: What steps should I take to resolve the issue of the NVIDIA device plugin not detecting GPUs in my K3s cluster? Are there any additional configurations I might be missing to ensure that the GPUs are recognized and usable by my pods?
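
For reference, once the nvidia.com/gpu resource shows up, this is the kind of test pod I plan to use to confirm GPU access from a container (a sketch; the image tag and runtimeClassName are assumptions on my part):

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  runtimeClassName: nvidia   # assumes a RuntimeClass named "nvidia" exists
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative tag, adjust as needed
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF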
