I'm trying to install and upgrade the NVIDIA GPU drivers and related software on a Debian bullseye system and am having trouble. I followed the instructions for installing CUDA, but step 13.2.1, "Install Persistence Daemon", fails with this error:

nvidia-persistenced failed to initialize. Check syslog for more details.
The log file shows:
  Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.

There are no nvidia* files in /dev.
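For anyone wanting to reproduce that check, here is a minimal sketch. It counts the NVIDIA device nodes in /dev and looks for the nvidia kernel module in /proc/modules. The function name count_nvidia_nodes and its optional directory argument are my own, added only so the function can be tried on a scratch directory.

```shell
#!/bin/sh
# Count NVIDIA device nodes (nvidia0, nvidiactl, ...) in a directory.
# The directory argument is hypothetical convenience; it defaults to /dev.
count_nvidia_nodes() {
    dir="${1:-/dev}"
    count=0
    for node in "$dir"/nvidia*; do
        # An unmatched glob stays literal, so confirm the path exists.
        if [ -e "$node" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

echo "NVIDIA device nodes in /dev: $(count_nvidia_nodes /dev)"

# /proc/modules lists loaded kernel modules, one per line, name first.
if grep -q '^nvidia ' /proc/modules 2>/dev/null; then
    echo "nvidia kernel module: loaded"
else
    echo "nvidia kernel module: not loaded"
fi
```

A count of 0 together with "not loaded" would suggest the driver module itself is the problem rather than file permissions, though that is only a first guess.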

/usr/local/ has the following:

$ ls -dl /usr/local/cuda*
lrwxrwxrwx  1 root root   22 Sep 30 20:15 /usr/local/cuda -> /etc/alternatives/cuda
drwxr-xr-x 16 root root 4096 Jun 16 16:35 /usr/local/cuda-11.3
lrwxrwxrwx  1 root root   25 Sep 30 20:15 /usr/local/cuda-12 -> /etc/alternatives/cuda-12
drwxr-xr-x 15 root root 4096 Sep 30 20:15 /usr/local/cuda-12.2
$ ls -dl /etc/alternatives/cuda*
lrwxrwxrwx 1 root root 20 Sep 30 20:15 /etc/alternatives/cuda -> /usr/local/cuda-12.2
lrwxrwxrwx 1 root root 20 Sep 30 20:15 /etc/alternatives/cuda-12 -> /usr/local/cuda-12.2
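The two-hop symlink chain above can be resolved in one step. This is a rough sketch using readlink -f from GNU coreutils; the helper names resolve_cuda and cuda_dir_version are mine, not part of any NVIDIA tooling.

```shell
#!/bin/sh
# Follow every symlink hop and print the final target path.
resolve_cuda() {
    readlink -f "${1:-/usr/local/cuda}"
}

# Print the version suffix of a versioned toolkit directory,
# e.g. /usr/local/cuda-12.2 gives 12.2.
cuda_dir_version() {
    basename "$1" | sed 's/^cuda-//'
}

# Fall back to a placeholder if the path cannot be resolved at all.
target=$(resolve_cuda /usr/local/cuda) || target="(not found)"
echo "/usr/local/cuda resolves to: $target"
```

On the system shown above this should print /usr/local/cuda-12.2, since both alternatives links point there.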

The GPU appears to be present:

$ sudo nvidia-smi
Sat Sep 30 21:51:02 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0              49W / 400W |      4MiB / 40960MiB |     26%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

When this GCE system was originally built, it had a working cuda-11 installation, but I fear I've messed everything up and I'm not sure how to proceed.

1 Answer

It's not clear to me what was broken, but I resolved it by completely removing the installed CUDA toolkit and drivers, both the standalone installation and the one in the conda environment, and then reinstalling. The cause may have been that the conda-installed packages were an older version that I had not removed initially.

$ sudo apt-get --purge remove "*cuda*" "*cublas*" "*cufft*" "*cufile*" "*curand*" "*cusolver*" "*cusparse*" "*gds-tools*" "*npp*" "*nvjpeg*" "nsight*" "*nvvm*"
$ sudo apt-get --purge remove "*nvidia*" "libxnvctrl*"
$ sudo /opt/conda/condabin/conda remove cuda
$ wget https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/cuda-keyring_1.1-1_all.deb
$ sudo dpkg -i cuda-keyring_1.1-1_all.deb
$ sudo apt-get update
$ sudo apt-get install cuda
  (got message about mis-matched drivers, suggesting reboot)
  (exit the GCE shell, stop the VM, restart the VM, and open a new shell)
$ export PATH=/usr/local/cuda-12.2/bin${PATH:+:${PATH}}
$ git clone https://github.com/nvidia/cuda-samples
  (continue verifying the installation by building and running the samples)
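Before building the samples, it may be worth a quick sanity check that the driver and toolkit agree after the reboot. The sketch below pulls the driver version and the "CUDA Version" field (the highest CUDA version the driver supports) out of the nvidia-smi banner, then prints the nvcc release line if nvcc is on PATH. The function names smi_driver_version and smi_cuda_version are my own.

```shell
#!/bin/sh
# Extract the "Driver Version" field from nvidia-smi's banner (read on stdin).
smi_driver_version() {
    sed -n 's/.*Driver Version: *\([0-9.]*\).*/\1/p' | head -n 1
}

# Extract the "CUDA Version" field from the same banner (read on stdin).
smi_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9.]*\).*/\1/p' | head -n 1
}

# Only query the tools that are actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    banner=$(nvidia-smi 2>/dev/null || true)
    echo "driver version: $(printf '%s\n' "$banner" | smi_driver_version)"
    echo "CUDA version supported by driver: $(printf '%s\n' "$banner" | smi_cuda_version)"
fi

if command -v nvcc >/dev/null 2>&1; then
    # nvcc reports the installed toolkit release.
    nvcc --version | grep -i release || true
fi
```

With the banner shown in the question, the first two lines would report 535.104.05 and 12.2; the toolkit release printed by nvcc should not be newer than the CUDA version the driver reports.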