We have an IBM POWER AC922 server that is misbehaving. We initially suspected the GPUs, because a kernel panic sometimes occurs when we run CUDA code.
After investigating the system, we concluded that GPU #1 might be damaged. We removed it from the system with the following command, which seemed to reduce the frequency of crashes:
echo 1 > /sys/bus/pci/devices/0004:05:00.0/remove
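For anyone who wants to reproduce the removal step, here is a minimal Python sketch of the same sysfs write. The function names and the sysfs_root parameter are my own (hypothetical), added so the logic can be dry-run against a scratch directory. The remove and rescan files are the standard Linux PCI sysfs interface. On a real system this needs root.

```python
from pathlib import Path


def pci_device_dir(address, sysfs_root="/sys"):
    """Return the sysfs directory for a PCI address such as 0004:05:00.0."""
    return Path(sysfs_root) / "bus" / "pci" / "devices" / address


def remove_pci_device(address, sysfs_root="/sys"):
    """Write "1" to the device's remove file, detaching it from the PCI bus."""
    remove_file = pci_device_dir(address, sysfs_root) / "remove"
    if not remove_file.exists():
        raise FileNotFoundError(f"no such PCI device: {address}")
    remove_file.write_text("1")
    return remove_file


def rescan_pci_bus(sysfs_root="/sys"):
    """Write "1" to the bus rescan file so removed devices are rediscovered."""
    rescan_file = Path(sysfs_root) / "bus" / "pci" / "rescan"
    rescan_file.write_text("1")
    return rescan_file
```

Calling remove_pci_device("0004:05:00.0") is equivalent to the echo command above, and rescan_pci_bus() should bring the device back without a reboot.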
However, it turns out our assumption was incorrect. Although the system crashes less frequently, the issue persists, and we are unable to use the other GPUs. Running even simple CUDA code results in a GPU initialization error.
There are no warnings or errors in OpenBMC, and nothing unusual appears in the logs. The system remains stable under CPU-only load: I ran stress-ng for 6 hours without any crashes, though I'm unsure whether stress-ng is relevant for this type of issue.
I'm considering the possibility that this is due to faulty RAM, as the crashes are random and the dump outputs vary. Unfortunately, I haven't been able to find a Memtest-like tool for the ppc64le architecture that could help diagnose this.
The machine is out of warranty, so contacting the vendor isn't an option. We also upgraded the firmware to try to mitigate the issue, but without success.
The question: What are my options for hardware diagnostics on the IBM POWER AC922?
PS: Additional info.
The GPUs:
0004:04:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
0004:05:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
0035:04:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
A recent crash (note nv_open_device in the call trace):
[ 2423.753147] Unable to handle kernel paging request for data at address 0x00000c70
[ 2423.753172] Faulting instruction address: 0xc0000000000189c8
[ 2423.753185] Oops: Kernel access of bad area, sig: 11 [#1]
[ 2423.753195] LE SMP NR_CPUS=2048 NUMA PowerNV
[ 2423.753208] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache 8021q garp mrp stp llc bonding nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nvidia_drm(POE) nvidia_modeset(POE) nvidia_uvm(OE) nf_tables_set nvidia(POE) nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink i2c_dev mlx5_ib ib_uverbs xts ib_core vmx_crypto ipmi_powernv ofpart ipmi_devintf powernv_flash ipmi_msghandler mtd ibmpowernv opal_prd at24 uio_pdrv_genirq uio auth_rpcgss sunrpc xfs raid1 sd_mod t10_pi sg mlx5_core bnx2x ast drm_shmem_helper i2c_algo_bit drm_kms_helper ahci libahci syscopyarea sysfillrect sysimgblt drm libata tg3 mlxfw mdio libcrc32c tls psample drm_panel_orientation_quirks dm_mirror dm_region_hash dm_log dm_mod
[ 2423.753446] CPU: 24 PID: 4335 Comm: test_compatibil Kdump: loaded Tainted: P OE -------- - - 4.18.0-553.16.1.el8_10.ppc64le #1
[ 2423.753494] NIP: c0000000000189c8 LR: c000000000018990 CTR: c0000000001d33e0
[ 2423.753525] REGS: c00000016342b190 TRAP: 0300 Tainted: P OE -------- - - (4.18.0-553.16.1.el8_10.ppc64le)
[ 2423.753569] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24008202 XER: 20040000
[ 2423.753608] CFAR: c0000000000189a4 DAR: 0000000000000c70 DSISR: 40000000 IRQMASK: 1
GPR00: c000000000018990 c00000016342b420 c00000000220f200 c000000113973000
GPR04: c00000012b3b6c60 c00000012b3b6000 0000000024008202 c000000000018990
GPR08: 0000000000000000 0000000000000000 c000000163428000 0000000000000001
GPR12: 0000000000008000 c000007ffffdbc00 0000000000000003 00007fffb274c1a8
GPR16: 00007ffffcfc341c 00007fffb274c9d0 00007fffb274c1a8 0000000000000000
GPR20: fffffffffffffff9 0000000000000001 fffffffffffffffb c00000012b3b67c8
GPR24: c00000012b3b6c60 c000000113973c60 0000007ffc700000 c000000001831038
GPR28: c00000012b3b6c60 c00000012b3b6000 c000000113973000 c000000163428000
[ 2423.753868] NIP [c0000000000189c8] __switch_to+0x318/0x520
[ 2423.753889] LR [c000000000018990] __switch_to+0x2e0/0x520
[ 2423.753901] Call Trace:
[ 2423.753906] [c00000016342b420] [c000000000018990] __switch_to+0x2e0/0x520 (unreliable)
[ 2423.753939] [c00000016342b480] [c000000000f93150] __schedule+0x300/0xba0
[ 2423.753971] [c00000016342b550] [c000000000f93a78] schedule+0x88/0x190
[ 2423.754000] [c00000016342b5c0] [c000000000f941e0] schedule_preempt_disabled+0x20/0x30
[ 2423.754032] [c00000016342b5e0] [c000000000217600] rwsem_down_write_slowpath+0x2d0/0x860
[ 2423.754064] [c00000016342b6c0] [c000000000f984ec] down_write+0x7c/0x80
[ 2423.754094] [c00000016342b6f0] [c008000030007018] os_acquire_rwlock_write+0x50/0xa0 [nvidia]
[ 2423.754405] [c00000016342b720] [c008000030c9c984] _nv042349rm+0x24/0x90 [nvidia]
[ 2423.754795] [c00000016342b750] [c0080000301005f4] _nv043550rm+0x244/0x4a0 [nvidia]
[ 2423.755089] [c00000016342b810] [c008000030f984e8] rm_read_registry_dword+0x48/0x120 [nvidia]
[ 2423.755398] [c00000016342b860] [c00800002fff136c] nv_start_device+0x714/0x8e0 [nvidia]
[ 2423.755628] [c00000016342b910] [c00800002fff1600] nv_open_device+0xc8/0x330 [nvidia]
[ 2423.755869] [c00000016342b9a0] [c00800002fff2370] nvidia_open+0x1b8/0x5a0 [nvidia]
[ 2423.756187] [c00000016342ba50] [c00000000059e3a0] chrdev_open+0x180/0x3c0
[ 2423.756218] [c00000016342bac0] [c000000000588a1c] do_dentry_open+0x27c/0x530
[ 2423.756251] [c00000016342bb10] [c0000000005ae458] do_last+0x1c8/0xb60
[ 2423.756283] [c00000016342bbe0] [c0000000005b2854] path_openat+0x124/0x410
[ 2423.756314] [c00000016342bc70] [c0000000005b4d10] do_filp_open+0x90/0x170
[ 2423.756345] [c00000016342bda0] [c00000000058c5f8] sys_openat+0x288/0x3a0
[ 2423.756378] [c00000016342be20] [c00000000000b408] system_call+0x5c/0x70
[ 2423.756412] Instruction dump:
[ 2423.756440] e92a0010 7c7e1b78 71280008 4182001c 7929e042 39000001 79292000 f92a0010
[ 2423.756485] e92d0030 7d09d92e 783f0464 e93f0000 <e8690c70> 2fa30000 419e0020 4bfff7a5
[ 2423.756526] ---[ end trace 4c03f5c49f6427d8 ]---
[ 2423.821191]
[ 2423.821229] Sending IPI to other CPUs
[ 2425.113900] IPI complete
[ 2425.190903] kexec: Starting switchover sequence.
Another crash, which the kernel flagged as a possible exploit attempt:
[13223.767881] cuda_debug[7297]: User access of kernel address (7fffa78f02d8) - exploit attempt? (uid: 1499401105)
[13223.767925] cuda_debug[7297]: User access of kernel address (7fffa78ee8f0) - exploit attempt? (uid: 1499401105)
[13223.768362] Unable to handle kernel paging request for data at address 0x00000c70
[13223.768384] Faulting instruction address: 0xc0000000000189c8
[13223.768396] Oops: Kernel access of bad area, sig: 11 [#1]
[13223.768414] LE SMP NR_CPUS=2048 NUMA PowerNV
[13223.768436] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache 8021q garp mrp stp llc bonding nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nvidia_drm(POE) nvidia_modeset(POE) nvidia_uvm(OE) nft_ct nf_tables_set nvidia(POE) nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink i2c_dev mlx5_ib ib_uverbs xts vmx_crypto ib_core ofpart ipmi_powernv ipmi_devintf powernv_flash at24 ipmi_msghandler ibmpowernv mtd opal_prd uio_pdrv_genirq uio auth_rpcgss sunrpc xfs raid1 sd_mod t10_pi sg uas usb_storage mlx5_core bnx2x ast drm_shmem_helper i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt drm ahci libahci libata tg3 mlxfw tls mdio libcrc32c psample drm_panel_orientation_quirks dm_mirror dm_region_hash dm_log dm_mod
[13223.768721] CPU: 56 PID: 7297 Comm: cuda_debug Kdump: loaded Tainted: P OE -------- - - 4.18.0-553.16.1.el8_10.ppc64le #1
[13223.768767] NIP: c0000000000189c8 LR: c000000000018990 CTR: c0000000001d33e0
[13223.768799] REGS: c00000012f9ff460 TRAP: 0300 Tainted: P OE -------- - - (4.18.0-553.16.1.el8_10.ppc64le)
[13223.768835] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28008242 XER: 20040000
[13223.768872] CFAR: c0000000000189a4 DAR: 0000000000000c70 DSISR: 40000000 IRQMASK: 1
GPR00: c000000000018990 c00000012f9ff6f0 c00000000220f200 c000000113a99800
GPR04: c00000012f96c460 c00000012f96b800 0000000028008242 c000000000018990
GPR08: 0000000000000000 0000000000000000 c00000012f9fc000 0000000000000001
GPR12: 0000000000008000 c000007ffffc0c00 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 c00000012f96bfc8
GPR24: c00000012f96c460 c000000113a9a460 0000007ffdb00000 c000000001831038
GPR28: c00000012f96c460 c00000012f96b800 c000000113a99800 c00000012f9fc000
[13223.769097] NIP [c0000000000189c8] __switch_to+0x318/0x520
[13223.769128] LR [c000000000018990] __switch_to+0x2e0/0x520
[13223.769147] Call Trace:
[13223.769160] [c00000012f9ff6f0] [c000000000018990] __switch_to+0x2e0/0x520 (unreliable)
[13223.769176] [c00000012f9ff750] [c000000000f93150] __schedule+0x300/0xba0
[13223.769190] [c00000012f9ff820] [c000000000f93a78] schedule+0x88/0x190
[13223.769212] [c00000012f9ff890] [c000000000f9b408] schedule_timeout+0x398/0x430
[13223.769244] [c00000012f9ff9a0] [c000000000f94e94] wait_for_common+0x324/0x3a0
[13223.769276] [c00000012f9ffa20] [c00800001d004ecc] _raw_q_flush+0x74/0xb0 [nvidia_uvm]
[13223.769330] [c00000012f9ffaa0] [c00800001d0052ac] nv_kthread_q_flush+0x34/0xd0 [nvidia_uvm]
[13223.769368] [c00000012f9ffb10] [c00800001d025824] uvm_va_space_destroy+0x2cc/0x5c0 [nvidia_uvm]
[13223.769416] [c00000012f9ffbd0] [c00800001d0084f8] uvm_release.isra.15+0xd0/0x1f0 [nvidia_uvm]
[13223.769458] [c00000012f9ffc10] [c00800001d00877c] uvm_release_entry+0xb4/0xf0 [nvidia_uvm]
[13223.769495] [c00000012f9ffc80] [c000000000596350] __fput+0xf0/0x360
[13223.769525] [c00000012f9ffce0] [c0000000001acc68] task_work_run+0x148/0x1a0
[13223.769556] [c00000012f9ffd30] [c00000000001aa44] do_notify_resume+0x454/0x4a0
[13223.769589] [c00000012f9ffe20] [c00000000000dec4] ret_from_except_lite+0x70/0x74
[13223.769620] Instruction dump:
[13223.769638] e92a0010 7c7e1b78 71280008 4182001c 7929e042 39000001 79292000 f92a0010
[13223.769674] e92d0030 7d09d92e 783f0464 e93f0000 <e8690c70> 2fa30000 419e0020 4bfff7a5
[13223.769713] ---[ end trace d44c545e03b97d17 ]---
[13223.854552]
[13223.854591] Sending IPI to other CPUs
[13225.146409] IPI complete
[13225.213333] kexec: Starting switchover sequence.
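To check how much the dumps actually differ, the two oopses above can be reduced to a few fields and compared. This is only a rough sketch: the function names and the choice of fields are mine, and the regular expressions assume the ppc64le oops layout shown in this post. The excerpts are trimmed copies of lines from the two crashes.

```python
import re

# Patterns for a few fields of a ppc64le kernel oops as printed in dmesg.
FIELD_PATTERNS = {
    "comm": re.compile(r"Comm: (\S+)"),
    "cpu": re.compile(r"CPU: (\d+)"),
    "dar": re.compile(r"DAR: ([0-9a-f]+)"),
    "nip_symbol": re.compile(r"NIP \[[0-9a-f]+\] (\S+)"),
}


def summarize_oops(text):
    """Extract the first match of each field, or None if it is absent."""
    summary = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        summary[name] = match.group(1) if match else None
    return summary


def common_fields(first, second):
    """Return the fields that are present and identical in both summaries."""
    return {
        name: value
        for name, value in first.items()
        if value is not None and second.get(name) == value
    }


# Trimmed excerpts from the two crashes in this post.
CRASH_1 = """\
CPU: 24 PID: 4335 Comm: test_compatibil Kdump: loaded Tainted: P OE
CFAR: c0000000000189a4 DAR: 0000000000000c70 DSISR: 40000000 IRQMASK: 1
NIP [c0000000000189c8] __switch_to+0x318/0x520
"""
CRASH_2 = """\
CPU: 56 PID: 7297 Comm: cuda_debug Kdump: loaded Tainted: P OE
CFAR: c0000000000189a4 DAR: 0000000000000c70 DSISR: 40000000 IRQMASK: 1
NIP [c0000000000189c8] __switch_to+0x318/0x520
"""

if __name__ == "__main__":
    print(common_fields(summarize_oops(CRASH_1), summarize_oops(CRASH_2)))
```

For these two excerpts the CPU and process differ, but the faulting address (DAR) and the faulting instruction in __switch_to are the same, which may be a useful hint for anyone reading the traces.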
The system and kernel:
Linux power.localdomain 4.18.0-553.16.1.el8_10.ppc64le #1 SMP Thu Aug 1 04:16:35 EDT 2024 ppc64le ppc64le ppc64le GNU/Linux