I have an older HP Z440 tower with 4x8GB ECC DDR4, running Proxmox VE 6.4. Recently, it started showing MCE errors every few seconds. I installed rasdaemon and can see that they are memory read errors. However, edac-util doesn't show any sign of problems. Memtest passed, but I understand that's normal for correctable errors.

There is only one socket, and the DIMMs are installed in slots 1, 3, 6, and 8 (which seems to be preferred for this model).

Am I actually having memory errors? How can I troubleshoot this further?

dmesg:

root@pve:~# dmesg
...
[ 5729.899255] mce_notify_irq: 20 callbacks suppressed
[ 5729.899260] mce: [Hardware Error]: Machine check events logged
[ 5732.907207] mce: [Hardware Error]: Machine check events logged
[ 5792.907319] mce_notify_irq: 19 callbacks suppressed
[ 5792.907323] mce: [Hardware Error]: Machine check events logged
[ 5793.899247] mce: [Hardware Error]: Machine check events logged
[ 5852.911342] mce_notify_irq: 11 callbacks suppressed
[ 5852.911347] mce: [Hardware Error]: Machine check events logged
[ 5853.903354] mce: [Hardware Error]: Machine check events logged

Errors from rasdaemon:

root@pve:~# ras-mc-ctl --errors | tail
1435 2023-05-12 14:58:05 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=5, mcgcap=0x07000c16, status=0xcc00014000010091, addr=0x4ccdc28c0, misc=0x40484886, walltime=0x645e9a4e, cpuid=0x000306f2, bank=0x00000007
1436 2023-05-12 14:58:06 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=8, mcgcap=0x07000c16, status=0xcc00020000010091, addr=0x4d5c831c0, misc=0x140383886, walltime=0x645e9a4f, cpuid=0x000306f2, bank=0x00000007
1437 2023-05-12 14:58:09 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=2, mcgcap=0x07000c16, status=0xcc00008000010091, addr=0x4ccdc28c0, misc=0x403aba86, walltime=0x645e9a52, cpuid=0x000306f2, bank=0x00000007
1438 2023-05-12 14:58:11 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=2, mcgcap=0x07000c16, status=0xcc00008000010091, addr=0x6fd8eee80, misc=0x140282886, walltime=0x645e9a54, cpuid=0x000306f2, bank=0x00000007
1439 2023-05-12 14:58:12 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=2, mcgcap=0x07000c16, status=0xcc00008000010091, addr=0x510122800, misc=0x140282886, walltime=0x645e9a55, cpuid=0x000306f2, bank=0x00000007
1440 2023-05-12 14:58:13 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=4, mcgcap=0x07000c16, status=0xcc00010000010091, addr=0x4ea312a80, misc=0x1403c3c86, walltime=0x645e9a56, cpuid=0x000306f2, bank=0x00000007
1441 2023-05-12 14:58:16 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error, n_errors=1, mcgcap=0x07000c16, status=0x8c00004000010091, addr=0x4ea342a80, misc=0x1403aba86, walltime=0x645e9a59, cpuid=0x000306f2, bank=0x00000007
1442 2023-05-12 14:58:17 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error, n_errors=1, mcgcap=0x07000c16, status=0x8c00004000010091, addr=0x50abf2900, misc=0x1404c4c86, walltime=0x645e9a5a, cpuid=0x000306f2, bank=0x00000007
1443 2023-05-12 14:58:18 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=8, mcgcap=0x07000c16, status=0xcc00020000010091, addr=0x52676fbc0, misc=0x140585886, walltime=0x645e9a5b, cpuid=0x000306f2, bank=0x00000007

No errors reported by edac:

root@pve:~# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
edac-util: No errors to report.

root@pve:/sys/devices/system/edac/mc# tail -n +1 mc*/ce_* mc*/dimm*/dimm_ce_count
==> mc0/ce_count <==
0

==> mc0/ce_noinfo_count <==
0

==> mc0/dimm0/dimm_ce_count <==
0

==> mc0/dimm3/dimm_ce_count <==
0

==> mc0/dimm6/dimm_ce_count <==
0

==> mc0/dimm9/dimm_ce_count <==
0

1 Answer

My understanding is that edac-utils stopped working once the HERM (Hardware Events Report Method) patches were added to the kernel's EDAC drivers, because it relied on memory error counters being exposed to userspace in the form it expected. Memory errors are now reported through kernel trace events, and a userspace daemon (rasdaemon) has to collect them. So edac-util reported no errors because there were no error reports in the place it expected to find them.
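Since rasdaemon is where the records end up, it can help to tally them and see whether they cluster on one channel. Below is a rough sketch that sums n_errors per label and bank from `ras-mc-ctl --errors` text. The regular expression is only an assumption based on the line format shown in the question, and `summarize` is a made-up helper name, not part of rasdaemon.

```python
import re
from collections import Counter

# Two records copied verbatim from the question's `ras-mc-ctl --errors` output.
SAMPLE = """\
1435 2023-05-12 14:58:05 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Error_overflow Corrected_error, n_errors=5, mcgcap=0x07000c16, status=0xcc00014000010091, addr=0x4ccdc28c0, misc=0x40484886, walltime=0x645e9a4e, cpuid=0x000306f2, bank=0x00000007
1441 2023-05-12 14:58:16 -0500 error: MEMORY CONTROLLER RD_CHANNEL1_ERR Transaction: Memory read error, mcg mcgstatus=0, mci Corrected_error, n_errors=1, mcgcap=0x07000c16, status=0x8c00004000010091, addr=0x4ea342a80, misc=0x1403aba86, walltime=0x645e9a59, cpuid=0x000306f2, bank=0x00000007
"""

# Assumed line shape: "... MEMORY CONTROLLER <label> ... n_errors=<n>, ... bank=<hex>"
RECORD = re.compile(
    r"MEMORY CONTROLLER (?P<label>\S+) .*?"
    r"n_errors=(?P<n>\d+),.*?"
    r"bank=(?P<bank>0x[0-9a-fA-F]+)"
)


def summarize(text):
    """Sum n_errors per (label, bank) pair; lines that do not match are skipped."""
    totals = Counter()
    for line in text.splitlines():
        m = RECORD.search(line)
        if m is None:
            continue
        key = (m.group("label"), int(m.group("bank"), 16))
        totals[key] += int(m.group("n"))
    return totals


if __name__ == "__main__":
    for (label, bank), n in summarize(SAMPLE).items():
        print(f"{label} bank {bank}: {n} corrected errors")
```

To run it on real data, pass the full output of `ras-mc-ctl --errors` as the text. In the excerpt from the question, every record carries the same label (RD_CHANNEL1_ERR) and the same bank (7), which points at one memory channel, not at random noise.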

There's a slightly convoluted account of this on the GitHub page for rasdaemon: https://github.com/mchehab/rasdaemon. But in answer to your question: yes, you were very likely having memory errors.
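If you want to check rasdaemon's interpretation yourself, the status= value in each record can be decoded by hand. Here is a minimal sketch, assuming the IA32_MCi_STATUS layout as I understand it from the Intel SDM: bit 63 valid, bit 62 overflow, bit 61 uncorrected, bits 52:38 corrected error count, and bits 15:0 the MCA error code, where memory controller errors have the form 0000 0000 1MMM CCCC. The function and key names are made up for illustration.

```python
# Request types for the MMM field of a memory controller MCA error code
# (assumed from the Intel SDM's compound error code table).
REQUESTS = {
    0b000: "generic undefined request",
    0b001: "memory read",
    0b010: "memory write",
    0b011: "address/command",
    0b100: "memory scrubbing",
}


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into the fields discussed above."""
    code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        # Only meaningful when the CPU reports corrected-error counting
        # (bit 10 of mcgcap, which is set in 0x07000c16 from the question).
        "corrected_count": (status >> 38) & 0x7FFF,
        "mca_code": code,
    }
    # Memory controller errors: upper 9 bits of the code are 0000 0000 1.
    if code & 0xFF80 == 0x0080:
        fields["request"] = REQUESTS.get((code >> 4) & 0x7, "reserved")
        fields["channel"] = code & 0xF  # 0xF would mean "channel not specified"
    return fields


if __name__ == "__main__":
    # First status value from the question's rasdaemon output.
    for name, value in decode_status(0xCC00014000010091).items():
        print(f"{name}: {value}")
```

For the first record in the question this gives a valid, overflowed, corrected (not uncorrected) memory read error on channel 1 with a count of 5, which agrees with the "Error_overflow Corrected_error, n_errors=5" and RD_CHANNEL1_ERR text that rasdaemon printed.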

advert665