unattended memtest: is it practical?

Question

I will be building a medium-scale cluster (20 nodes, expanding later). For various reasons, commodity hardware should give me a significant cost saving, even allowing for shorter operational cycles and more failures. My worry is persistent memory faults.

The obvious solution here is to run memtest regularly on each node - but this poses 2 issues:

while memtest has a run-once-then-exit mode, how do I configure (in advance) what should happen after it exits (i.e. boot Linux)?
the run-once mode simply halts if errors occur, so how do I get that status off the host?

score 1 · Answer 1 · answered Feb 19 '20 at 22:59

Practical? Not regularly, as part of ongoing operations. Waiting for downtime to burn in memory won't detect transient bit flips, and it introduces significant lag in detecting persistent failures. Further, if you mean the open source memtest86+, there are integration challenges, such as the lack of UEFI support and the difficulty of automating failure reports.

Instead, get hardware with sufficient RAS features, namely ECC memory. Then your server can report memory failures to you.
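As a rough illustration of what "report memory failures to you" can look like on Linux: when an EDAC driver is loaded for an ECC-capable memory controller, the kernel exposes corrected and uncorrected error counters under /sys/devices/system/edac/mc. The sketch below reads those counters. The function names and the threshold are made up for this example, and it assumes the EDAC sysfs layout is present on your nodes.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from the EDAC sysfs tree.

    Returns an empty dict if the directory is missing, which usually means
    no EDAC driver is loaded (or the platform has no ECC support).
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


def controllers_needing_attention(counts, ce_threshold=0):
    """Hypothetical policy: flag any uncorrected error, or corrected
    errors above ce_threshold."""
    return sorted(
        name for name, (ce, ue) in counts.items()
        if ue > 0 or ce > ce_threshold
    )


if __name__ == "__main__":
    print(controllers_needing_attention(read_edac_counts()))
```

Run from cron or a monitoring agent, something like this lets a node flag failing memory while it stays in service, rather than waiting for a test window.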

Such errors might not be very common, and servers without ECC won't immediately crash and burn, so going without it is a choice. However, the price premium is often small, if your server model even offers non-ECC RAM as an option.

score 0 · Answer 2 · answered Feb 20 '20 at 14:05

I now have an answer to the first part of my question. The grub distribution includes a tool called grubonce. If Linux is my default in grub, I can ask grub to boot memtest once, after which it reverts to the default.
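A minimal sketch of scripting that one-shot boot, with several assumptions: it uses GRUB 2's grub-reboot command (which needs GRUB_DEFAULT=saved in the grub configuration) rather than grubonce, the menu entry title "memtest" is a placeholder for whatever your grub.cfg actually calls it, and the node is rebooted with systemctl. If your distribution ships grubonce instead, it could be passed as set_next_cmd, provided it accepts the entry the same way, which is worth checking first.

```python
import subprocess


def one_shot_boot_commands(menu_entry, set_next_cmd="grub-reboot"):
    """Build the commands that select menu_entry for the next boot only
    and then reboot. Nothing is executed here."""
    return [
        [set_next_cmd, menu_entry],  # applies to the next boot only
        ["systemctl", "reboot"],
    ]


def schedule_memtest(menu_entry="memtest", runner=subprocess.run,
                     set_next_cmd="grub-reboot"):
    """Run the commands in order, stopping at the first failure.

    "memtest" is a placeholder entry title, not a real default.
    Requires root on the node being tested.
    """
    for argv in one_shot_boot_commands(menu_entry, set_next_cmd):
        runner(argv, check=True)
```

Keeping the command list separate from the code that runs it makes it possible to log or dry-run the plan before letting it reboot a production node.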

So far it seems my only option for the second part is to look out for a machine staying offline (i.e. not running Linux) after a scheduled memtest is expected to complete.
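That check can be sketched as a small function run on a separate monitoring host. Everything here is hypothetical: the node names, the six-hour allowance for a memtest pass, and the idea that each node sends some heartbeat (a ping reply, an agent check-in) whose last timestamp is recorded. It also assumes the start time is recorded when the node actually goes down, so that no heartbeat arrives between scheduling and reboot.

```python
from datetime import datetime, timedelta


def overdue_nodes(scheduled, last_seen, now, max_duration=timedelta(hours=6)):
    """Return nodes that should be back in Linux by now but are not.

    scheduled: {node: datetime the memtest run started}
    last_seen: {node: datetime of the most recent heartbeat}
    A node counts as back once a heartbeat arrives after its start time.
    """
    overdue = []
    for node, started in scheduled.items():
        deadline = started + max_duration
        seen = last_seen.get(node)
        back_online = seen is not None and seen > started
        if now > deadline and not back_online:
            overdue.append(node)
    return sorted(overdue)


if __name__ == "__main__":
    start = datetime(2020, 2, 20, 1, 0)
    scheduled = {"node01": start, "node02": start}
    last_seen = {
        "node01": start + timedelta(hours=3),     # rebooted into Linux
        "node02": start - timedelta(minutes=5),   # silent since the test began
    }
    print(overdue_nodes(scheduled, last_seen, now=start + timedelta(hours=7)))
```

A node that appears in the result has either halted in memtest with errors or failed to boot for some other reason, so it still needs a look at the console to tell the two apart.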

score 0 · Answer 3 · answered Feb 20 '20 at 15:40

May I ask what application you run, and what you mean by a persistent memory fault?

AFAIK many of today's applications run perfectly well on non-ECC RAM, and most crashes are not related to ECC issues but rather to out-of-memory conditions or bugs.

Scanning the RAM to look for errors is also very inefficient. The first place to look for a potential error is the log files; only once you have found a symptom there do you need to run memtest.

I think it would help to clarify the reasoning behind this first, so that a better solution can be identified. What do you think?
