-2

Every now and then, one of our remote Linux servers crashes: it becomes unreachable over the network (sometimes still responding to ping, but not to ssh/http) and does not respond to mouse or keyboard input.

The servers are high-quality, consumer-grade hardware running Ubuntu 20.04.3 LTS.

Since these crashes happen infrequently, I'm collecting the common reasons a server might crash like this, so that I can set up monitoring (munin) to capture the information I need when it happens and implement countermeasures (e.g. periodic restarts?).

Question:

What are the reasons a Linux computer might become unresponsive, what information can I track to diagnose these issues, and what can I do to fix them?

I believe this question and its answers will be most useful if there's one answer per cause of failure, and I'll be posting answers myself as I find such causes.

4 Answers

0

Reason: Excessive swapping

This can cause a system freeze (though the freeze would usually be transient).

Track: RAM and swap usage

Fix: Increase RAM, tune services, (maybe) increase swap

See here
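As a rough illustration of the "Track" step, here is a minimal Python sketch that reads `/proc/meminfo` and reports how much RAM and swap is in use. The function names and the `SWAP_WARN` threshold are made up for this example; this is not a munin plugin, just something you could run from cron and log.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo content into {field name: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Most fields are "<number> kB"; a few are bare numbers.
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_pressure(info: dict) -> dict:
    """Return the used fraction (0.0 to 1.0) of RAM and of swap."""
    mem_total = info.get("MemTotal", 0)
    mem_available = info.get("MemAvailable", 0)
    swap_total = info.get("SwapTotal", 0)
    swap_free = info.get("SwapFree", 0)
    return {
        "ram_used": (mem_total - mem_available) / mem_total if mem_total else 0.0,
        "swap_used": (swap_total - swap_free) / swap_total if swap_total else 0.0,
    }


if __name__ == "__main__":
    SWAP_WARN = 0.5  # hypothetical threshold, tune for your workload
    try:
        usage = memory_pressure(parse_meminfo(Path("/proc/meminfo").read_text()))
    except OSError as exc:
        print(f"could not read /proc/meminfo: {exc}")
    else:
        flag = "WARN" if usage["swap_used"] > SWAP_WARN else "ok"
        print(f"ram_used={usage['ram_used']:.2f} "
              f"swap_used={usage['swap_used']:.2f} [{flag}]")
```

Logging this every minute or so should leave a trail showing whether swap was filling up shortly before a freeze.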

0

Reason: Excessive RAM/CPU usage

Track: RAM and swap usage, resource-hungry processes, their logs

Fix: Increase RAM, tune services, debug resource-hungry processes to see under which conditions their resource consumption spikes
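To find the resource-hungry processes, one option is to snapshot the largest memory consumers periodically. The sketch below scans `/proc/<pid>/status` for the `VmRSS` field; the helper names and the default limit of 5 are assumptions for illustration, and it only covers memory, not CPU.

```python
from pathlib import Path


def rss_kb(status_text: str) -> int:
    """Extract VmRSS (resident memory in kB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0  # kernel threads have no VmRSS line


def process_name(status_text: str) -> str:
    """Extract the process name from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            return line.split(":", 1)[1].strip()
    return "?"


def top_memory_processes(limit: int = 5, proc: Path = Path("/proc")) -> list:
    """Return up to `limit` (rss_kb, pid, name) tuples, largest RSS first."""
    rows = []
    for entry in proc.iterdir():
        if not entry.name.isdigit():
            continue  # not a process directory
        try:
            text = (entry / "status").read_text(errors="replace")
        except OSError:
            continue  # process exited or is not readable
        rows.append((rss_kb(text), int(entry.name), process_name(text)))
    rows.sort(reverse=True)
    return rows[:limit]


if __name__ == "__main__":
    try:
        for rss, pid, name in top_memory_processes():
            print(f"{rss:>10} kB  pid={pid:<7} {name}")
    except OSError as exc:
        print(f"could not scan /proc: {exc}")
```

Appending this output to a log file at a fixed interval gives you something to inspect after a crash, when the offending process is no longer running.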

0

Reason: Hard disk write failures

Track: SMART diagnostics

Fix: Replace failing disks
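A minimal sketch of polling SMART health from a script, assuming `smartctl` (from the smartmontools package) is installed, supports JSON output via `-j` (smartmontools 7.0 or later), and is run with enough privileges to query the disk. The device path `/dev/sda` and the function names are placeholders for illustration.

```python
import json
import subprocess


def smart_passed(report):
    """Return True/False from smartctl's JSON report, or None if unknown."""
    if not isinstance(report, dict):
        return None
    status = report.get("smart_status")
    if isinstance(status, dict) and isinstance(status.get("passed"), bool):
        return status["passed"]
    return None


def check_disk(device: str):
    """Run `smartctl -H -j <device>` and return its overall health verdict."""
    # smartctl uses its exit code as a bitmask of findings, so a non-zero
    # exit is not treated as an error here.
    result = subprocess.run(
        ["smartctl", "-H", "-j", device],
        capture_output=True, text=True, check=False,
    )
    try:
        return smart_passed(json.loads(result.stdout))
    except json.JSONDecodeError:
        return None


if __name__ == "__main__":
    DEVICE = "/dev/sda"  # assumption: adjust to your actual disk
    try:
        verdict = check_disk(DEVICE)
    except OSError as exc:
        print(f"could not run smartctl: {exc}")
    else:
        labels = {True: "PASSED", False: "FAILED", None: "unknown"}
        print(f"{DEVICE}: SMART health {labels[verdict]}")
```

The overall verdict is a coarse signal; a disk can report "passed" while individual attributes such as reallocated sector counts are already climbing, so it is worth tracking those as well.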

-1

The previous answers already give you some good pointers.

You might also want to take your server offline for a weekend (if possible) and test the RAM with Memtest86.

Burn the Memtest86 image to a CD or write it to a USB key, then boot the machine from it. I understand you have physical access to the machine.

yield