6

I'm running my RPi B in headless mode, controlling it via SSH (Raspbian).

I used to get random crashes under load, until I got a better power supply.

Now the problem has resurfaced. I'm trying to compile a new version of Python, and the Pi simply reboots during the make process. It has also crashed on other occasions, especially under heavy load.

I've removed all peripherals in order to lower the power consumption, but the issue remains.

Any ideas? I'm not sure where to begin investigating an unplanned reboot.

elsurudo

3 Answers

9

a bunch of wd_keepalive[2871]: unable to disable oom handling!. I'll have to try to force a crash again to see if the times of these coincide with the crashes...

wd_keepalive is a watchdog daemon described in man wd_keepalive. I do not think it is installed by default, so you are probably aware of it.

The OOM killer is a kernel feature intended to keep the system up when all available physical memory is consumed. When this happens, the system may become temporarily unresponsive, which may trigger your watchdog. This would explain the spontaneous reboots, since that is what a watchdog usually does.

/var/log/syslog most likely contains a record of the OOM killer activity. This is in the form of a table with columns like this:

[ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name

The kernel makes its decision based on some algorithm that generally selects the process with the largest amount of memory consumed, the point being to free up as much memory as possible. However, this does not always stop the "culprit", which may be, for example, something which rapidly forks or balloons when not much was left. It is worth observing that there may not really be any culprit other than the one between chair and keyboard -- everything is behaving as it is supposed to, but you are asking for too much.
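To see ahead of time which processes the kernel currently considers the best candidates, one option is to read the per-process score that Linux exposes under /proc. The following is only a rough sketch: it assumes a Linux /proc filesystem with oom_score and comm files for each PID, and the function name rank_oom_candidates is made up for illustration.

```python
import os


def rank_oom_candidates(proc_root="/proc", top=5):
    """Return up to `top` (score, pid, name) tuples, highest oom_score first."""
    candidates = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo" or "self"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read().strip())
            with open(os.path.join(base, "comm")) as f:
                name = f.read().strip()
        except (OSError, ValueError):
            continue  # the process exited or its files are not readable
        candidates.append((score, int(entry), name))
    # Tuples sort by score first, so the likeliest victim comes out on top.
    candidates.sort(reverse=True)
    return candidates[:top]
```

Calling rank_oom_candidates() while the build is running should give a rough idea of what would be killed first if memory ran out at that moment.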

At the end of the table you will find the decision made: Out of memory: Kill process 1234 foobar, where 1234 is a PID and foobar is the name of the process. Again, this does not necessarily indicate misbehavior. However, if something is misbehaving and the killed process was not it, the OOM condition will probably recur quickly and you will find another table further on. By comparing these tables you should be able to work out what happened.
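If the log is long, a small script can pull out just the decision lines. This is a minimal sketch, assuming the message wording quoted above; the exact text varies between kernel versions, so the pattern also tolerates "Killed" and parentheses around the process name. The names find_oom_kills and scan_log are illustrative, not part of any existing tool.

```python
import re

# Pattern for the decision line. The optional "ed" and the optional
# parentheses are assumptions to cover wording differences between kernels;
# check a real line from your own log and adjust if needed.
KILL_RE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(?([^\s)]+)\)?")


def find_oom_kills(lines):
    """Return (line_number, pid, name) for each OOM killer decision found."""
    kills = []
    for number, line in enumerate(lines, start=1):
        match = KILL_RE.search(line)
        if match:
            kills.append((number, int(match.group(1)), match.group(2)))
    return kills


def scan_log(path="/var/log/syslog"):
    """Scan a log file; reading it may require root or adm group membership."""
    with open(path, errors="replace") as log:
        return find_oom_kills(log)
```

The line numbers returned make it easy to jump to each table in the log and compare the timestamps with the times of the reboots.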

To be clear: the intention of the OOM killer is not to disable the system. It is to free up memory in order to restore normal operation. This is not always effective, and the slowdown created by the OOM condition may have triggered the watchdog (which is effective, at least in terms of restarting the system).

To help prevent or diagnose the issue in the future, you could use a monitor such as top or htop. For the long term (e.g., on a server), you could try monitoring specific processes with plog.
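If you would rather have a record you can read after a reboot than watch top live, something like the following could be left running in a second SSH session with its output redirected to a file. It is only a sketch: it assumes the standard Linux /proc/meminfo format, and parse_meminfo and log_memory are hypothetical names chosen for this example.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def log_memory(interval=5.0, samples=3, path="/proc/meminfo"):
    """Print free memory and free swap `samples` times, `interval` seconds apart."""
    for _ in range(samples):
        with open(path) as f:
            info = parse_meminfo(f.read())
        print(
            time.strftime("%H:%M:%S"),
            "MemFree:", info.get("MemFree"), "kB",
            "SwapFree:", info.get("SwapFree"), "kB",
            flush=True,  # flush so the last lines survive a sudden reboot
        )
        time.sleep(interval)
```

If the last few lines before a crash show MemFree and SwapFree both close to zero, that supports the out-of-memory explanation; if memory looks fine right up to the reboot, the power supply becomes the more likely suspect.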

goldilocks
4

wd_keepalive[n] is a log entry from your watchdog daemon, where n is the PID of the wd_keepalive process that wrote the message. From what I can tell, the error is a bug in watchdog. Essentially, it's either causing or receiving a fork bomb, which then causes the system to crash.

Try disabling watchdog, and then run your make process.

Jacobm001
1

I had a similar problem where the Raspberry Pi just crashed when running make. I changed the power adapter and it worked like magic.

Yossi Neiman