While backing up a fairly large folder (450G) to a 2TB drive that sits in the server solely as a backup destination, rdiff-backup (version 1.2.8, the last release marked stable) caused a kernel panic.

System:

Linux giorgio 3.2.0-4-amd64 #1 SMP Debian 3.2.51-1 x86_64 GNU/Linux

Disks: 2 1TB disks in software mirror RAID mode, 1 2TB disk solely for backups.

I have a suspicion: memory on the server is 2G RAM + 2G swap = 4G. There are files up to 16G in size. Is it possible that rdiff-backup at some point loads the entire file into memory?

In any case, a kernel panic should not have happened (since the rdiff process was killed? so the memory should have been made available again?), so I guess my question has two parts, one: about my suspicion, two: about the kernel panic.

By the way, the panics started only recently. Quite a number of backups, both full and incremental, had already succeeded, and those multi-gigabyte files were already there at the time. So I guess it's the new Debian kernel's fault rather than rdiff-backup's?

Logfile section at the time the panic happens http://pastebin.com/e9a5fQdh

Last thing on the screen:

EDIT/Update: I just tried creating a 20GB swap file (with dd from /dev/zero) and the server went DOWN again, no reaction to ping.

From looking at the logs: It seems the kernel has killed some processes - including the one I suspect of having caused it all (rdiff-backup) - but says "running out of killable processes". It seems that killing the processes did not free the memory?

Mörre

1 Answer

The kernel didn't kill rdiff-backup. It should have, but the process's oom_score_adj is -1000, which exempts it from the OOM killer.

This is caused by a bug in sshd. The bug is fixed, but the fix won't be available until the next release, which is openssh 6.5.

If you reload sshd, it fails to set the oom_score_adj of the new shells it creates back to 0. Every process you spawn via SSH (your bash shell and any child processes it creates) therefore ends up with an oom_score_adj of -1000 and can hog all the memory without the oom-killer killing it.
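To see which processes are currently affected, something like the following sketch can help. It walks the numeric directories under /proc and prints the pid and command name of every process whose oom_score_adj is -1000. The function name `list_oom_exempt` and its optional directory argument are made up for this example (the argument only exists so the function can be pointed at a test directory), and it assumes a kernel that provides /proc/<pid>/comm.

```shell
# List processes that are exempt from the OOM killer (oom_score_adj = -1000).
# Usage: list_oom_exempt [PROC_ROOT]
# PROC_ROOT is a hypothetical override for testing; it defaults to /proc.
list_oom_exempt() {
    proc_root="${1:-/proc}"
    for dir in "$proc_root"/[0-9]*; do
        # Skip entries that vanished or that we are not allowed to read.
        [ -r "$dir/oom_score_adj" ] || continue
        score=$(cat "$dir/oom_score_adj" 2>/dev/null) || continue
        if [ "$score" = "-1000" ]; then
            name=$(cat "$dir/comm" 2>/dev/null)
            # Print "<pid><TAB><command name>".
            printf '%s\t%s\n' "${dir##*/}" "${name:-?}"
        fi
    done
}
```

Calling `list_oom_exempt` with no arguments scans the real /proc. If your login shell or rdiff-backup shows up in the output, it was most likely started from an sshd in the broken state. Some system daemons set -1000 on purpose, so not every entry is a problem.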

The quickest way to fix this is (assuming 7567 is the pid of sshd, as in your case):

  • Run echo 0 >/proc/7567/oom_score_adj
  • Restart sshd.

Until the fix is in place (openssh 6.5 should have it), do not reload sshd; restart it instead.
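The first step can be wrapped in a small helper that also reads the value back, so you can confirm the write took effect. This is only a sketch: the function name `reset_oom_score_adj` and its optional second argument are invented for this example, and it has to run as root to touch sshd's entry.

```shell
# Reset a process's oom_score_adj to 0 and print the value read back.
# Usage: reset_oom_score_adj PID [PROC_ROOT]
# PROC_ROOT is a hypothetical override for testing; it defaults to /proc.
reset_oom_score_adj() {
    pid="$1"
    proc_root="${2:-/proc}"
    file="$proc_root/$pid/oom_score_adj"
    if [ ! -w "$file" ]; then
        echo "cannot write $file (wrong pid, or not running as root?)" >&2
        return 1
    fi
    printf '0\n' > "$file" || return 1
    # Read the value back so the caller can verify it is now 0.
    cat "$file"
}
```

For the case above that would be `reset_oom_score_adj 7567`, followed by a restart of sshd. On Debian the service is usually named ssh, so `service ssh restart` is a reasonable guess, but check the name on your system. Shells that are already running keep their old value, so log in again afterwards.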

The bug is reported and fixed here: https://bugzilla.mindrot.org/show_bug.cgi?id=2156

Matthew Ife