23

I'm running Linux and have a few processes (game servers) that occasionally crash and end up using 100% CPU.

I'm looking for a program or script that checks the CPU usage of a list of processes by name and kills any that stay at 100% for longer than a set time, say 30 seconds. I tried ps-watcher but couldn't work out how to do this with it.

Simply killing a process as soon as it hits 100% won't work, because the servers reach that level for brief periods during normal operation.

I've also found this script, which seems to do what I want but is limited to one process: link

Any help is greatly appreciated!

user30153
  • 231

3 Answers

22

Try monit.

You could use a configuration like this to accomplish your task:

check process gameserver with pidfile /var/run/gameserver.pid
  start program = "/etc/init.d/gameserver start" with timeout 60 seconds
  stop program  = "/etc/init.d/gameserver stop"
  if cpu > 80% for 2 cycles then alert
  if cpu > 95% for 5 cycles then restart
  if totalmem > 200.0 MB for 5 cycles then restart
  if loadavg(5min) greater than 10 for 8 cycles then stop
  if failed port 12345 type tcp with timeout 15 seconds
    then restart
  if 3 restarts within 5 cycles then timeout

Details about this configuration can be found in monit's documentation.
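The "for 5 cycles" conditions above mean that a limit has to be exceeded on several consecutive checks before monit acts, which is what keeps brief spikes from triggering a restart. If you would rather script that idea yourself, a minimal Bash sketch could look like the following. It is not how monit is implemented; the function names are hypothetical, it assumes a Linux /proc filesystem, and INTERVAL must be a whole number of seconds.

```shell
#!/bin/bash
# Hypothetical sketch: kill a process only after several consecutive
# high-CPU samples, similar in spirit to monit's "for N cycles".

# ticks_from_stat LINE: print utime+stime (clock ticks) from the contents
# of /proc/PID/stat. Everything up to the last ") " is dropped first,
# because the command name (field 2) may itself contain spaces.
ticks_from_stat() {
  local rest="${1##*) }"
  set -- $rest                      # $1 is now the state (stat field 3)
  echo $(( ${12} + ${13} ))         # stat fields 14 (utime) and 15 (stime)
}

# cpu_percent DELTA_TICKS HZ SECONDS: integer CPU percentage over the
# sampling window (100 means one fully busy core).
cpu_percent() {
  echo $(( $1 * 100 / ($2 * $3) ))
}

# next_strikes STRIKES CPU LIMIT: count of consecutive over-limit samples;
# a sample at or below LIMIT resets the count to 0.
next_strikes() {
  if [ "$2" -gt "$3" ]; then echo $(( $1 + 1 )); else echo 0; fi
}

# watch_pid PID LIMIT CYCLES INTERVAL: sample PID every INTERVAL seconds
# and send SIGTERM after CYCLES consecutive samples above LIMIT percent.
watch_pid() {
  local pid=$1 limit=$2 cycles=$3 interval=$4
  local hz stat prev now cpu strikes=0
  hz=$(getconf CLK_TCK)
  stat=$(cat "/proc/$pid/stat" 2>/dev/null) || return 1
  prev=$(ticks_from_stat "$stat")
  while sleep "$interval"; do
    stat=$(cat "/proc/$pid/stat" 2>/dev/null) || return 0   # process is gone
    now=$(ticks_from_stat "$stat")
    cpu=$(cpu_percent $(( now - prev )) "$hz" "$interval")
    strikes=$(next_strikes "$strikes" "$cpu" "$limit")
    prev=$now
    if [ "$strikes" -ge "$cycles" ]; then
      kill "$pid"
      return 0
    fi
  done
}
```

For example, watch_pid "$(cat /var/run/gameserver.pid)" 95 5 6 would send SIGTERM after roughly 30 seconds above 95% CPU. To watch several processes by name, one option is to start one watcher per PID returned by pgrep -x, each in the background.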

joschi
  • 21,955
4

This is what I was looking for, and I have been using it (slightly altered) for some time now. Lately I've introduced a bug into my own work but still need to keep the app (game server) running.
I commented out the part where the top-most PID is killed, because it was killing the wrong PID.
Here's my latest draft of your script. So far, it finds the top-most overloading process and kills it, and it also emails me the details whenever it does anything:

#!/bin/bash

# Note: will kill the top-most process if the $CPU_LOAD is greater than the $CPU_THRESHOLD.
echo
echo checking for run-away process ...

# Load average with the decimal point removed (e.g. 3.00 becomes 300)
CPU_LOAD=$(uptime | cut -d"," -f4 | cut -d":" -f2 | cut -d" " -f2 | sed -e "s/\.//g")
CPU_THRESHOLD=300
PROCESS=$(ps aux r)
# Sort numerically on the %CPU column so the busiest process comes first
TOPPROCESS=$(ps -eo pid -eo pcpu -eo command | sort -k 2 -n -r | grep -v PID | head -n 1)
TOPPROCESSPID=$(echo "$TOPPROCESS" | awk '{print $1}')

if [ $CPU_LOAD -gt $CPU_THRESHOLD ] ; then
  # kill -9 $(ps -eo pid | sort -k 1 -r | grep -v PID | head -n 1) #original
  # kill -9 $(ps -eo pcpu | sort -k 1 -r | grep -v %CPU | head -n 1)
  kill -9 $TOPPROCESSPID
  echo system overloading!
  echo Top-most process killed $TOPPROCESS
  echo load average is at $CPU_LOAD
  echo
  echo Active processes...
  ps aux r

  # send an email using mail
  SUBJECT="Runaway Process Report at Marysol"
  # Email To ?
  EMAIL="myemail@somewhere.org"
  # Email text/message
  EMAILMESSAGE="/tmp/emailmessage.txt"
  echo "System overloading, possible runaway process." > $EMAILMESSAGE
  echo "Top-most process killed $TOPPROCESS" >> $EMAILMESSAGE
  echo "Load average was at $CPU_LOAD" >> $EMAILMESSAGE
  echo "Active processes..." >> $EMAILMESSAGE
  echo "$PROCESS" >> $EMAILMESSAGE
  mail -s "$SUBJECT" "$EMAIL" < $EMAILMESSAGE
else
  echo
  echo no run-aways.
  echo load average is at $CPU_LOAD
  echo
  echo Active processes...
  ps aux r
fi
exit 0


This little script has been extremely useful. If you don't want it to kill any process, the email alone will help keep you informed.
16851556
  • 546
0

Below is a sample Bash script that may give you some hints for your own needs.

#!/bin/bash

CPU_LOAD=$(uptime | cut -d"," -f4 | cut -d":" -f2 | cut -d" " -f2 | sed -e "s/\.//g")
CPU_THRESHOLD=700

if [ $CPU_LOAD -gt $CPU_THRESHOLD ] ; then
  # Kill the process with the highest CPU usage (procps ps; "pid=" suppresses the header)
  kill -9 $(ps -eo pid= --sort=-pcpu | head -n 1)
fi

exit 0

Note that $CPU_LOAD is a load average with the decimal point removed, so a $CPU_THRESHOLD of 700 corresponds to a load of 7.00. The right value depends on the number of CPU cores in your system. A detailed explanation of this topic can be found at http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages .
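As a rough illustration of tying the threshold to the core count, the sketch below converts a load average to the same "times 100" integer form and scales the limit per core. The helper names are hypothetical, and reading /proc/loadavg and calling nproc assume Linux with GNU coreutils.

```shell
#!/bin/bash
# Hypothetical helpers: express the load average as an integer (load x 100)
# and scale the threshold by the number of cores.

# load_to_int LOAD: "0.50" -> 50, "12.34" -> 1234. The 10# prefix forces
# base 10 so that a leading zero is not read as octal.
load_to_int() {
  local digits=${1/./}
  echo $(( 10#$digits ))
}

# threshold_for CORES PER_CORE: PER_CORE is the tolerated load per core,
# times 100 (150 means a load of 1.50 per core).
threshold_for() {
  echo $(( $1 * $2 ))
}

# is_overloaded LOAD CORES PER_CORE: succeed if LOAD exceeds the threshold.
is_overloaded() {
  [ "$(load_to_int "$1")" -gt "$(threshold_for "$2" "$3")" ]
}

# Example use on Linux (the first field of /proc/loadavg is the 1-minute load):
#   read -r load1 _ < /proc/loadavg
#   is_overloaded "$load1" "$(nproc)" 100 && echo "system overloaded"
```

Reading /proc/loadavg avoids parsing the output of uptime, whose comma-separated layout changes with how long the machine has been up.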

You can run the script either from /etc/inittab or from a cron job at whatever interval (in minutes) you prefer. Also note that the example script kills the top-most process whenever $CPU_LOAD exceeds $CPU_THRESHOLD.

bintut
  • 339