How to get postgres to resume processes after it hangs when the RAID drive hosting the tablespace times out

Question

Nearly once per day I get a timeout that shows up in the console like this:

7:1:36 PM kernel: Areca RAID volume scsi command timeout (target=0 lun=0)

which corresponds exactly to when all postgres processes hang.

I'm reluctant to terminate pids because that tends to bring everything down. Is there any way to resurrect these processes? I can't even open a new postgres session in a terminal.

Also, is there any way to make postgres more tolerant of such timeouts? My System:

Mac Pro (2013) with 12 cores and 64GB RAM
OSX 10.10.4
Postgres 9.4.4

The drives I'm using for tablespaces (4 of them) are:

Areca 8050 T2
each with 8 1TB OWC Mercury Electra 6G SSDs in a RAID 0 config.

Craig Ringer · Accepted Answer · 2015-08-14T08:40:48.833

You very likely have one or more failing disks, or a failing RAID controller. This would be a really good time to migrate to another host.

Also, RAID 0 across eight disks? That's asking for trouble: RAID 0 has no redundancy, so losing any one disk loses the whole volume.

In a case like this PostgreSQL will be waiting in a system call into the kernel, like a pwrite(), fsync() or similar. PostgreSQL isn't being tolerant or intolerant of your RAID controller issues. It's just patiently waiting until the kernel finishes doing what PostgreSQL asked it to do and returns control to PostgreSQL. This is exactly what should happen. The problem is that the controller isn't responding, so the system call never returns. You should be able to confirm this with strace or ps, though I'm not sure of the exact options on OS X (strace is not available there; dtruss is the closest equivalent).
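One way to check the "stuck in a system call" diagnosis is to look at the process state column that ps reports. The following is a minimal sketch, not a definitive tool: it assumes the backends show up with "postgres" in their command line, and that a state beginning with "U" (OS X) or "D" (Linux) marks an uninterruptible wait. The helper names find_blocked and list_blocked_postgres are made up for this illustration.

```python
import subprocess

# First character of the ps state column that marks an uninterruptible
# wait: "U" on OS X / BSD, "D" on Linux.
UNINTERRUPTIBLE = {"U", "D"}


def find_blocked(ps_output, name="postgres"):
    """Return (pid, state, command) for matching processes stuck in the kernel.

    ps_output is expected to hold lines of "pid state command" with no header.
    The name filter is an assumption about how the backends are labelled.
    """
    blocked = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        pid, state, command = parts
        if name in command and state[0] in UNINTERRUPTIBLE:
            blocked.append((int(pid), state, command))
    return blocked


def list_blocked_postgres():
    """Run ps for all processes and filter the result with find_blocked."""
    # A trailing "=" after each keyword suppresses its column header.
    result = subprocess.run(
        ["ps", "-A", "-o", "pid=", "-o", "stat=", "-o", "command="],
        capture_output=True, text=True, check=True,
    )
    return find_blocked(result.stdout)


if __name__ == "__main__":
    for pid, state, command in list_blocked_postgres():
        print(pid, state, command)
```

If the hung backends appear in this list while the controller is timing out, they are blocked inside the kernel waiting on I/O. Processes in that state generally cannot be resumed or killed from user space until the pending call returns, which is why the fix has to happen at the storage layer rather than in PostgreSQL.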
