I am trying to back up our Postgres database (8.4.17), which is about 25 GB in size:

pg_dump database_name > database_db_dump_2014-05-05.sql

Unfortunately, the backup stalls after about 600 MB and does not continue. It always stops while dumping one specific table (fb_crawler_event); I was able to dump every other table successfully using the --table switch. I have already stopped any other process that could interact with the database.
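For reference, the per-table workaround can be scripted. This is only a sketch: the table list and database name below are stand-ins for this question's setup (in practice you would generate the list from pg_catalog), and it prints the commands as a dry run rather than executing them:

```shell
#!/bin/sh
# Sketch: dump each table into its own file, skipping the one that hangs.
# DB name and table list are placeholders -- adjust for your setup.
DB=database_name
SKIP=fb_crawler_event
TABLES="django_site fb_crawler_event"   # in practice: one name per table

CMDS=""
for t in $TABLES; do
    if [ "$t" = "$SKIP" ]; then
        continue                        # leave the problem table out
    fi
    # Commands are collected as strings for a dry run; pipe the output
    # to sh (or run pg_dump directly here) to actually dump.
    CMDS="$CMDS
pg_dump --table=$t $DB > ${t}_dump_2014-05-05.sql"
done
printf '%s\n' "$CMDS"
```

This prints one pg_dump invocation per table, with the suspect table left out, so you can review the list before running it.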


Server log after a restart of the DB (looks OK to me):

2014-05-05 14:34:46 CEST LOG:  all server processes terminated; reinitializing
2014-05-05 14:34:46 CEST LOG:  database system was interrupted; last known up at 2014-05-05 14:32:50 CEST
2014-05-05 14:34:46 CEST LOG:  database system was not properly shut down; automatic recovery in progress
2014-05-05 14:34:46 CEST LOG:  record with zero length at 1A1/AD6A78C0
2014-05-05 14:34:46 CEST LOG:  redo is not required
2014-05-05 14:34:47 CEST LOG:  database system is ready to accept connections
2014-05-05 14:34:47 CEST LOG:  autovacuum launcher started

Verbose output from pg_dump (nothing suspicious until the offending table, at which point the dump stops):

... lots of lines ...
pg_dump: restoring data for table "django_site"
pg_dump: dumping contents of table django_site
pg_dump: restoring data for table "fb_crawler_event"
pg_dump: dumping contents of table fb_crawler_event

And then it just halts.


Output from the pg_locks view, which seems huge to me (294 entries altogether):

  locktype  | database | relation | page | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |      mode       | granted 
------------+----------+----------+------+-------+------------+---------------+---------+-------+----------+--------------------+-------+-----------------+---------
 relation   |    16384 |     2674 |      |       |            |               |         |       |          | 1/23               | 19526 | AccessShareLock | t
 relation   |    16384 |    27367 |      |       |            |               |         |       |          | 1/23               | 19526 | AccessShareLock | t
 relation   |    16384 |   695092 |      |       |            |               |         |       |          | 1/23               | 19526 | AccessShareLock | t
 relation   |    16384 |     2675 |      |       |            |               |         |       |          | 2/7                | 18960 | AccessShareLock | t
 relation   |        0 |     2671 |      |       |            |               |         |       |          | 1/23               | 19526 | AccessShareLock | t
 virtualxid |          |          |      |       | 2/7        |               |         |       |          | 2/7                | 18960 | ExclusiveLock   | t
...
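Most of those entries are routine AccessShareLocks taken by pg_dump itself; the interesting ones are locks that have not been granted. A query along these lines (written for 8.4, where pg_stat_activity still uses the procpid and current_query columns; newer versions renamed them to pid and query) shows who is waiting and on what:

```sql
-- Show ungranted locks together with the query holding each connection.
SELECT l.pid, l.locktype, l.relation::regclass, l.mode, a.current_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.procpid = l.pid
WHERE NOT l.granted;
```

If this returns no rows, the dump is not blocked waiting on a lock, which points away from locking and toward storage-level trouble.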

I am no expert in Postgres, so I am wondering: is the database corrupt?

How can I further debug this anomaly?

devsnd

1 Answer


I was able to figure out which file in the database storage was the culprit by reading every file and discarding the output. Note that cp -R to /dev/null does not work, because /dev/null is not a directory; reading the files instead does the job and names the file that fails:

find /usr/lib/postgresql/8.4 -type f -exec cat {} + > /dev/null

(The path to your DB files might differ; on Debian-style installations the data directory usually lives under /var/lib/postgresql.)

The corrupt file could not be read, and nothing I tried changed that, so it was most probably a filesystem error or a hardware failure.

So I rebooted the server with a forced fsck (e.g. touch /forcefsck) to let the filesystem do its best to repair itself. This might not be the way you want to go, since total data loss is possible afterwards, but I had already preserved the most precious data beforehand, so I took the risk.

After the reboot I could finally access the inaccessible table again, but I am not sure whether the data it contains is corrupted or not. In any case, I now have a backup, which I can dissect to find out, and my server can go back online for now.
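One quick sanity check on the resulting backup (a hedged sketch, assuming a plain-format SQL dump like the one in the question): pg_dump ends a completed plain dump with a "PostgreSQL database dump complete" comment, so its absence means the dump was cut off mid-table:

```shell
#!/bin/sh
# Succeed if a plain-format pg_dump file ends with the completion marker
# that pg_dump writes only after a successful dump.
dump_complete() {
    tail -n 5 "$1" | grep -q 'PostgreSQL database dump complete'
}

# Example, using the filename from the question:
if dump_complete database_db_dump_2014-05-05.sql; then
    echo "dump looks complete"
else
    echo "dump appears truncated (or file missing)"
fi
```

This only checks that the dump ran to the end; it says nothing about whether the table data inside is logically intact.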

I recommend reading the PostgreSQL wiki page on corruption and the slides of this FOSDEM presentation for more information on database corruption.

devsnd