
I have an application that generates a lot of data which needs to be inserted quickly (around 13 million records). I use JPA 2.0/Hibernate with Postgres 9.1, and I managed to achieve quite good performance (around 25k inserts per second) with multi-threading and batching the inserts every few thousand rows or so, completing a whole run in around 8 minutes.

However, I noticed that a few of the foreign keys were missing indexes, which I really want to have, both for analysis (to drill down into the data) and to delete the data belonging to a specific run. Unfortunately, when I added these 3 indexes to the table that gets the most inserts, performance dropped drastically to around 3k inserts per second.

Is there any way to avoid this performance slowdown? I know one option is to drop the indexes before a run and recreate them at the end. Another, clumsier option is to write the data for the biggest table to a file instead and load it with COPY. I guess I could only do that for the largest table anyway, because of the foreign key values I would need to know (they are generated through sequences).
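For illustration, a minimal sketch of the COPY variant (table, columns and file path are hypothetical, and the file would have to contain pre-generated sequence values for the foreign keys):

    -- Hypothetical example: bulk-load pre-generated rows for the largest
    -- table from a server-side CSV file; COPY avoids per-row INSERT overhead.
    COPY measurements (run_id, sensor_id, recorded_at, value)
        FROM '/tmp/measurements.csv' WITH (FORMAT csv);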

Both alternatives seem like hacks. Is there any other solution, maybe one a bit less intrusive to the application? Some setting that tells Postgres to defer indexing, or something of that sort?

Any ideas welcome.

jbx

2 Answers


In addition to Craig's advice, I would suggest examining the storage parameters of the affected tables.

I am currently in a similar situation to yours. The largest table in my system contains ~200 million records and the performance was really bad.

Tune the storage parameters of your tables and indexes

Besides adding several indexes to the database, I changed the storage parameters of some tables and specified a custom fillfactor for the tables themselves and for their indexes.

Setting a custom fillfactor lets you instruct PostgreSQL how full to pack each page, leaving the remaining space reserved for later updates. The same applies to indexes.

See the documentation on CREATE TABLE and the description of the available storage parameters for details.
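As a rough illustration only (table and index names are made up, and 70 is not a tuned recommendation):

    -- Hypothetical example: fill heap and index pages only to 70%,
    -- so later updates and index insertions cause fewer page splits.
    ALTER TABLE measurements SET (fillfactor = 70);
    ALTER INDEX measurements_run_id_idx SET (fillfactor = 70);

    -- Or set it when the index is first created:
    CREATE INDEX measurements_sensor_id_idx
        ON measurements (sensor_id)
        WITH (fillfactor = 70);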

Monitor your infrastructure

Monitor and analyze your infrastructure. The PostgreSQL wiki lists a lot of useful tools.

Find long running statements

Enable statement logging by altering the following values in your postgresql.conf file:

  • log_min_duration_statement = x to log all statements which run longer than x milliseconds
  • log_min_messages = level, set to a level that helps you understand the statements generated by JPA

See the description of the runtime logging configuration for details.
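For example, in postgresql.conf (the 250 ms threshold is an arbitrary illustration, not a recommendation):

    # Log every statement that runs longer than 250 ms
    # (-1 disables this, 0 logs all statements with their durations).
    log_min_duration_statement = 250

    # Optionally raise the log detail while investigating what JPA generates.
    log_min_messages = info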

Install pgFouine to analyse your PostgreSQL log file easily.

Be picky

Besides altering the storage parameters, I also gained a lot of performance by optimizing all frequently executed statements. In some cases I only shaved off 50 or 100 milliseconds per execution, but in total that saved more than 5 seconds for complex operations.
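A typical way to inspect one of those statements pulled from the log (the query itself is hypothetical):

    -- Hypothetical example: check the plan, actual runtime and buffer usage
    -- of a frequently executed statement.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*)
    FROM measurements
    WHERE run_id = 42;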

Oliver F.

Deferred indexing would be nice, but isn't currently supported.

Adding indexes has a cost - write performance. They're a trade-off.

COPY won't help much if index maintenance is the main issue.

The simplest solution is to drop the indexes, and re-create them when you're done importing.
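A minimal sketch of that approach, assuming hypothetical index and table names:

    -- Drop the foreign-key indexes before the bulk run ...
    DROP INDEX IF EXISTS measurements_run_id_idx;

    -- ... perform the inserts ...

    -- ... and rebuild the indexes once the data is in place.
    CREATE INDEX measurements_run_id_idx ON measurements (run_id);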

Since you can live with losing all your data if the DB crashes, you have a bunch of options to further improve performance, including:

  • Unlogged tables
  • fsync=off
  • Non-durable disk write caching on the drive that hosts the DB

The use of any of the above will eat your data if anything goes wrong. The last option might eat the file-system too.
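A hedged sketch of the first option (the staging table name is made up; unlogged tables are available from PostgreSQL 9.1 and their contents are truncated after a crash):

    -- Hypothetical example: an unlogged table skips WAL, so bulk inserts
    -- are cheaper, at the price of losing its contents on a crash.
    CREATE UNLOGGED TABLE measurements_staging
        (LIKE measurements INCLUDING ALL);

fsync = off is a server-wide setting in postgresql.conf and should only be used on an instance whose data is disposable.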

I wrote more about this in https://stackoverflow.com/questions/9407442/optimise-postgresql-for-fast-testing .

Craig Ringer