2

We're working on refactoring many of our offline (python) scripts. I'm very new to any kind of SQL, but have been trying to learn as much as I can, especially in the performance field.

We have a script that loads a CSV from a client, that has user-defined fields in it, that must be matched to unique names (and eventually keys) in a separate table.

We write each row of the CSV to our DB in 4 different queries, to 4 different tables. All four queries relate to the same row of data, and use the first's primary key in each.

The latter 3 use the primary key from the first insert, so I assume that's a constraint.

We're getting around 100-150 IOPs on our Amazon RDS instance, so it's painfully slow trying to do something like this. We proved (well, somewhat), that the script is not the bottleneck. I created another script that just ran a loop to 'insert' into a dummy table with no indexes, and it ran about as fast.

Is there anything we can do to optimize our current method? I think there has to be, I couldn't imagine we really have to perform 2 million queries for our theoretical max-size CSV file of 500,000 rows.

I've ruled out (or think I have, I could be very, very wrong) a few options:

  1. batch-inserts: Since each insert goes to a separate table, can we really batch insert?
  2. stored procedure/functions: The headers are dynamic, so I don't think there's anything we can really do in that department.
  3. Building a query through string concatenation: Mostly because I don't know even where to start with this, but I also think it would be impossible since we need the keys from the first insert.
Travis
  • 211
  • 1
  • 2
  • 5

1 Answers1

1
  1. Read the entire CSV into a new MyISAM table with no indexes.
  2. Write SQL statements to migrate that data en-masse to the other 4 tables.
  3. Drop that table.

100-150 IOPs -- not bad. That's about all you can get with a consumer-quality disk.

So, the trick is to minimize the IOPs.

  • What engine are you using? (InnoDB has some tunables that can help/hurt.)
  • What indexes? (The more indexes, the more I/O. And the cache may not be set optimally.)
Rick James
  • 80,479
  • 5
  • 52
  • 119