We're working on refactoring many of our offline (Python) scripts. I'm very new to any kind of SQL, but I've been trying to learn as much as I can, especially on the performance side.
We have a script that loads a client-supplied CSV containing user-defined fields, which must be matched to unique names (and eventually keys) in a separate table.
We write each row of the CSV to our DB with 4 separate queries, one per table. The latter 3 inserts use the primary key generated by the first, so I assume that ordering is a hard constraint.
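For context, here's roughly what the current per-row loop looks like. This is a simplified sketch, not our actual code: the table and column names are made up, I'm assuming a MySQL-flavored RDS instance accessed through pymysql, and the commit placement is illustrative.

```python
import csv
import pymysql

conn = pymysql.connect(host="our-rds-endpoint", user="app",
                       password="...", db="imports")

with open("client_upload.csv", newline="") as f, conn.cursor() as cur:
    for row in csv.DictReader(f):
        # First insert: the "parent" row whose generated primary key
        # the other three inserts need.
        cur.execute(
            "INSERT INTO import_rows (client_id, raw_name) VALUES (%s, %s)",
            (row["client_id"], row["name"]),
        )
        parent_id = cur.lastrowid  # auto-increment key from the first insert

        # Three more inserts, each into its own table, all referencing parent_id.
        cur.execute(
            "INSERT INTO row_fields (row_id, field_data) VALUES (%s, %s)",
            (parent_id, row["fields"]),
        )
        cur.execute(
            "INSERT INTO row_audit (row_id, source) VALUES (%s, %s)",
            (parent_id, "csv"),
        )
        cur.execute(
            "INSERT INTO row_matches (row_id, matched_name) VALUES (%s, %s)",
            (parent_id, row["name"]),
        )
        conn.commit()  # commit frequency simplified for the sketch
```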
We're getting around 100-150 IOPS on our Amazon RDS instance, so it's painfully slow to do something like this. We proved (well, somewhat) that the script itself is not the bottleneck: I wrote another script that just looped over inserts into a dummy table with no indexes, and it ran about as fast.
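The throughput test was essentially a bare insert loop with a timer around it, something along these lines (the table name, row count, and commit placement are made up for the sketch):

```python
import time
import pymysql

conn = pymysql.connect(host="our-rds-endpoint", user="app",
                       password="...", db="imports")

N = 10_000
start = time.time()
with conn.cursor() as cur:
    for i in range(N):
        # One tiny insert per iteration, no indexes involved.
        cur.execute("INSERT INTO dummy_no_index (val) VALUES (%s)", (i,))
        conn.commit()  # simplified; meant to mirror the per-statement round trips
elapsed = time.time() - start
print(f"{N / elapsed:.0f} inserts/sec")  # came out in the same ballpark as the real script
```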
Is there anything we can do to optimize our current method? I think there has to be; I can't imagine we really need to run 2 million queries for our theoretical max-size CSV of 500,000 rows.
I've ruled out a few options (or think I have; I could be very, very wrong):
- Batch inserts: since each row's inserts go to four separate tables, can we really batch anything? (See the sketch after this list for what I was picturing.)
- Stored procedures/functions: the CSV headers are dynamic, so I don't think there's much we can do in that department.
- Building a query through string concatenation: mostly because I don't even know where to start with this, but I also think it's impossible since we need the keys from the first insert.
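For reference, the per-table batching I was picturing (and ruled out) looks something like the sketch below; again, all names are invented. The sticking point is the first table: after a batched insert, I don't see how to recover the generated primary key for each individual row, which the other three tables need.

```python
import csv
import pymysql

conn = pymysql.connect(host="our-rds-endpoint", user="app",
                       password="...", db="imports")

with open("client_upload.csv", newline="") as f:
    rows = list(csv.DictReader(f))

with conn.cursor() as cur:
    # Batch the first table's inserts into one round trip...
    cur.executemany(
        "INSERT INTO import_rows (client_id, raw_name) VALUES (%s, %s)",
        [(r["client_id"], r["name"]) for r in rows],
    )
    # ...but now which generated key belongs to which CSV row?
    # Without that mapping I can't build the batches for the other
    # three tables, which is why I ruled this approach out.
conn.commit()
```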