
I have a Django application running with 2 GB of memory, and I need to receive a CSV file of more than 1 GB, read it, and load the data into a PostgreSQL DB in IBM Cloud. The problem is that if I receive the file it has to be stored locally, and I will definitely have to increase the memory of the server or handle it in a different way.

One idea would be to store it in an S3 bucket and then read it in pieces, but I don't know how to achieve that using Python because the record size is not fixed. I can't load the data using the aws_s3 PostgreSQL extension, because it (or anything similar) does not exist in the IBM Cloud PostgreSQL service, and if I understand correctly I can't install an extension either.
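
For context, this is roughly the kind of piecewise read I have in mind, using boto3 against the bucket (the bucket name, key and chunk size are just placeholders). What I am unsure about is whether splitting the pieces back into records is reliable, since a quoted CSV field can itself contain a newline:

    import boto3

    s3 = boto3.client("s3")  # endpoint/credentials omitted
    obj = s3.get_object(Bucket="my-bucket", Key="big-file.csv")

    buffer = b""
    for piece in obj["Body"].iter_chunks(chunk_size=1024 * 1024):  # read 1 MB at a time
        buffer += piece
        # everything before the last newline should be complete records
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            ...  # parse and collect one record
    # `buffer` may still hold a final record without a trailing newline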

Another way would be to use an ETL solution for this kind of job, but I don't know of any in particular that fits my requirements.

Right now I have simply created a separate instance with more memory: I turn it on when I need to load the data and shut it down when the load is finished.

2 Answers


Another option: read the data as you are receiving it. It may be less straightforward than simply reading the submitted file all at once, but it is a usual approach when dealing with large files (see the sketch after the notes below).

Note that:

  • It may be faster to process the file in chunks of N lines rather than line by line. Test it to see whether this is the case.
  • You might be unable to use Python's default CSV parser and may need to write your own.
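
A minimal sketch of that idea in a Django view, assuming the file arrives as an ordinary multipart upload under the key "file"; the table and column names (my_table, col_a, col_b) and the batch size are placeholders:

    import csv
    import io

    from django.db import connection
    from django.http import JsonResponse


    def upload_csv(request):
        # For large uploads Django streams request.FILES["file"] to a temporary
        # file on disk, so the whole CSV is never held in memory at once.
        uploaded = request.FILES["file"]
        reader = csv.reader(io.TextIOWrapper(uploaded.file, encoding="utf-8", newline=""))
        next(reader, None)  # skip a header row, if the file has one

        batch, batch_size, total = [], 5000, 0
        with connection.cursor() as cursor:
            for row in reader:
                batch.append(row)
                if len(batch) >= batch_size:  # insert every N lines, not line by line
                    cursor.executemany(
                        "INSERT INTO my_table (col_a, col_b) VALUES (%s, %s)", batch
                    )
                    total += len(batch)
                    batch = []
            if batch:  # leftover rows from the last partial chunk
                cursor.executemany(
                    "INSERT INTO my_table (col_a, col_b) VALUES (%s, %s)", batch
                )
                total += len(batch)
        return JsonResponse({"rows_loaded": total})

PostgreSQL's COPY ... FROM STDIN (psycopg2's copy_expert) would be faster still, but batched inserts keep the example close to the chunk-of-N-lines note above.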

Maybe you can use one of the options below.

  1. Write a stored procedure that does the import from a URL, if the file is available at a separate URL (a client-side sketch follows this list). Example - https://stackoverflow.com/questions/41696675/how-to-copy-a-csv-file-from-a-url-to-postgresql

  2. If the data is uploaded as a file, read it in chunks and load it in chunks.
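
For option 1, note that the stored-procedure examples behind that link rely on server-side COPY, which a managed IBM Cloud database will most likely not allow. A client-side equivalent that still never holds the whole file is to stream the URL straight into COPY ... FROM STDIN with psycopg2. A rough sketch, with the URL, connection string and table definition as placeholders:

    import psycopg2
    import requests

    CSV_URL = "https://example.com/big-file.csv"          # placeholder
    DSN = "postgresql://user:password@host:port/dbname"   # placeholder connection string


    def load_csv_from_url():
        # stream=True keeps only a small buffer of the HTTP response in memory
        with requests.get(CSV_URL, stream=True) as resp:
            resp.raise_for_status()
            resp.raw.decode_content = True  # transparently handle gzip/deflate
            with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
                # copy_expert reads the file-like object in chunks and pipes it
                # into the table, so the full 1 GB file is never loaded at once
                cur.copy_expert(
                    "COPY my_table (col_a, col_b) FROM STDIN WITH (FORMAT csv, HEADER true)",
                    resp.raw,
                )
            # leaving the psycopg2.connect() block commits the transaction

For option 2 (a direct upload), the same copy_expert call works if you pass it the uploaded file object instead of resp.raw.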