
I have a Django application running with 2 GB of memory, and I need to receive a CSV file of more than 1 GB, read it, and load the data into a PostgreSQL DB in IBM Cloud. The problem is that if I receive the file it has to be stored locally, and I will definitely have to increase the memory of the server or handle it in a different way.

One idea would be to store it in an S3 bucket and then read it in pieces, but I don't know how to achieve that using Python because the record size is not fixed. I can't load the data using the aws_s3 PostgreSQL extension, because it (or anything similar) does not exist in the IBM Cloud PostgreSQL service, and if I understand correctly I can't install an extension either.
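
For context, this is roughly the kind of piecewise read I have in mind, using boto3 against the bucket (the bucket name, key and chunk size are just placeholders). What I am unsure about is whether splitting the pieces back into records is reliable, since a quoted CSV field can itself contain a newline:

    import boto3

    s3 = boto3.client("s3")  # endpoint/credentials omitted
    obj = s3.get_object(Bucket="my-bucket", Key="big-file.csv")

    buffer = b""
    for piece in obj["Body"].iter_chunks(chunk_size=1024 * 1024):  # read 1 MB at a time
        buffer += piece
        # everything before the last newline should be complete records
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            ...  # parse and collect one record
    # `buffer` may still hold a final record without a trailing newline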

Another way would be to use an ETL solution for this kind of job, but I don't know of any in particular that fits my requirements.

Right now I have simply created a separate instance with more memory: I turn it on when I need to load the data and shut it down when the load is finished.

2 Answers


Another option: read the data as you are receiving it. It may be less straightforward than simply reading the submitted file all at once, but it is a usual approach when dealing with large files (see the sketch after the notes below).

Note that:

  • It may be faster to process the file in chunks of N lines rather than line by line. Test it to see whether this is the case.
  • You might be unable to use Python's default CSV parser and may need to write your own.
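
A minimal sketch of that idea in a Django view, assuming the file arrives as an ordinary multipart upload under the key "file"; the table and column names (my_table, col_a, col_b) and the batch size are placeholders:

    import csv
    import io

    from django.db import connection
    from django.http import JsonResponse


    def upload_csv(request):
        # For large uploads Django streams request.FILES["file"] to a temporary
        # file on disk, so the whole CSV is never held in memory at once.
        uploaded = request.FILES["file"]
        reader = csv.reader(io.TextIOWrapper(uploaded.file, encoding="utf-8", newline=""))
        next(reader, None)  # skip a header row, if the file has one

        batch, batch_size, total = [], 5000, 0
        with connection.cursor() as cursor:
            for row in reader:
                batch.append(row)
                if len(batch) >= batch_size:  # insert every N lines, not line by line
                    cursor.executemany(
                        "INSERT INTO my_table (col_a, col_b) VALUES (%s, %s)", batch
                    )
                    total += len(batch)
                    batch = []
            if batch:  # leftover rows from the last partial chunk
                cursor.executemany(
                    "INSERT INTO my_table (col_a, col_b) VALUES (%s, %s)", batch
                )
                total += len(batch)
        return JsonResponse({"rows_loaded": total})

PostgreSQL's COPY ... FROM STDIN (psycopg2's copy_expert) would be faster still, but batched inserts keep the example close to the chunk-of-N-lines note above.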

Maybe you can use one of the options below.

  1. Write a stored procedure that does the import from a URL, if the file is available at a separate URL (a client-side sketch follows this list). Example - https://stackoverflow.com/questions/41696675/how-to-copy-a-csv-file-from-a-url-to-postgresql

  2. If the data is uploaded as a file, read it in chunks and load it in chunks.
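
For option 1, note that the stored-procedure examples behind that link rely on server-side COPY, which a managed IBM Cloud database will most likely not allow. A client-side equivalent that still never holds the whole file is to stream the URL straight into COPY ... FROM STDIN with psycopg2. A rough sketch, with the URL, connection string and table definition as placeholders:

    import psycopg2
    import requests

    CSV_URL = "https://example.com/big-file.csv"          # placeholder
    DSN = "postgresql://user:password@host:port/dbname"   # placeholder connection string


    def load_csv_from_url():
        # stream=True keeps only a small buffer of the HTTP response in memory
        with requests.get(CSV_URL, stream=True) as resp:
            resp.raise_for_status()
            resp.raw.decode_content = True  # transparently handle gzip/deflate
            with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
                # copy_expert reads the file-like object in chunks and pipes it
                # into the table, so the full 1 GB file is never loaded at once
                cur.copy_expert(
                    "COPY my_table (col_a, col_b) FROM STDIN WITH (FORMAT csv, HEADER true)",
                    resp.raw,
                )
            # leaving the psycopg2.connect() block commits the transaction

For option 2 (a direct upload), the same copy_expert call works if you pass it the uploaded file object instead of resp.raw.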