We have an Azure Databricks job that retrieves a large dataset with PySpark. The dataframe has about 11 billion rows. We are currently writing this out to a PostgreSQL database (also in Azure) using the JDBC connector to write rows out in batches to the existing table (batch size 10,000,000). The table has a handful of indexes on it, so inserts take a while. The whole operation takes dozens of hours to complete (assuming it finishes successfully at all).
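Roughly what the current write looks like (connection details, table name, and credentials are placeholders):

```python
# Current approach: Spark JDBC writer, batched INSERTs into the existing table.
(df.write
   .format("jdbc")
   .option("url", "jdbc:postgresql://<our-server>.postgres.database.azure.com:5432/<our_db>")
   .option("dbtable", "public.target_table")
   .option("user", "<user>")
   .option("password", "<password>")
   .option("driver", "org.postgresql.Driver")
   .option("batchsize", 10_000_000)
   .mode("append")
   .save())
```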
I feel like it would make more sense to use COPY to load the data into the database, but I don't see any well-established patterns for doing that from Databricks.
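For what it's worth, this is roughly the kind of thing I was picturing: psycopg2 on the workers, each partition serialized to CSV in memory and streamed in with COPY (table, columns, and connection string are placeholders, and I'm not sure this is idiomatic in Databricks):

```python
import csv
import io

import psycopg2  # would need to be installed on the cluster

PG_DSN = "host=<our-server>.postgres.database.azure.com dbname=<our_db> user=<user> password=<password>"
COPY_SQL = "COPY public.target_table (col_a, col_b) FROM STDIN WITH (FORMAT csv)"

def copy_partition(rows):
    # Serialize one Spark partition to CSV in memory, then stream it via COPY
    # instead of issuing batched INSERTs.
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow(row)
    buf.seek(0)

    conn = psycopg2.connect(PG_DSN)
    try:
        with conn:  # commit on success, rollback on error
            with conn.cursor() as cur:
                cur.copy_expert(COPY_SQL, buf)
    finally:
        conn.close()

# Run the COPY once per partition on the workers.
df.rdd.foreachPartition(copy_partition)
```

Is something like this reasonable, or is there a better-supported way to do bulk loads into Postgres from Databricks?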
I don't have a ton of Spark or Databricks experience, so any tips are appreciated.