We have an Azure Databricks job that retrieves a large dataset with PySpark. The dataframe has about 11 billion rows. We are currently writing this out to a PostgreSQL database (also in Azure) using the JDBC connector to write rows out in batches to the existing table (batch size 10,000,000). The table has a handful of indexes on it, so inserts take a while. The whole operation takes dozens of hours to complete (assuming it finishes successfully at all).
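Roughly what the current write looks like (connection details, table name, and credentials are placeholders):

```python
# Current approach: Spark JDBC writer, batched INSERTs into the existing table.
(df.write
   .format("jdbc")
   .option("url", "jdbc:postgresql://<our-server>.postgres.database.azure.com:5432/<our_db>")
   .option("dbtable", "public.target_table")
   .option("user", "<user>")
   .option("password", "<password>")
   .option("driver", "org.postgresql.Driver")
   .option("batchsize", 10_000_000)
   .mode("append")
   .save())
```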
I feel like it would make more sense to use COPY to load the data into the database, but I don't see any well-established patterns for doing that from Databricks.
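For what it's worth, this is roughly the kind of thing I was picturing: psycopg2 on the workers, each partition serialized to CSV in memory and streamed in with COPY (table, columns, and connection string are placeholders, and I'm not sure this is idiomatic in Databricks):

```python
import csv
import io

import psycopg2  # would need to be installed on the cluster

PG_DSN = "host=<our-server>.postgres.database.azure.com dbname=<our_db> user=<user> password=<password>"
COPY_SQL = "COPY public.target_table (col_a, col_b) FROM STDIN WITH (FORMAT csv)"

def copy_partition(rows):
    # Serialize one Spark partition to CSV in memory, then stream it via COPY
    # instead of issuing batched INSERTs.
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow(row)
    buf.seek(0)

    conn = psycopg2.connect(PG_DSN)
    try:
        with conn:  # commit on success, rollback on error
            with conn.cursor() as cur:
                cur.copy_expert(COPY_SQL, buf)
    finally:
        conn.close()

# Run the COPY once per partition on the workers.
df.rdd.foreachPartition(copy_partition)
```

Is something like this reasonable, or is there a better-supported way to do bulk loads into Postgres from Databricks?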
I don't have a ton of Spark or Databricks experience, so any tips are appreciated.