I am working on a problem that requires a very large join. The JOIN itself is pretty straightforward, but the amount of data I am processing is very large. I am wondering whether there is a preferred type of technology for very large JOINs. For example, is it more effective to use a Data Warehouse (like Snowflake) or some other MPP system like Spark?

To make the problem more concrete, I created a hypothetical problem similar to my actual one. Assume I have an accounts table with account_id and name columns.

I am working on some logic that requires account pairs that have the same name. To find pairs of accounts with the same name I can easily do something like this:

SELECT 
    account1.name, 
    account2.name
FROM accounts as account1
JOIN accounts as account2 ON account1.name = account2.name AND account1.account_id != account2.account_id

The problem I am facing is due to the amount of data I am processing. There are roughly 2 trillion records I am trying to self JOIN on. Obviously, this will take some time and some pretty serious compute. I have run a similar query in Snowflake using XL and 3XL warehouses, but after several hours of running, I canceled them. I am hoping there is a more cost-effective or time-efficient way.

Has anyone had success with massive JOINs? Are there any other tricks I could deploy? What tool did you find the most effective?

Arthur Putnam

1 Answer

I am not sure whether exporting that amount of data would be faster or cheaper.

Instead of a self-join, I would try sorting or grouping:

SELECT name
FROM accounts
GROUP BY name
HAVING COUNT(*) > 1;

OR

WITH cte as (
    select name, row_number() over (partition by name order by account_id) as rn
    from accounts
)
select name from cte where rn > 1;
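
If you still need the actual account pairs rather than just the duplicated names, one possible follow-up (a sketch, assuming the accounts table has account_id and name columns as in your query) is to reduce the table to duplicated names first and self-join only that subset, which should be far smaller than the full 2 trillion rows:

WITH dup_names as (
    select name
    from accounts
    group by name
    having count(*) > 1
),
dup_accounts as (
    select a.account_id, a.name
    from accounts a
    join dup_names d on a.name = d.name
)
select a1.account_id, a2.account_id, a1.name
from dup_accounts a1
join dup_accounts a2
  on a1.name = a2.name
 and a1.account_id < a2.account_id;  -- "<" avoids emitting each pair twice

Whether this helps depends on how many names are actually duplicated; if nearly every name repeats, the join stays huge.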

Also, choose the biggest warehouse you can: the smaller the warehouse, the more spilling to disk, which significantly slows down execution.
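
To confirm whether spilling is what is slowing things down, Snowflake records spill metrics per query. A rough check (a sketch, assuming you have access to the SNOWFLAKE.ACCOUNT_USAGE share; the query_text filter is just illustrative) could look like:

select query_id,
       warehouse_size,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage,
       total_elapsed_time / 1000 as elapsed_seconds
from snowflake.account_usage.query_history
where query_text ilike '%accounts%'   -- hypothetical filter; adjust to match your query
order by start_time desc
limit 20;

Large values in bytes_spilled_to_remote_storage are a strong sign the warehouse is undersized for the join.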

Limonka