0

I have a SQLite table named hashbands that looks like this:

hashband | file_id | window_id
------------------------------
potato   | 0       | 0
potato   | 1       | 0

The table has 100M+ rows (lol). I need to find all hashbands that have two or more distinct file_id values, and then I need to fetch all rows that contain those hashbands and order the results by hashband. Right now I'm using this:

  WITH file_id_counts AS (
    SELECT hashband, COUNT(DISTINCT(file_id)) as count
    FROM hashbands
    GROUP BY hashband
    HAVING COUNT > 1
  ) SELECT hashband, file_id, window_id
    FROM hashbands
    WHERE hashband IN (SELECT hashband from file_id_counts)
    ORDER BY hashband

Does anyone see ways to expedite this query? Any pointers would be helpful!

duhaime
  • 121
  • 5

2 Answers2

2

Likely the majority of the performance issue you're seeing is just the normal constraints of the data limitations of SQLite. Your query is of sound mind, and I don't believe there's much you can do to optimize it other than re-write the WHERE predicate to something more efficiently relational with an INNER JOIN like this:

WITH file_id_counts AS (
    SELECT hashband, COUNT(DISTINCT(file_id)) as count
    FROM hashbands
    GROUP BY hashband
    HAVING COUNT > 1
  ) SELECT hashband, file_id, window_id
    FROM hashbands
    INNER JOIN file_id_counts
        ON hashbands.hashband = file_id_counts.hashband
    ORDER BY hashband

This is logically equivalent and the reasoning it could be faster is because the IN clause is syntactical sugar for a bunch of OR clauses which can be less efficient than a direct single equality operator like the INNER JOIN is doing above.

Additionally, ensuring your hashbands table has an index on at least (hashband) or possibly on (hashband, file_id) should help (if one doesn't exist already).

Finally, if possible, removing your ORDER BY clause, and instead doing the sorting in your consuming application would probably help a small amount as well. While this mostly just moves the onus of the sort to a different part of the call stack, generally sorting in the database can add a little extra complexity that sometimes can be resolved and slightly more efficient to do in the consuming application. Plus, sorting is really presentation logic in my opinion (at least when not used for a functional purpose of the query).

Where does this SQLite database live?...a mobile application, or on its own server?

J.D.
  • 40,776
  • 12
  • 62
  • 141
2

Using a window aggregate would presumably be substantially faster than the self-join that you currently have. This is available from version 3.25

Unfortunately, in SQLite, DISTINCT cannot be used in an OVER aggregate, but we can simulate it with MAX of DENSE_RANK (if file_id is nullable then it can be a little more complex)

The calculation is quite simple: DENSE_RANK will give us a ranking, but with tied results, e.g. 1,1,2,3,3,3,4. Then we use MAX to get the highest of that. PARTITION BY works like GROUP BY but for window aggregates, in other words the function is calculated separately over each partition.

WITH file_id_ranked AS (
    SELECT *,
        DENSE_RANK() OVER (PARTITION BY hashband ORDER BY file_id) rnk
    FROM hashbands
),
file_id_counts AS (
    SELECT *,
        MAX(rnk) OVER (PARTITION BY hashband) AS count
    FROM file_id_ranked
)
SELECT hashband, file_id, window_id
    FROM file_id_counts
    WHERE count > 1
    ORDER BY hashband;

I strongly recommend an index on hashband, file_id, window_id in that order.

Charlieface
  • 17,078
  • 22
  • 44