This question is the highest-rated question about heaps on this site. It asks about the use cases of heaps. However, I believe that very large heaps (e.g. thousands of gigabytes) are a special case that deserves its own question.

As the size of your clustered index grows, so too does the clustered index penalty, i.e. the number of logical reads required to get the remaining keys from your non-clustered indexes. The same is not true of a heap. It is my belief that heaps are unfairly dismissed because too many people either learned all of their database design from data warehouses (where primary key scans, and therefore clustered indexes, are generally a good idea) or have never worked on a database large enough to feel the cost of having a really big table.

This gives me my question: Do well-indexed very large heaps have practical use cases that are not found in well-indexed very large clustered tables?

For fear of making this question too broad, let the following constraints apply. Assume:

  • A 1 TB heap
  • A Standard Edition box
  • An OLTP environment (not a data warehouse)

Cumulatively, this means that you are limited to 128 GB of buffer pool RAM and will not regularly do large scans on the primary key. Reading the whole heap from disk is going to be miserable, so any answers about using the table as a staging or backup-only table should not apply.

J. Mini
2 Answers

Do well-indexed very large heaps have practical use cases that are not found in well-indexed very large clustered tables?

Of course. Clustered tables have many advantages, but they're not always the optimal solution. Note though that some SQL Server features require a clustered index or primary key.

A clustered b-tree index, largely materialised in the logical ordering of its leaf-level pages, does provide some benefits 'for free', but there are also disadvantages:

  1. You can only have one such index, since it is the primary storage object.

  2. The upper levels of the b-tree can represent an overhead as mentioned in the question's linked article.

    The extra I/O effect is real, though often overstated. The upper levels do have a tendency to remain in cache, but this is not guaranteed. In any event, the extra pages must be latched and navigated for each lookup.

  3. The clustered index includes all in-row columns and off-row pointers at the leaf. This makes the space between leaf index keys maximally large. In other words, the clustered index is usually the least dense index.

The same table organised as a heap would require a nonclustered index to provide a similar access method. The upper levels of that nonclustered index will be very similar to the clustered case, but the leaf level pages will be much denser, in general.

Most very large OLTP tables require multiple indexes to support a good range of queries. These nonclustered indexes will normally be narrow, preferring highly selective index searches with a small number of bookmark lookups over wider, covering indexes. Fully covering indexes are often not practical in this scenario.

Each of these access paths benefits from the higher leaf density and/or direct access provided by the RID row locator, rather than indirection via the clustering key(s) and the associated b-tree.
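
As a rough sketch of the arrangement being described here (table, column, and index names are purely illustrative, not taken from the question):

CREATE TABLE dbo.Orders
(
    OrderID    bigint        NOT NULL,
    CustomerID int           NOT NULL,
    OrderDate  datetime2     NOT NULL,
    StatusCode tinyint       NOT NULL,
    Notes      nvarchar(400) NULL,

    -- Heap: the primary key is nonclustered, so there is no clustered index.
    CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (OrderID)
);

-- Narrow, selective indexes; each leaf row carries the fixed 8-byte RID
-- (file:page:slot) rather than the clustering key(s).
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID ON dbo.Orders (CustomerID);
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate  ON dbo.Orders (OrderDate);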

There was a time when single-column integer clustered keys were (too) common, making the 8-byte RID twice as large as the clustering key. This is much less the case these days, as even so-called 'surrogates' are usually bigint. Many clustered indexes are of a wider type, string-based, or multi-column. The size of an RID is more often an advantage now.

Replacing a very large clustered table with a heap and single nonclustered index will not generally produce benefits. The advantage, if any, comes from an increasing number of optimal nonclustered indexes needed to support the workload with acceptable performance. Each nonclustered access method benefits from the more efficient lookup path.

In the limited scenario described in the question, the heap arrangement is worth investigating if there are many highly selective OLTP queries that would be best served by a medium to large number of highly optimized, narrow nonclustered indexes with selective RID lookups as well as index-only range scans.
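
Against a table shaped like the sketch above, such a query would typically compile to an index seek plus RID lookups; STATISTICS IO gives a rough way to compare logical reads with the clustered equivalent (again, the names are illustrative):

SET STATISTICS IO ON;

-- Expected plan shape: Index Seek on IX_Orders_CustomerID
-- plus a RID Lookup into the heap for the non-covered columns.
SELECT OrderID, OrderDate, StatusCode, Notes
FROM dbo.Orders
WHERE CustomerID = 42;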

How much benefit you see depends on the performance characteristics of the hardware, and the fine details of execution plans selected by the optimizer (prefetching in particular).

Depending on other tables in your workload, the large heap with secondary indexes may benefit from the built-in star join optimizations. These plans often feature base table lookups (fetches).

[execution plan image: heap lookup]

All that said, none of this is new. It's fairly rare that this aspect of performance would trump all other considerations. Nevertheless, it is a genuine factor that can be important to consider. Most people will end up perfectly happy with a clustered scheme, even if slightly better performance is technically achievable with a heap base.

Downsides

All the usual caveats with heaps still apply. Be aware of space management issues and how widening updates can result in forwarded records.

In particular, though the more direct lookups can benefit read performance, single row insert performance may be much worse. This is because heaps do not maintain a fixed insertion point for new rows—the new row will be added wherever there is free space.

This requires a free space scan, in IAM-order, using PFS pages to locate an existing page with sufficient free space available. For an extremely large heap with little or no existing free space, this scan can add very significant overhead.
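
One way to keep an eye on both of these issues (page density and forwarded records) is sys.dm_db_index_physical_stats in SAMPLED or DETAILED mode; a minimal sketch against the illustrative table above:

-- index_id = 0 is the heap itself; forwarded_record_count is NULL in LIMITED mode.
SELECT
    ips.page_count,
    ips.avg_page_space_used_in_percent,
    ips.forwarded_record_count
FROM sys.dm_db_index_physical_stats
     (DB_ID(), OBJECT_ID(N'dbo.Orders'), 0, NULL, 'SAMPLED') AS ips;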

Metadata caching means subsequent row insertions in the same statement do not pay the full price. A single row insert into a huge full heap is the worst-case scenario. Remember that heap tables can also be partitioned, which can mitigate this concern if each partition is kept to a reasonable size.
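
A partitioned heap is simply a heap created on a partition scheme; a minimal sketch (the boundary values and names are made up):

CREATE PARTITION FUNCTION pf_OrderID (bigint)
    AS RANGE RIGHT FOR VALUES (100000000, 200000000, 300000000);

CREATE PARTITION SCHEME ps_OrderID
    AS PARTITION pf_OrderID ALL TO ([PRIMARY]);

-- No clustered index, so each partition is its own heap and the
-- free space scan is bounded by the size of a single partition.
CREATE TABLE dbo.OrdersPartitioned
(
    OrderID   bigint    NOT NULL,
    OrderDate datetime2 NOT NULL
) ON ps_OrderID (OrderID);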

Paul White

One thing I didn't see mentioned in this thread is how SQL Server tends not to deallocate data when you DELETE from a heap. Since the use case here is OLTP, I think it is worth mentioning.

Now, if the deletes are pretty rare, then this might not hurt you. But if you have a decent amount of deletes, then be prepared that the space for the deleted rows might not be deallocated and the table will just grow and grow, as if you hadn't deleted that data.

(I have a feeling that it was deemed impractical for a delete to check all pages in that extent for zero usage and, if so, mark the extent as not used in the IAM page. Pure speculation on my side.)

I currently have a pretty extreme case. It is NServiceBus with SQLTransport, i.e. a queue-handling backbone for developers that implements the storage in SQL Server. The normal state, at rest, for a table is 0 rows. Rows are added when an entry is added to the queue, and then deleted when that row is handled (consumed) by the app.

I have many apps/databases, and the prod environment is then replicated one way or another onto many non-prod environments. On one of those environments (SQL Server instances), and for one database, I have one table using 40 GB of data. That was with 0 rows in the table. A query similar to the one below took about 60 seconds:

SELECT TOP(1) col1, col2
FROM tbl
WHERE a_column_having_a_unique_index = 42

And, again, there are 0 rows in that table. One can dig into the execution plans and look for scans, but that isn't the point here.
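
For reference, the mismatch is easy to confirm from the allocation DMVs; a minimal sketch, assuming tbl resolves from the default schema and is a heap:

SELECT
    SUM(ps.row_count)                      AS total_rows,
    SUM(ps.reserved_page_count) * 8 / 1024 AS reserved_mb
FROM sys.dm_db_partition_stats AS ps
WHERE ps.object_id = OBJECT_ID(N'tbl')
  AND ps.index_id = 0;   -- index_id 0 = the heap itself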

So, I had to carefully handcraft a stored procedure to rebuild the heaps (and also the indexes to get rid of ghost LOB-data).

Care had to be taken to handle, among other things, blocking due to the high query load on these tables (being a queueing backbone for most of our applications): implement LOCK_TIMEOUT to bail if we waited more than x seconds (so we don't block the queue users when we in turn are blocked), parameterize it, monitor and tweak that timeout value, etc.
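
A heavily simplified sketch of that approach (not the actual procedure; the name, the dynamic SQL shape, and the 5-second default are all placeholders):

CREATE OR ALTER PROCEDURE dbo.RebuildQueueHeap
    @TableName     sysname,
    @LockTimeoutMs int = 5000
AS
BEGIN
    SET NOCOUNT ON;

    -- Bail out rather than blocking the queue readers/writers for long.
    DECLARE @sql nvarchar(max) =
        N'SET LOCK_TIMEOUT ' + CONVERT(nvarchar(12), @LockTimeoutMs) + N'; '
      + N'ALTER TABLE ' + QUOTENAME(@TableName) + N' REBUILD;';

    BEGIN TRY
        EXEC sys.sp_executesql @sql;
    END TRY
    BEGIN CATCH
        -- 1222 = lock request time out period exceeded
        IF ERROR_NUMBER() = 1222
            PRINT 'Skipped rebuild: could not get the table lock in time.';
        ELSE
            THROW;
    END CATCH;
END;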

I'm still working in my non-prod environments. I know that the prod environment has an even higher amount of dead space in the heaps, so the first rebuild has to be done manually, monitoring as it proceeds and being prepared to terminate. My environment manages to rebuild about 1 GB per second, and the prod environment has about 400 GB of dead space, spread over several databases and a handful of tables in each database.

When SQL Server rebuilds, an X lock is required, which I might have to wait for, and all the other readers coming in will want an S lock (in order to avoid live-lock, I guess). I can't have myself "hanging" the queue for more than a handful of seconds, hence the need to be careful with manually monitoring the first run and having a decent timeout for the daily rebuilds that will be scheduled.

How I wish that SQLTransport used a clustered table instead. Then I could have done something useful with that time.


I did consider things such as TRUNCATE TABLE and DELETE without a WHERE clause using TABLOCK (both of which should deallocate unused allocations). But I deemed that impractical since the queues have no downtime. Furthermore, I would have to be careful to first check that there are 0 rows and base the action on that, in case something adds entries after my check and before my DROP/TRUNCATE/DELETE. That is why, in the end, I decided on a non-destructive method.
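
For completeness, those rejected options would look something like this (same illustrative table name as above; a heap generally only deallocates emptied pages on DELETE when an exclusive table lock is taken, hence the TABLOCK hint):

-- Deallocates everything, but needs exclusive access to the table.
TRUNCATE TABLE dbo.tbl;

-- The TABLOCK hint lets the delete deallocate emptied heap pages
-- instead of leaving them allocated to the table.
DELETE FROM dbo.tbl WITH (TABLOCK);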

Tibor Karaszi