How to store queryable 10-100MB BLOBs?

Question

I have read several discussions about storing BLOBs in the database vs in an object storage. What I need in addition though is a functionality for querying these BLOBs. The BLOBs will be immutable compressed multi-dimensional arrays that will be 10s of MBs. I mention that, since they are not files that will be served, but rather objects that will be used for some heavy computation outside of the database.

Let me illustrate that by presenting the intended DB schema with a small example, assuming we will be storing them in a relational database.

items

| id       | price          | company_id    |
| -------- | -------------- |-------------- |
| 1        | 5              |1              |
| 2        | 25             |1              |

compute_info

| id       | item_id        | type     | info         |
| -------- | -------------- | ---------|--------------|
| 99       | 1              | simple   | [large-blob] |
| 231      | 1              | complex  | [large-blob] |
| 77       | 2              | simple   | [large-blob] |
| 449      | 1              | complex  | [large-blob] |

I want to be able to construct complex query that will return only the info for items that match certain criteria, e.g.

SELECT items.id, compute_info.info
FROM items
JOIN compute_info
ON items.id = compute_info.item_id
WHERE items.company_id = 1
AND items.price > 20
AND compute_info.type = simple

Some of the assumptions and other requirements are:

The query SELECT COUNT(*) FROM items WHERE company_id = ? will usually return <1 million.
The BLOB content itself will not be queried.
There might be multiple concurrent users. There might be 1000s of them in total, but 100s who will actually be reading the BLOBs.
There are no fixed requirements in terms of database choice or object storage. The goal is to pick the best one that solves this particular problem.
We can assume a distributed architecture.

The question is what is the proper way to store these BLOBs if performance is critical without introducing too much complexity? The options that I have considered are:

Store them as BLOBs in the database, just as illustrated in the example. This has the advantage that I could use a JOIN directly.
The info column stores a path to an object storage. This has the problem that we'd need to access a couple of thousand separate files.
Use something like TensorStore given that the application is ML. I don't have any experience with this, but I'd still need efficient random access.
Some other approach.

Doc Brown · Answer 1 · 2023-01-30T14:01:07.933

You wrote

What I need in addition though is a functionality for querying these BLOBs.

but then in a comment something which should have been part of your question, because it makes an important difference:

The data inside the BLOBs won't be queried.

So as long as you write a query which just returns non-BLOB data, with any sane relational DB system it should not make a huge difference whether the BLOBs are stored directly inside the DB or outside. The definition of indexes on the columns which might be used in the WHERE condition of your queries is not affected by the storage of the BLOBs.

Now, when it comes to getting the BLOBs themselves out of the database again, query time will be mostly a result of how quick the non-BLOB part of the query is executed plus the time to transfer the BLOB data over the network, which is mostly dependend on the amount of data to be transfered. In theory, this is still not dependend from where the BLOB data is stored, in the database or on some other server or storage media, as long as the same network between server and client is used in the middle. At least, the "Big-O" running time order of your queries will be the same.

In reality, however, you will notice performance differences. But these are heavily dependend on a lot of low-level details: which DBMS you are using (and how well it is optimized for BLOB storage, which may differ from version to version, or dependent on specific table storage configuration), what kind of external storage system you have available, and where it is located in the network between the DB and the client. It may also depend on implementation details of your client software, and how much optimization effort you invest.

So, my best recommendation here is to RTFM of your DBMS and see what it tells you about BLOB storage and how to optimize it. Make some tests, and when it turns out you don't reach the performance you like to get, then try out something more sophisticated.

Addendum: somewhere in a comment you mentioned PostgreSQL. Here is a full Wiki page about the pros and cons of storing binary files inside or outside a PostgreSQL database. Hope this helps.

score 2 · Answer 2 · answered Jan 30 '23 at 17:27

I want to be clear that this is not a black-and-white situation. There are a lot of factors here and some of them are not strictly technical.

Let's start with the pros of keeping the BLOBs in DBMS:

Simplicity - both the relation data and non-relational data is in the same place with the same access controls

Now we can talk about the potential cons:

Constraints on storage options
Increased load on the DB
DB backups slower, more space required
Horizontal scaling challenges

There really isn't much advantage to storing binary data in a relational database. Sometimes this is done to allow the data in the LOBs to be queried from within the database (not something I recommend either) but that's not relevant here. Essentially, BLOB storage is a way to stuff some bytes in a place where it doesn't really belong and use the DB as a glorified file system. It's not always a bad idea, if you have a few LOBs that you need to store, it might not be worth the trouble of finding another solution.

When you lock yourself into that as a solution, however, it will limit your options. For example, if you put the blobs in some sort of distributed storage such as S3, you have a lot more options. For example, if your users are located around the world, you can replicate these BLOBs regionally with very little effort. Your storage costs are also likely to be a lot cheaper than putting them on the disk that supports the database.

If you keep the LOB data in the DB, that DB now can potentially become a bottleneck of the solution. If a number of BLOB references are returned by one query, can you easily pull them down in parallel? If you have 100 clients pulling 100 BLOBs from the same DB at the same time, what does that mean for the requirements of the DB server? Much of that depends on the DB you choose. This is the kind of situation that undermines the simplicity of keeping everything on the DB.

It's also often the case that the RDBMS has the heaviest load of any component of a system. I therefore prefer to avoid putting things on the DB when there's no specific advantage to doing so. On the other hand, there are costs involved in adding complexity to a system.

One option here is to keep URIs in the table instead of BLOB pointers. Where that URI actually resolves to is abstract. You could still keep the BLOBs in by setting up a simple HTTPS server in front of them or you could put them in any other kind of storage you like, even in memory. But let's assume you kept them in the DB and you start to have issues because of that. You could create a separate DB instance for the BLOBs as a quick fix and/or move to some other solution without disrupting the clients.

Another thing to consider is that you might even be able to store some, if not all of the LOB data local to the clients. This might just be a local cache which could work with any option here, including BLOBs in the DB. As long as the data really doesn't change, there's no need to repeatedly transfer them across the network.

How to store queryable 10-100MB BLOBs?

2 Answers2