247

I am currently creating a web application that allows users to store and share files, 1 MB - 10 MB in size.

It seems to me that storing the files in a database will significantly slow down database access.

Is this a valid concern? Is it better to store the files in the file system and save the file name and path in the database? Are there any best practices related to storing files when working with a database?

I am working in PHP and MySQL for this project, but is the issue the same for most environments (Ruby on Rails, PHP, .NET) and databases (MySQL, PostgreSQL)?

B Seven
  • 3,155

13 Answers

171

Reasons in favor of storing files in the database:

  1. ACID consistency including a rollback of an update which is complicated when the files are stored outside the database. This isn't to be glossed over lightly. Having the files and database in sync and able to participate in transactions can be very useful.
  2. Files go with the database and cannot be orphaned from it.
  3. Backups automatically include the file binaries.

Reasons against storing files in the database:

  1. The maximum size of a binary column differs among databases. On SQL Server, for example, it is 2 GB when not using the FILESTREAM object. If users need to store larger files (say, a movie), you have to jump through hoops to make that magic happen.
  2. Increases the size of the database. One general concept you should take to heart: The level of knowledge required to maintain a database goes up in proportion to the size of the database. I.e., large databases are more complicated to maintain than small databases. Storing the files in the database can make the database much larger. Even if say a daily full backup would have sufficed, with a larger database size, you may no longer be able to do that. You may have to consider putting the files on a different file group (if the database supports that), tweak the backups to separate the backup of the data from the backup of the files etc. None of these things are impossible to learn, but do add complexity to maintenance which means cost to the business. Larger databases also consume more memory as they try to stuff as much data into memory as possible.
  3. Portability can be a concern if you use system specific features like SQL Server's FILESTREAM object and need to migrate to a different database system.
  4. The code that writes the files to the database can be a problem. One company for whom I consulted not so many moons ago at some point connected a Microsoft Access frontend to their database server and used Access' ability to upload "anything" using its Ole Object control. Later they changed to a different control which still relied on Ole. Much later someone changed the interface to store the raw binary. Extracting those Ole Objects was a new level of hell. When you store files on the file system, there isn't an additional layer involved to wrap/tweak/alter the source file.
  5. It is more complicated to serve the files to a website. With binary columns, you have to write a handler that streams the file binary from the database. You could also write a handler when you store file paths, but you don't have to. Again, adding a handler is not impossible, but it adds complexity and is another point of failure.
  6. You cannot take advantage of cloud storage. Suppose one day you want to store your files in an Amazon S3 bucket. If what you store in the database are file paths, you can simply change them to S3 paths. As far as I'm aware, that's not possible in any scenario with any DBMS.
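Point 6 can be sketched like this. A minimal illustration in Python with SQLite (the OP uses PHP/MySQL, but the idea is identical); the table, folder, and bucket names are all made up:

```python
import sqlite3

# If the table stores only a storage key, the backing store can change
# without a schema migration: only the resolver function changes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_file (id INTEGER PRIMARY KEY, name TEXT, storage_key TEXT)")
conn.execute("INSERT INTO user_file (name, storage_key) VALUES (?, ?)",
             ("report.pdf", "2024/07/report.pdf"))

def file_url(storage_key, backend="local"):
    # Hypothetical resolver: swapping backends is a one-line change here,
    # not a change to the rows already in the database.
    if backend == "local":
        return "/var/files/" + storage_key
    return "https://my-bucket.s3.amazonaws.com/" + storage_key

key = conn.execute("SELECT storage_key FROM user_file WHERE name = ?",
                   ("report.pdf",)).fetchone()[0]
print(file_url(key))         # local path today
print(file_url(key, "s3"))   # S3 URL tomorrow, same row
```

Had the blob itself been in the row, this migration would instead mean bulk-exporting every binary out of the database.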

IMO, deeming the storage of files in the database or not as "bad" requires more information about the circumstances and requirements. Are the size and/or number of files always going to be small? Are there no plans to use cloud storage? Will the files be served up on a website or a binary executable like a Windows application?

In general, my experience has been that storing paths is less expensive to the business, even accounting for the lack of ACID and the possibility of orphans. That does not mean the internet isn't legion with stories of missing ACID control going wrong with file storage, but it does mean that, in general, that solution is easier to build, understand and maintain.

Thomas
  • 1,825
99

In many cases, this is a bad idea. It will bloat the database files and cause several performance issues. If you stick the blobs in a table with a large number of columns it's even worse.

However! Some databases, like SQL Server, have a FILESTREAM column type. In this case, your data is actually stored in a separate file on the database server and only an ID to the file is saved in the table. In this case I don't see much of a reason not to keep the data in SQL Server. The files are automatically included as part of the server backup, and the database and the files are never out of sync.

The problem with Tony's suggestion of storing file names is that the database and the filesystem can get out of sync. The database will claim a file exists when it's been deleted on disk. If a process is modifying the database and then crashes, the files and the database will not match (i.e. no ACID with files outside of a database).

44

Yes, it is a bad practice.

Performance impact on the DB:

  • if you do a SELECT with any BLOB column, you will always do a disk access, while without BLOBs you have a chance to get the data straight from RAM (a high-throughput DB will be optimized to fit its tables in RAM);
  • replication will be slow and replication delay high, as it will have to push the BLOBs to slaves. High replication delay will cause all kinds of race conditions and other synchronization problems unless you explicitly take it into account;
  • DB backups/restores will take a lot longer;

Speed advantage — none! While some older filesystems would not handle directories with millions of files well, most modern ones have no problem at all, and in fact use the same kinds of data structures as DBs (typically B-trees). For example, ext4 (the default Linux filesystem) uses HTree.

Conclusion: it will hinder your DB performance and will not improve file retrieval performance.

Also, since you're talking about a web application — serving static files directly from the filesystem using a modern web server that can use the sendfile() syscall is a tremendous performance improvement. This is of course not possible if you're fetching files from the DB. Consider for example this benchmark, showing Nginx doing 25K req/s with 1000 concurrent connections on a low-end laptop. That kind of load would fry any kind of DB.
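You can get close to that performance even for access-controlled files by letting the application authorize the request and then handing the transfer back to the web server. A sketch of that pattern in Python (the view function, path layout, and internal location prefix are all illustrative; Nginx's header is X-Accel-Redirect, Apache's equivalent is X-Sendfile):

```python
# The app checks permissions and resolves the path stored in the DB,
# then returns an empty body: the web server intercepts the
# X-Accel-Redirect header and serves the file itself via sendfile().
def download_view(file_id, lookup_path):
    # lookup_path: callable resolving file_id -> relative path from the DB
    path = lookup_path(file_id)
    headers = [
        ("Content-Disposition", 'attachment; filename="%s"' % file_id),
        # Maps to an "internal" location block in the Nginx config.
        ("X-Accel-Redirect", "/protected/" + path),
    ]
    return 200, headers, b""  # empty body; Nginx supplies the bytes

status, headers, body = download_view("42.bin", lambda i: "2024/" + i)
print(status, headers)
```

None of this is possible when the bytes live only in a BLOB column: the application process must then stream every byte itself.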

vartec
  • 20,846
28

I would be pragmatic about it, and follow the "don't optimize yet" principle. Make the solution that makes sense at the moment, and one that you have the development resources to properly implement. There are plenty of potential problems. But those do not necessarily become real problems. E.g. It would probably not be a problem if you have 100 users. It might be a problem if you have 100,000 or 10,000,000 users. But in the latter case, there should be a basis for more development resources to deal with all the issues.

But storing the data in the database does relieve you of other problems, e.g. deciding where the files should be stored, how they should be backed up, etc. Since you are writing a web application, it would be a very good idea for security reasons to ensure that the process hosting the application has no general write access to the file system; storing files on disk instead requires you to configure the server so that the process has read/write access to the folder where the data is stored.

I would personally choose to store the data in the database, but make sure that the BLOBs are not read until they are really needed, i.e. no "SELECT * FROM ..." executed on the tables containing blobs. And I would make sure that the design makes it easy to move the data out of the database and into the filesystem if you do get performance problems. For example, store the file information in a separate File table, thus keeping the file information away from other business entities.

Assuming that you have a File class representing a file stored in the database, the coding impact of later moving it out will be minimal.
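That separation can be sketched like this. A minimal Python/SQLite illustration (class and table names are made up; the same shape works in PHP/MySQL):

```python
import sqlite3

# A separate file table, plus a thin class that hides where the bytes
# live. Moving storage out of the DB later only changes read().
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE file (
    id INTEGER PRIMARY KEY, name TEXT, size INTEGER, content BLOB)""")

class StoredFile:
    def __init__(self, conn, file_id):
        self.conn, self.id = conn, file_id
        # Metadata only -- the BLOB column is never touched here.
        self.name, self.size = conn.execute(
            "SELECT name, size FROM file WHERE id = ?", (file_id,)).fetchone()

    def read(self):
        # Lazy: the blob is fetched only when actually needed. If storage
        # moves to the filesystem, only this method changes.
        return self.conn.execute(
            "SELECT content FROM file WHERE id = ?", (self.id,)).fetchone()[0]

conn.execute("INSERT INTO file (name, size, content) VALUES (?, ?, ?)",
             ("a.txt", 5, b"hello"))
f = StoredFile(conn, 1)
print(f.name, f.size)   # metadata without loading the blob
print(f.read())         # blob fetched on demand
```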

Pete
  • 9,016
19

Microsoft released a white paper about this a few years back. It concentrates on SQL Server, but you may find some interesting information in there:

To BLOB or not to BLOB? Large Object Storage in a Database or a Filesystem?

A very concise version of their conclusion is:

When comparing the NTFS file system and SQL Server 2005, BLOBs smaller than 256 KB are more efficiently handled by SQL Server, while NTFS is more efficient for BLOBs larger than 1 MB.

I would recommend that you write some small tests for your particular use case. Bear in mind that you have to beware of caching effects. (I was amazed the first time I got save-to-disk speeds that seemed to have higher throughputs than was physically possible!)
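As a starting point, a micro-benchmark can be as small as this. A Python/SQLite sketch only — the numbers say nothing about MySQL or NTFS, so run the equivalent against your own stack, and repeat runs to separate cold-cache from warm-cache timings:

```python
import os, sqlite3, tempfile, time

# Compare reading a ~1 MB blob (the low end of the OP's range) from a
# DB column against reading the same bytes from a file on disk.
payload = os.urandom(1_000_000)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE f (id INTEGER PRIMARY KEY, data BLOB)")
conn.execute("INSERT INTO f (data) VALUES (?)", (payload,))

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)

def bench(fn, n=50):
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n  # mean seconds per read

db_read = bench(lambda: conn.execute("SELECT data FROM f WHERE id = 1").fetchone()[0])
fs_read = bench(lambda: open(tmp.name, "rb").read())
print("db: %.2f ms/read, fs: %.2f ms/read" % (db_read * 1000, fs_read * 1000))
os.unlink(tmp.name)
```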

Benjol
  • 3,737
14

The old conventional wisdom of storing files outside database might no longer hold. As a matter of principle, I'd favor integrity over speed, and with a modern DBMS, you can have both.

Tom Kyte seems to agree:

I know of no advantages to storing data I want to keep for a long time outside of a database.

If it is in the database I can

be sure it is professionally managed

backed up

recoverable (with the rest of the data)

secured

scalable (try putting 100,000 documents in a single directory, now, put them in table - which one 'scales' - it is not the directory)

I can undelete (flashback) easily

I have locking

I have read consistency...

gnat
  • 20,543
8

Yes.

If you serve a file from your filesystem, your Web server can use kernel code like sendfile() on BSD or Linux to copy the file directly to the socket. It's very fast and very efficient.

Serving files out of the database means you have to copy data from the database server's disk to database server memory, then from db server's memory to the db server's network port, then in from the network to your Web server process, then out again to the outgoing network connection.

Unless you have a really good reason not to, it's always better to serve static files from the file system.

Evan P.
  • 181
6

It's usually best to store large BLOBs in a separate table and just keep a foreign key reference to the BLOB in your main table. That way, you can still retrieve the file from the database (so you don't need any special code) and you avoid the problems surrounding external file dependencies (keeping the DB and filesystem in sync, etc.), but you only incur the overhead if you explicitly join to that table (or make a separate call). 10 MB isn't terribly large; most modern commercial databases won't have a problem with it.

The only reason I'd store a file in the filesystem is to cut down on database bandwidth. If your database is going to be shuffling a lot of these files, then you may need to split the workload and only store a file descriptor of some sort. Then you can have a separate call to load the file from another server, so you aren't tying up your database connections (and the network connections on your database server) with all those file transfers.
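The two-table layout might look like this. A Python/SQLite sketch with illustrative table and column names:

```python
import sqlite3

# The main row holds only a foreign key, so everyday queries never drag
# the blob along; you pay for it only on an explicit join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE blob_store (id INTEGER PRIMARY KEY, data BLOB);
CREATE TABLE document (
    id INTEGER PRIMARY KEY,
    title TEXT,
    blob_id INTEGER REFERENCES blob_store(id));
""")
conn.execute("INSERT INTO blob_store (data) VALUES (?)", (b"x" * 10_000,))
conn.execute("INSERT INTO document (title, blob_id) VALUES (?, ?)", ("spec", 1))

# Cheap everyday query: the blob never leaves disk.
title = conn.execute("SELECT title FROM document WHERE id = 1").fetchone()[0]

# Explicit join only when the content is really needed.
data = conn.execute("""SELECT b.data FROM document d
                       JOIN blob_store b ON b.id = d.blob_id
                       WHERE d.id = 1""").fetchone()[0]
print(title, len(data))
```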

TMN
  • 11,383
5

The famous Tom Kyte has written that they (Oracle) use the Oracle database as a file server, and it works perfectly fine — even faster than a normal filesystem — with full transactionality, no performance loss, and a single backup.

Yes, but note that they are the producer of the Oracle DB, and for any other user there are cost issues. Using a commercial DB such as Oracle for file storage is simply cost-ineffective.

However, with PostgreSQL for example, you can simply run another DB instance just for blob storage. You then have full transactional support. But transactionality costs DB space: the database needs to keep multiple blob instances for multiple concurrent transactions. On PostgreSQL this is the most painful, since duplicates of blobs created for transactions are kept even when they are no longer needed, until the VACUUM process runs.

With filesystem storage, on the other hand, you must be very careful when someone modifies a file, because a transaction can be rolled back, so a copy of the file must be kept until the old version is no longer visible.

In a system where files are only added and deleted, and transactional access to files is not an issue, filesystem storage will IMHO be the best choice.
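One way to handle the modification case safely on the filesystem is the classic write-temp-then-rename trick. A Python sketch (the function name is made up; os.replace is atomic on POSIX, so readers see either the whole old file or the whole new file, never a half-written mix):

```python
import os, tempfile

def replace_file(path, new_content):
    """Atomically replace `path` with `new_content` (bytes)."""
    dirname = os.path.dirname(os.path.abspath(path))
    # The temp file must be on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(new_content)
            tmp.flush()
            os.fsync(tmp.fileno())  # make the new bytes durable before the rename
        os.replace(tmp_path, path)  # atomic swap; old file vanishes in one step
    except BaseException:
        os.unlink(tmp_path)
        raise

target = os.path.join(tempfile.gettempdir(), "demo.bin")
with open(target, "wb") as f:
    f.write(b"old")
replace_file(target, b"new")
with open(target, "rb") as f:
    print(f.read())  # b'new'
```

This gives you crash-safe single-file updates, but not multi-file or file-plus-DB transactions — which is exactly the gap the database-storage camp points at.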

4

You might run into some of these problems:

  • Doing a SELECT * that involves the row with the large blob takes a very long time, even if you don't need the blob (of course you should do a specific select, but sometimes applications are written like this)
  • Doing a backup can take much longer. Depending on your needs, you may need to lock your tables for the duration of the backup, so you want to keep your backup time low
  • Restoring will also take much more time.
  • If you run out of space, you have to think of some way (maybe moving the whole database to a new server) to solve the problem. If you store the files on the file system, you can always mount another hard drive and set soft links.
  • Simply looking into a file for debugging or other information is not as easy. This also applies to scripts which might not have access to the database but need information from various files.

Of course you also get some benefits:

  • Backing up data and files together means they are in sync
  • Removing a file without the database knowing is not possible
  • You don't have to read the file from disk but can do it in one SQL statement
  • You can download the database, include the dump in your development environment, and have all dependencies right there

Personally I don't do it, as I find the cons much heavier than the pros. But as stated above, it totally depends on your use case.

1

Some Enterprise Content Management Systems, like SiteCore, use one database to store page data and another database to store files. They use MS SQL Server.

šljaker
  • 476
1

For a practical implementation, here is what you may be concerned about:

Benefits:

  1. All file contents are definitely synchronized with your table. As comments above say, backing up data is totally convenient, as you don't need to keep it synchronized with the file system.
  2. In code, you can get the file content directly from a SQL select.
  3. In a query, you can even filter on file content or size directly in the SQL statement.
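Benefit 3 can be sketched in a couple of lines. A Python/SQLite illustration (table and file names are made up; MySQL's LENGTH() works the same way on BLOB columns):

```python
import sqlite3

# When the content lives in a column, size checks are a plain SQL
# expression -- no directory walks or stat() calls needed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE upload (id INTEGER PRIMARY KEY, name TEXT, content BLOB)")
conn.executemany("INSERT INTO upload (name, content) VALUES (?, ?)",
                 [("small.txt", b"x" * 100), ("big.bin", b"y" * 5_000_000)])

# Find every upload over 1 MB in a single statement.
rows = conn.execute("""SELECT name, LENGTH(content) AS size
                       FROM upload WHERE LENGTH(content) > 1000000""").fetchall()
print(rows)  # [('big.bin', 5000000)]
```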

Downsides:

  1. Compared to a database whose structure is semantically the same but does not store file content, your database tends to consume radically more memory when running queries.
  2. Automatic backups can cause performance problems, though not huge ones. Imagine your database server backing everything up every 6 hours while those databases store a 10-MB file per record. That is not a scenario you want.

0

This is one of those "Can I use a razor blade to cut open an apple?" questions. Yes, you can.

Should you? Who is to say...

I guess you do it when your tooling allows it, or when it is the only tooling available (at least to meet that deadline). For example, I have used a flat screwdriver to remove a Phillips-head screw... Was that correct? Was that the right tool? Was it a bad choice?

The answer in this case is: the DB is not supposed to store files, so whenever you do it, it is theoretically wrong — the same way I theoretically shouldn't have used the flat screwdriver to remove a Phillips screw, because in doing so I lost the benefit of the Phillips screwdriver not slipping out of place and damaging whatever I am working on (just as you lose benefits if you use a DB as a file storage solution)... Yet I properly calculated the risk, and everything went just fine.

If you would like to do things right, you could use git to store your files and just keep the necessary git hashes in your DB to refer to the correct file version you need... the same way I could have walked to my tool box and gotten the damn Phillips screwdriver...
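The hash you would store is cheap to compute, because a git blob id is just the SHA-1 of a small header plus the content. A Python sketch (no git installation required to compute it; `git cat-file blob <hash>` would retrieve the version later):

```python
import hashlib

# Git's blob id is sha1(b"blob <len>\0" + content), so the DB can store
# a content address instead of the bytes themselves.
def git_blob_hash(content):
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_hash(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

Because the hash is derived from the content, the DB row pins an exact file version — modifying the file produces a new hash rather than silently changing what the row points at.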

Ordiel
  • 189