
I'm trying to better understand how writes to NVMe can be optimized.

I have a process which writes a large amount of data (~100 GB) to disk in one batch job. The data is spread across hundreds of files.

I have multiple NVMe disks available and will be writing separate files in parallel within one process, using multiple threads (~10).

Currently all the data is written to a single NVMe disk. I am wondering whether a performance improvement could be achieved by writing the data across multiple disks in parallel, or whether a single NVMe device can handle parallel writes on its own.

I'd appreciate any input.

R.Smith

2 Answers


Instead of the usual disclaimer, let’s start with a quote from Albert Einstein: “In theory, theory and practice are the same. In practice, they are not.”

Theory: NVMe drives are supposed to handle multiple parallel writes just fine, so you don't need to optimize anything yourself.

Practice: Not all NVMe drives are created equal. Enterprise ones come with powerful CPUs running garbage-collection firmware, deep I/O queues, huge RAM write buffers, and SLC flash cells for durability. Consumer-grade NVMe drives? Not so much!

So, here's the trick... Forget the basic rule of "no log-on-log" (check the link below), and instead of bombarding your NVMe with tons of small writes from different processes, create a queue: let one process gather all those small writes into bigger ones and issue them one at a time. This will a) boost performance and b) extend the life of your NVMe. (A rough sketch of the idea follows the link below.)

P.S. Here’s the link to “no log-on-log”:

https://www.usenix.org/system/files/conference/inflow14/inflow14-yang.pdf
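To make the single-writer idea concrete, here is a minimal sketch, assuming Python; the names (writer_loop, BATCH_BYTES) and the 4 MiB flush threshold are illustrative, not prescriptive:

```python
import queue
import threading

# Hypothetical sketch of the single-writer pattern described above:
# producer threads enqueue small (path, bytes) items, and one dedicated
# writer coalesces items per file into larger sequential writes before
# they hit the NVMe device. Names and sizes are illustrative.

BATCH_BYTES = 4 * 1024 * 1024   # flush once ~4 MiB has accumulated (tunable)
SENTINEL = None                 # tells the writer to drain and exit

write_queue: queue.Queue = queue.Queue(maxsize=1024)

def flush(path: str, buf: bytearray) -> None:
    # One large append instead of many small writes.
    with open(path, "ab") as f:
        f.write(buf)

def writer_loop() -> None:
    buffers: dict = {}
    while True:
        item = write_queue.get()
        if item is SENTINEL:
            break
        path, data = item
        buf = buffers.setdefault(path, bytearray())
        buf.extend(data)
        if len(buf) >= BATCH_BYTES:
            flush(path, buffers.pop(path))
    for path, buf in buffers.items():   # drain leftovers on shutdown
        flush(path, buf)

if __name__ == "__main__":
    writer = threading.Thread(target=writer_loop)
    writer.start()
    # Producers would normally run in their own threads; two toy writes:
    write_queue.put(("coalesced-output.bin", b"x" * 1024))
    write_queue.put(("coalesced-output.bin", b"y" * 1024))
    write_queue.put(SENTINEL)
    writer.join()
```

The point of the single writer is that the device sees a few large sequential writes instead of many interleaved small ones, which is gentler on the FTL of a consumer drive.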

NISMO1968

Certainly the NVMe spec allows for something like 64K queues, each of which can contain 64K entries, so the spec itself allows for some very high-end concurrency. Whether or not an actual drive supports that many is a different thing, but they'll certainly support hundreds of queues with hundreds of entries each, and all of this will help.
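If you want to see how many hardware queue pairs the kernel actually negotiated with a particular drive, one quick check on Linux is to count the blk-mq queue directories in sysfs; a small sketch, with the device name as a placeholder:

```python
import os

# On Linux, blk-mq exposes one directory per hardware queue under
# /sys/block/<dev>/mq/. Counting them shows how many queue pairs the
# kernel negotiated with this drive. The device name is a placeholder.
dev = "nvme0n1"
mq_dir = f"/sys/block/{dev}/mq"

if os.path.isdir(mq_dir):
    hw_queues = [d for d in os.listdir(mq_dir) if d.isdigit()]
    print(f"{dev}: {len(hw_queues)} hardware queue(s)")
else:
    print(f"{mq_dir} not found; is {dev} present on a blk-mq kernel?")
```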

That said, the instinct is that more drives will be better, but it really does depend on how saturated the PCIe bus/PCH is. If there are lots of little writes, there's a chance the bus isn't fully busy, so spreading the data across multiple disks may help a little; but if the bus is fully utilised, as you'd likely see with large sequential writes, then a single disk could be as quick as multiple disks.

Of course, this is all before you consider things like RAID 1/10 or RAID 0 and the impact that would have; you ideally want something like that for resilience anyway.

Ultimately you should test this; that's the proper answer. How far you can take it depends on how many drives and how much time you have for testing, but measurement is the only way to truly know.
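For example, a minimal harness, assuming Python with threads and placeholder mount points (a dedicated tool such as fio would give more rigorous numbers), might time the same workload against one disk versus several:

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor

# Rough harness: write NUM_FILES files with THREADS workers, spreading
# files round-robin over the given mount points. Run once with one
# mount point and once with several, then compare wall-clock times.
CHUNK = 4 * 1024 * 1024    # 4 MiB per write() call
CHUNKS_PER_FILE = 64       # 256 MiB per file; scale up for a real test
NUM_FILES = 20
THREADS = 10

def write_file(path: str) -> None:
    payload = os.urandom(CHUNK)           # incompressible data
    with open(path, "wb") as f:
        for _ in range(CHUNKS_PER_FILE):
            f.write(payload)
        f.flush()
        os.fsync(f.fileno())              # time the device, not the page cache

def run(mounts: list) -> float:
    paths = [os.path.join(mounts[i % len(mounts)], f"bench-{i}.bin")
             for i in range(NUM_FILES)]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=THREADS) as pool:
        list(pool.map(write_file, paths))  # re-raise any worker errors
    return time.perf_counter() - start

# Mount points below are placeholders for filesystems on your NVMe drives:
# print("one disk :", run(["/mnt/nvme0"]))
# print("two disks:", run(["/mnt/nvme0", "/mnt/nvme1"]))
```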

Chopper3