
Note: I have seen somewhat similar questions here, but:

  1. none of them concern reading many files in parallel, and
  2. most are 10+ years old and concern no-longer-relevant hardware and kernel versions.

Background:

I have a Linux machine (Ubuntu, kernel 5.15.0-82-generic) with two 128-core CPUs and 128 GB of RAM. It has a RAID 5 array of 10 SSDs connected over SATA, each rated to be read at up to 530 MB/s. These are effectively read-only when in use (they are used primarily during the day and new data is added each night). My general problem is to supply dozens of cores with data from the disks in parallel.
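For reference, the array topology and the negotiated SATA link speed of each port can be checked with something like the following (the /dev/md127 device name is taken from the mdstat output in the addendum):

cat /proc/mdstat                      # array layout; full output in the addendum
sudo mdadm --detail /dev/md127        # chunk size, layout, member devices
sudo dmesg | grep -i 'SATA link up'   # negotiated link speed per port (6.0 Gbps ≈ 600 MB/s ceiling per drive)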

Benchmarking procedure

I am benchmarking reads by running instances of

dd if=/path/to/large/file of=/dev/null bs=1024 count=1048576

in parallel while watching iostat and iotop. Between runs I clear out the page cache by running

sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"

I am confident this works correctly because if I skip it, a subsequent read of a hot file finishes almost immediately, and once I drop the caches, reading that file goes back to the same (cold) performance as before.
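Concretely, a parallel run looks roughly like the loop below (the file paths are placeholders), with iostat -k 1 and iotop running in a separate terminal:

sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"   # start from a cold cache
for f in /data/file1 /data/file2 /data/file3; do       # placeholder paths, one file per dd instance
    dd if="$f" of=/dev/null bs=1024 count=1048576 &    # 1 GiB per file
done
wait                                                   # block until every dd instance has finished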

Benchmark results

If I read a single file over the software RAID, I get a read rate of somewhere between 500 and 700 MB/s, and iostat shows that this is achieved by reading from all ten drives in parallel at essentially the same speed.

If I read from the drives directly (i.e. if I supply /dev/sda, /dev/sdb, etc. as the if= argument to dd), then I can read from all of them in parallel at 530 MB/s each (i.e. reading 1 GB from all ten takes the same amount of time as reading 1 GB from a single one of them).
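That test is just the same loop pointed at the raw member devices (drive letters as in the mdstat output in the addendum):

sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
for dev in /dev/sd{a..j}; do                           # the ten RAID member drives
    dd if="$dev" of=/dev/null bs=1024 count=1048576 &  # 1 GiB from each drive
done
wait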

However, if I try to read multiple files in parallel over the software RAID, performance degrades substantially. Reading ten files in parallel, each file only achieves 150-350 MB/s, and the whole run takes roughly 4x as long as reading the same amount of data directly from the drives.

Moreover, reading over the software RAID seems to hit a hard ceiling at a total read rate of around 2.7 GB/s, as reported by iotop.

I suspect that to feed all the cores enough data that they are not sitting idle I will eventually need to move from SATA to NVMe, but I want to resolve this issue first, because it looks like either the software RAID or something upstream of it is capping the rate at which I can read from these disks.

Questions:

  1. How can I diagnose where the bottleneck is?
  2. How do I even look at the configuration options here, and what are my other options?
  3. Are there fundamental limitations of my setup that make what I'm trying to do impossible? If so are there alternate configurations I could use?

Stuff I have already tried

  • Playing with the block size of dd, making it either larger or smaller, has no effect.
  • Setting the RAID read-ahead and/or stripe-cache size has no effect (roughly the commands sketched after this list).
  • Upgrading the kernel to the slightly newer version that apt suggested dramatically hurt the benchmark results, basically capping total throughput at around 500 MB/s IIRC.
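For reference, the read-ahead and stripe-cache tweaks were along these lines (the exact values I tried varied; the ones below are just examples):

sudo blockdev --setra 65536 /dev/md127                      # md device read-ahead, in 512-byte sectors
echo 8192 | sudo tee /sys/block/md127/md/stripe_cache_size  # RAID 5 stripe cache, in pages per member (default 256)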

Addendum:

Sample output from iostat -k 1 during a benchmark run: https://pastebin.com/yuWwWbRU

Contents of /proc/mdstat:

Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10] 
md127 : active raid5 sdj1[10] sdh1[7] sdi1[8] sdf1[5] sdd1[3] sdc1[2] sdg1[6] sde1[4] sdb1[1] sda1[0]
      70325038080 blocks super 1.2 level 5, 4k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
      bitmap: 0/59 pages [0KB], 65536KB chunk

unused devices: <none>
