
I copied a 10 GB file onto my SSD, which has a read bandwidth of around 3.3 GB/s as benchmarked with fio. Here is the reference: https://cloud.google.com/compute/docs/disks/benchmarking-pd-performance

I cleared the page cache using "sync; echo 3 > /proc/sys/vm/drop_caches". After that I read the file in 3 MB chunks using the open() and read() system calls. If I open the file without O_DIRECT and O_SYNC, I get a bandwidth of around 1.2 GB/s. However, if I use O_DIRECT and O_SYNC, I get around 3 GB/s. I cleared the cache before both runs, even though O_DIRECT doesn't really use the page cache anyway.
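
Roughly, the read loop is the sketch below (a minimal sketch rather than my exact program; the file path and the 4096-byte alignment are assumptions, and the O_DIRECT | O_SYNC flags are commented out for the buffered case):

    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CHUNK (3 * 1024 * 1024)    /* 3 MB per read() */

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "testfile";  /* placeholder path */

        /* Toggle the flags to compare buffered vs direct I/O. */
        int flags = O_RDONLY;                          /* buffered case */
        /* int flags = O_RDONLY | O_DIRECT | O_SYNC;      direct case */

        int fd = open(path, flags);
        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT needs an aligned user buffer; 4096 bytes covers the
           typical NVMe logical block size. */
        void *buf;
        if (posix_memalign(&buf, 4096, CHUNK) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }

        long long total = 0;
        ssize_t n;
        while ((n = read(fd, buf, CHUNK)) > 0)
            total += n;
        if (n < 0) perror("read");

        printf("read %lld bytes\n", total);
        free(buf);
        close(fd);
        return 0;
    }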

My question is: why does O_DIRECT deliver close to the device bandwidth while the buffered path does not? The transfer from the device into the page cache should run at 3.3 GB/s, and the copy from the page cache into the user buffer at around 7 GB/s, I suppose. Even if those two stages were fully serialized instead of overlapped, that would still be roughly 1/(1/3.3 + 1/7) ≈ 2.2 GB/s, so the buffered path should land somewhere between 2.2 and 3.3 GB/s. Why is it only 1.2 GB/s?

I am always reading a new 3 MB chunk each time; I am not reusing any data, so caching isn't really useful here. But the pipeline should still be bound by the I/O stage, so why isn't it?

The CPU is an Intel(R) Xeon(R) Silver 4214 @ 2.20GHz. I am not sure about the DRAM speed, but if I re-read the same 3 MB chunk multiple times I get ~8 GB/s, which I suppose is the DRAM (page cache) bandwidth, since Linux can use all free RAM as page cache.
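
For reference, the re-read measurement is essentially this sketch (file path and repetition count are placeholders; it reads the same offset every time so that, after the first pass, the data comes from the page cache rather than the device):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define CHUNK (3 * 1024 * 1024)    /* same 3 MB chunk, re-read repeatedly */
    #define REPS  1000

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : "testfile";  /* placeholder path */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char *buf = malloc(CHUNK);
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < REPS; i++) {
            /* Always pread at offset 0: only the first iteration touches the
               device; the rest are served from the page cache. */
            if (pread(fd, buf, CHUNK, 0) != CHUNK) { perror("pread"); return 1; }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.2f GB/s\n", (double)CHUNK * REPS / secs / 1e9);

        free(buf);
        close(fd);
        return 0;
    }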

Update

I ran fio with and without O_DIRECT enabled and logged the iostat output.

I used this fio command: "fio --name=read_throughput --directory=$TEST_DIR --numjobs=1 --size=10G --time_based --runtime=30s --ramp_time=0s --ioengine=sync --direct=0 --verify=0 --bs=4K --iodepth=1 --rw=read --group_reporting=1 --iodepth_batch_submit=64 --iodepth_batch_complete_max=64"

And this iostat command:

"iostat -j ID nvme0c0n1 -x 1"

My conclusion: a single-threaded read without the O_DIRECT flag cannot keep the SSD saturated with enough read requests to reach 3.3 GB/s, irrespective of the block size used. With the O_DIRECT flag, a single-threaded read does saturate the device once the block size is 64 MB or higher; at 3 MB it is around 2.7 GB/s.

Now the question is: why, without the O_DIRECT flag, can't the CPU send enough read requests to the SSD, and what is limiting them? Does it have to do with a page cache management limitation? If so, which parameter is responsible, and can I change it to see whether it affects the number of read requests sent to the device?
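
One userspace experiment I can think of (a sketch under the assumption that readahead is the limiting factor, which I have not confirmed): hint that the file will be read sequentially, which on Linux makes the readahead window larger, and then re-measure the buffered throughput:

    #include <fcntl.h>
    #include <stdio.h>

    /* Hint that fd will be read sequentially; on Linux this enlarges the
       readahead window used by the buffered (page cache) path. Call it
       right after open(), before the read loop above. */
    static void hint_sequential(int fd)
    {
        int err = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
        if (err != 0)
            fprintf(stderr, "posix_fadvise: error %d\n", err);
    }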

1 Answer


O_DIRECT is faster than a generic read because it bypasses the operating system's buffers: you are reading directly from the drive. There are a couple of reasons this can be faster, though keep in mind that at this level things get insanely setup-specific. An example of what I mean: if you have a drive whose NAND is optimized internally for 8 kB writes rather than 4 kB chunks and you read/write at the wrong size, you'll see half the performance, but spotting that requires an internal understanding of how the drive works. This can even vary within the same model, e.g. the A revision of a drive might have different optimizations than the B revision (I have seen this multiple times in the field).

But back to your question:

  1. No page cache to copy data into and back out of
  2. If you're doing something like fio you'll get more predictable read behavior
  3. A 3 MB block size is large, so you benefit extra from not dealing with the cache

Beyond that, you have to start getting deeper into benchmarking, and that's a pretty complex topic.

My general recommendation is to start with iostat. Is avgqu-sz high? Is %util close to 100%? It probably will be if you're near the drive's maximum throughput. Is await long? Do you have RAID? Which I/O scheduler did you pick? The list of things I've seen cause this sort of behavior is long, and figuring out exactly what causes which behavior is going to be very specific to your system.
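
If it's easier to check some of those per-device knobs from code rather than from the shell, something like this sketch works (the device name is just an example, substitute your own; read_ahead_kb is the readahead window the buffered path uses):

    #include <stdio.h>

    /* Print two per-device queue settings: the I/O scheduler and the
       readahead window. Adjust the device name for your system. */
    int main(void)
    {
        const char *dev = "nvme0n1";   /* example device name */
        const char *files[] = { "scheduler", "read_ahead_kb" };
        char path[256], line[256];

        for (int i = 0; i < 2; i++) {
            snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, files[i]);
            FILE *f = fopen(path, "r");
            if (f == NULL) { perror(path); continue; }
            if (fgets(line, sizeof(line), f) != NULL)
                printf("%s: %s", path, line);
            fclose(f);
        }
        return 0;
    }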

What I said at the beginning, though, may get you in the ballpark. My best guess is that because you're doing big block reads, you're saving on some kind of cache inefficiency.

Grant Curell