
I have an AMD EPYC 7502P 32-core Linux server (kernel 6.10.6) with 6 NVMe drives where I/O performance has suddenly dropped. All operations take far too long: installing package updates takes hours instead of seconds (or at most minutes).

I've tried running fio on a filesystem on top of the RAID5 array. There's a huge difference in the clat metric:

    clat (nsec): min=190, max=359716k, avg=16112.91, stdev=592031.05

The stdev value is extreme: the average completion latency is about 16 µs, but the maximum is roughly 360 ms, so a small fraction of writes is stalling for hundreds of milliseconds.

full output:

$ fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=posixaio, iodepth=1
fio-3.33
Starting 1 process
random-write: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [F(1)][100.0%][w=53.3MiB/s][w=13.6k IOPS][eta 00m:00s]
random-write: (groupid=0, jobs=1): err= 0: pid=48391: Wed Sep 25 09:17:02 2024
  write: IOPS=45.5k, BW=178MiB/s (186MB/s)(10.6GiB/61165msec); 0 zone resets
    slat (nsec): min=552, max=123137, avg=2016.89, stdev=468.03
    clat (nsec): min=190, max=359716k, avg=16112.91, stdev=592031.05
     lat (usec): min=10, max=359716, avg=18.13, stdev=592.03
    clat percentiles (usec):
     |  1.00th=[   11],  5.00th=[   12], 10.00th=[   14], 20.00th=[   15],
     | 30.00th=[   15], 40.00th=[   15], 50.00th=[   15], 60.00th=[   16],
     | 70.00th=[   16], 80.00th=[   16], 90.00th=[   17], 95.00th=[   18],
     | 99.00th=[   20], 99.50th=[   22], 99.90th=[   42], 99.95th=[  119],
     | 99.99th=[  186]
   bw (  KiB/s): min=42592, max=290232, per=100.00%, avg=209653.41, stdev=46502.99, samples=105
   iops        : min=10648, max=72558, avg=52413.32, stdev=11625.75, samples=105
  lat (nsec)   : 250=0.01%, 500=0.01%, 1000=0.01%
  lat (usec)   : 10=0.01%, 20=99.15%, 50=0.76%, 100=0.03%, 250=0.06%
  lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 500=0.01%
  cpu          : usr=12.62%, sys=30.97%, ctx=2800981, majf=0, minf=28
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,2784519,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=178MiB/s (186MB/s), 178MiB/s-178MiB/s (186MB/s-186MB/s), io=10.6GiB (11.4GB), run=61165-61165msec

Disk stats (read/write):
    md1: ios=0/710496, merge=0/0, ticks=0/12788992, in_queue=12788992, util=23.31%, aggrios=319833/649980, aggrmerge=0/0, aggrticks=118293/136983, aggrin_queue=255276, aggrutil=14.78%
  nvme1n1: ios=318781/638009, merge=0/0, ticks=118546/131154, in_queue=249701, util=14.71%
  nvme5n1: ios=321508/659460, merge=0/0, ticks=118683/138996, in_queue=257679, util=14.77%
  nvme2n1: ios=320523/647922, merge=0/0, ticks=120634/134284, in_queue=254918, util=14.71%
  nvme3n1: ios=320809/651642, merge=0/0, ticks=118823/135985, in_queue=254808, util=14.73%
  nvme0n1: ios=316267/642934, merge=0/0, ticks=116772/143909, in_queue=260681, util=14.75%
  nvme4n1: ios=321110/659918, merge=0/0, ticks=116300/137570, in_queue=253870, util=14.78%

Probably one disk is faulty. Is there a way to determine which disk is the slow one?
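One thing I could try (an untested sketch) is to benchmark each member device individually with a non-destructive, read-only fio job and compare the per-drive completion latencies:

    # Untested sketch: read-only latency test on each RAID member (needs root).
    # --readonly is a safety guard; --direct=1 bypasses the page cache.
    for dev in /dev/nvme{0..5}n1; do
        echo "== $dev =="
        fio --name=readlat --filename="$dev" --ioengine=posixaio --rw=randread \
            --bs=4k --iodepth=1 --direct=1 --runtime=30 --time_based --readonly \
            | grep -E 'read: IOPS|clat \('
    done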

All disks have similar SMART attributes; nothing stands out. SAMSUNG 7.68 TB:

Model Number:                       SAMSUNG MZQL27T6HBLA-00A07
Firmware Version:                   GDC5902Q
Data Units Read:                    2,121,457,831 [1.08 PB]
Data Units Written:                 939,728,748 [481 TB]
Controller Busy Time:               40,224
Power Cycles:                       5
Power On Hours:                     6,913
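To compare the same fields across all six drives quickly (assuming smartmontools is installed), a loop like this does the job:

    # Print the same SMART fields for every drive, one block per device
    for dev in /dev/nvme{0..5}; do
        echo "== $dev =="
        smartctl -a "$dev" | grep -E 'Model Number|Firmware Version|Data Units|Controller Busy Time|Power On Hours'
    done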

Write performance appears to be very similar across the drives:

iostat -xh
Linux 6.10.6+bpo-amd64 (ts01b)  25/09/24        _x86_64_        (64 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.0%    0.0%    4.3%    0.6%    0.0%   90.2%

 r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz Device
0.12      7.3k     0.00   0.0%    0.43    62.9k md0

6461.73   548.7M     0.00   0.0%    0.22    87.0k md1
3583.93    99.9M     9.60   0.3%    1.13    28.5k nvme0n1
3562.77    98.9M     0.80   0.0%    1.15    28.4k nvme1n1
3584.54    99.8M     9.74   0.3%    1.18    28.5k nvme2n1
3565.96    98.8M     1.06   0.0%    1.16    28.4k nvme3n1
3585.04    99.9M     9.78   0.3%    1.16    28.5k nvme4n1
3577.56    99.0M     0.86   0.0%    1.17    28.3k nvme5n1

 w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz Device
0.00      0.0k     0.00   0.0%    0.00     4.0k md0

366.41    146.5M     0.00   0.0%   14.28   409.4k md1
8369.26    32.7M     1.18   0.0%    3.73     4.0k nvme0n1
8364.63    32.7M     1.12   0.0%    3.63     4.0k nvme1n1
8355.48    32.6M     1.10   0.0%    3.56     4.0k nvme2n1
8365.23    32.7M     1.10   0.0%    3.46     4.0k nvme3n1
8365.37    32.7M     1.25   0.0%    3.37     4.0k nvme4n1
8356.70    32.6M     1.06   0.0%    3.29     4.0k nvme5n1

 d/s     dkB/s   drqm/s  %drqm d_await dareq-sz Device
0.00      0.0k     0.00   0.0%    0.00     0.0k md0
0.00      0.0k     0.00   0.0%    0.00     0.0k md1
0.00      0.0k     0.00   0.0%    0.00     0.0k nvme0n1
0.00      0.0k     0.00   0.0%    0.00     0.0k nvme1n1
0.00      0.0k     0.00   0.0%    0.00     0.0k nvme2n1
0.00      0.0k     0.00   0.0%    0.00     0.0k nvme3n1
0.00      0.0k     0.00   0.0%    0.00     0.0k nvme4n1
0.00      0.0k     0.00   0.0%    0.00     0.0k nvme5n1

 f/s f_await  aqu-sz  %util Device
0.00    0.00    0.00   0.0% md0
0.00    0.00    6.68  46.8% md1
0.00    0.00   35.24  14.9% nvme0n1
0.00    0.00   34.50  14.6% nvme1n1
0.00    0.00   33.98  14.9% nvme2n1
0.00    0.00   33.06  14.6% nvme3n1
0.00    0.00   32.33  14.8% nvme4n1
0.00    0.00   31.72  14.6% nvme5n1
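To catch transient stalls rather than a single averaged snapshot, the same counters can be sampled over short intervals, restricted to the array and its members:

    # Sample extended stats every 5 s for one minute, skipping the since-boot summary (-y)
    iostat -xhy md1 nvme0n1 nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1 5 12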

What does look somewhat problematic is the interrupt activity:

$ dstat -tf --int24 60
----system---- -------------------------------interrupts------------------------------
     time     | 120   128   165   199   213   342   LOC   PMI   IWI   RES   CAL   TLB 
25-09 10:53:45|2602  2620  2688  2695  2649  2725   136k   36  1245  2739   167k  795 
25-09 10:54:45|  64    64    65    64    66    65  2235     1    26    16  2156     3 
25-09 10:55:45|  33    31    32    32    32    30  2050     1    24    10  2162    20 
25-09 10:56:45|  31    31    30    35    30    33  2303     1    26    63  2245     9 
25-09 10:57:45|  36    29    27    34    35    35  2016     1    23    72  2645    10 
25-09 10:58:45|   9     8     9     8     7     8  1766     0    27     4  1892    15 
25-09 10:59:45|  59    62    59    58    60    60  1585     1    22    20  1704     9 
25-09 11:00:45|  25    21    21    26    26    26  1605     0    26    10  1862    10 
25-09 11:01:45|  34    32    32    33    36    31  1515     0    23    24  1948    10 
25-09 11:02:45|  21    23    23    25    22    24  1772     0    27    27  1781     9 

The columns with the elevated interrupt counts map to the 9-edge vectors of all the drives (nvme[0-5]q9), e.g.:

$ cat /proc/interrupts | grep 120:
IR-PCI-MSIX-0000:01:00.0    9-edge      nvme2q9

EDIT: the 9-edge vectors are probably related to the md (software RAID) devices.
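To map all of the nvme queue vectors to their IRQ numbers at once, instead of grepping one IRQ at a time, a one-liner like this works:

    # List every nvme MSI-X vector with its IRQ number
    # ($1 is the IRQ, $NF the queue name in /proc/interrupts)
    grep -E 'nvme[0-5]q' /proc/interrupts | awk '{printf "%-6s %s\n", $1, $NF}'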


1 Answer


The issue was probably caused by a malfunctioning connector. After reconnecting all drives and checking the cables, the random-write benchmark looks fine again; the clat max value is back in a "normal" range (5723.2k ns, i.e. about 5.7 ms).

$ fio --name=random-write --ioengine=posixaio --rw=randwrite --bs=4k --numjobs=1 --size=4g --iodepth=1 --runtime=60 --time_based --end_fsync=1
random-write: (groupid=0, jobs=1): err= 0: pid=1557777: Fri Sep 27 09:45:32 2024
  write: IOPS=54.9k, BW=214MiB/s (225MB/s)(13.0GiB/62177msec); 0 zone resets
    slat (nsec): min=481, max=426106, avg=1293.48, stdev=595.80
    clat (nsec): min=171, max=5723.2k, avg=12235.29, stdev=8655.24
     lat (usec): min=10, max=5724, avg=13.53, stdev= 8.80
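As a sanity check after reseating the drives (not part of the fix itself), it's worth confirming the array came back clean:

    # Confirm the RAID array is still healthy
    cat /proc/mdstat
    mdadm --detail /dev/md1 | grep -E 'State :|Failed Devices'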