
We have an InfluxDB VM whose swap is constantly at 100% usage. Even if we restart the VM, swap usage reaches 100% again in about 20 minutes. However, memory usage is only about 50%. (The VM has 32 CPU cores and 128 GB of memory.)

Running free -h:

               total        used        free      shared  buff/cache   available
Mem:           123Gi        70Gi       567Mi       551Mi        52Gi        59Gi
Swap:            9Gi         9Gi          0B

This shows that we have at least 59 GiB of available memory, yet 100% of the swap is still used.
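
To see which processes actually hold the swap, the per-process VmSwap fields can be summed; a quick sketch (approximate, since shared and shmem pages complicate the accounting):

for f in /proc/[0-9]*/status; do
  awk '/^Name:/ {name=$2} /^VmSwap:/ {print $2, name}' "$f" 2>/dev/null
done | sort -rn | head -15    # kB swapped, process name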

If we run atop we see that the disk is 100% busy (both the swap and disk lines are red):

SWP |  tot    10.0G |               |  free    0.0M |  swcac 505.9M

DSK | nvme2n1 | busy 100% | read 33115 | write 527 | discrd 0 | KiB/r 19 | KiB/w 173 | | KiB/d 0 | MBr/s 63.3 | MBw/s 8.9 | avq 88.19 | avio 0.30 ms

My guess is that this is the constant inflow of data events... (but why are reads so high then?)

Memory and I/O pressure from PSI:

cat /proc/pressure/memory
some avg10=32.65 avg60=32.74 avg300=31.25 total=35534063966
full avg10=32.25 avg60=32.34 avg300=30.87 total=35182532561

cat /proc/pressure/io
some avg10=84.83 avg60=78.83 avg300=78.96 total=70337558807
full avg10=84.38 avg60=78.05 avg300=78.08 total=69619870053

Memory pressure doesn't seem too high, but I/O pressure is.
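
To watch these pressure numbers evolve over time, a simple option is:

watch -n 2 'grep . /proc/pressure/*'    # grep prefixes each line with its file name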

Running iotop, it is clear that the disk activity comes from InfluxDB:

4272 be/3 root        0.00 B/s   94.47 K/s  ?unavailable?  [jbd2/nvme2n1p1-8]
  36921 be/2 vcap     1169.95 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  36927 be/2 vcap      323.37 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  36928 be/2 vcap     2038.33 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  36941 be/2 vcap     1936.59 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
  37020 be/2 vcap      385.14 K/s    0.00 B/s  ?unavailable?  influxd -config /var/vcap/jobs/influxdb/config/influxdb.conf -pidfile /var/vcap/sys/run/influxdb/influxdb.pid
...
(Lots of influxd threads)
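
A more readable variant of the above (only active tasks, per process instead of per thread, with accumulated counters):

iotop -oPa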

SAR output:

sar -d 10 6
Linux 6.2.0-39-generic (ac2f95dd-14d9-4eed-8e2f-060615e24dce)   03/24/2024      _x86_64_        (32 CPU)

06:45:57 AM       DEV       tps     rkB/s     wkB/s   dkB/s  areq-sz   aqu-sz    await   %util
06:46:07 AM   nvme1n1      0.30     12.80      1.60    0.00    48.00     0.00     1.33    0.12
06:46:07 AM   nvme0n1      0.30      0.00      3.20    0.00    10.67     0.00     1.00    0.12
06:46:07 AM   nvme2n1   3420.80  67438.40   3687.20    0.00    20.79   106.47    31.13  100.00

06:46:07 AM       DEV       tps     rkB/s     wkB/s   dkB/s  areq-sz   aqu-sz    await   %util
06:46:17 AM   nvme1n1      1.00      0.00      9.20    0.00     9.20     0.00     0.90    0.16
06:46:17 AM   nvme0n1      0.90     16.00      9.60    0.00    28.44     0.00     0.67    0.20
06:46:17 AM   nvme2n1   3404.80  68434.40   7868.00    0.00    22.41   102.23    30.03  100.00

06:46:17 AM       DEV       tps     rkB/s     wkB/s   dkB/s  areq-sz   aqu-sz    await   %util
06:46:27 AM   nvme1n1      9.70     26.40     20.40    0.00     4.82     0.02     1.69    1.24
06:46:27 AM   nvme0n1      0.30      0.00      4.40    0.00    14.67     0.00     0.67    0.08
06:46:27 AM   nvme2n1   3215.40  46037.20  12006.40    0.00    18.05    66.12    20.56  100.00
^C

Average:          DEV       tps     rkB/s     wkB/s   dkB/s  areq-sz   aqu-sz    await   %util
Average:      nvme1n1      3.67     13.07     10.40    0.00     6.40     0.01     1.61    0.51
Average:      nvme0n1      0.50      5.33      5.73    0.00    22.13     0.00     0.73    0.13
Average:      nvme2n1   3347.00  60636.67   7853.87    0.00    20.46    91.61    27.37  100.00

Checking the running queries in InfluxDB:

It seems this swap issue occurs even when no queries are running:

> show queries
qid query        database duration status
--- -----        -------- -------- ------
265 SHOW QUERIES metrics  53µs     running

vmstat output:

vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0 32 10485756 541300   8784 108148928   11  140  3563   194   76  217  1  1 58 40  0
 0 32 10485756 638500   8764 108060800    0    0 128216    60 5181 3351  0  1 59 40  0
 1 31 10485756 505964   8780 108189872    0    0 128252   256 5077 3769  0  1 54 45  0
 0 32 10485756 663736   8744 108035424    0    0 128332     0 5047 3327  0  1 50 50  0
 0 32 10485756 536476   8752 108164376    0    0 127776    24 4087 3335  0  0 53 46  0
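
The si/so columns stay at zero after the first sample (which is the since-boot average). sar -W, from the same sysstat package as sar -d, can confirm the swap-in/out rates over a longer window:

sar -W 10 6    # reports pswpin/s and pswpout/s, six 10-second samples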

/proc/meminfo:

MemTotal:       129202084 kB
MemFree:          486060 kB
MemAvailable:   71279440 kB
Buffers:           24116 kB
Cached:         59442056 kB
SwapCached:       489676 kB
Active:         51318648 kB
Inactive:       75364416 kB
Active(anon):   27646572 kB
Inactive(anon): 28055976 kB
Active(file):   23672076 kB
Inactive(file): 47308440 kB
Unevictable:          24 kB
Mlocked:              24 kB
SwapTotal:      10485756 kB
SwapFree:              4 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:            102236 kB
Writeback:          6156 kB
AnonPages:      66728116 kB
Mapped:         43055064 kB
Shmem:            127816 kB
KReclaimable:     855024 kB
Slab:             971400 kB
SReclaimable:     855024 kB
SUnreclaim:       116376 kB
KernelStack:       10976 kB
PageTables:       747920 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    75086796 kB
Committed_AS:   95698296 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      151392 kB
VmallocChunk:          0 kB
Percpu:            17920 kB
HardwareCorrupted:     0 kB
AnonHugePages:   7997440 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      202656 kB
DirectMap2M:     6404096 kB
DirectMap1G:    124780544 kB

I am also adding some excerpts of the pmap -x output:

Address           Kbytes     RSS   Dirty Mode  Mapping
0000000000400000   15232    3684       0 r-x-- influxd
00000000012e0000   31428    6552       0 r---- influxd
0000000003191000    4668    4380     396 rw--- influxd
0000000003620000     180      92      92 rw---   [ anon ]
0000000004436000     132       0       0 rw---   [ anon ]
000000c000000000   16384    9864    9864 rw---   [ anon ]
000000c001000000   47104   28172   28172 rw---   [ anon ]
000000c003e00000    6144    5016    5016 rw---   [ anon ]
000000c004400000    2048    1616    1616 rw---   [ anon ]
000000c004600000    2048    1620    1620 rw---   [ anon ]
.
.
.
000000c033a00000  155648  120028  120028 rw---   [ anon ]
000000c03d200000    8192    8192    8192 rw---   [ anon ]
000000c03da00000  114688   92768   92768 rw---   [ anon ]
.
.
.
000000c07d000000  270336  234948  234948 rw---   [ anon ]
.
000000cecc000000  176128  174080  174080 rw---   [ anon ]
.
.
000000ced8e00000    2048    2048    2048 rw---   [ anon ]
000000ced9000000  137216  135168  135168 rw---   [ anon ]
.
.
(Towards the lower addresses)
.
.
00007fa61fdef000    2116    2044    2044 rw---   [ anon ]
00007fa620000000    9664       0       0 r--s- L3-00000023.tsi
00007fa620a00000   40048       0       0 r--s- L5-00000032.tsi
00007fa623200000   40212       0       0 r--s- L5-00000032.tsi
.
.
.
00007fa6a2c00000    9772       0       0 r--s- L3-00000023.tsi
00007fa6a3600000 2098160       0       0 r--s- 000024596-000000002.tsm
00007fa723800000    9920       0       0 r--s- L3-00000023.tsi
00007fa724200000  615764       0       0 r--s- 000024596-000000005.tsm
00007fa749c00000 2100756       0       0 r--s- 000024596-000000004.tsm
00007fa7ca000000    9768       0       0 r--s- L3-00000023.tsi
.
.
.
00007fce82403000   28660    5412    5412 rw---   [ anon ]
00007fce84000000 4194308 2575504       0 r--s- index
00007fcf84001000       4       0       0 r--s- L0-00000001.tsl
00007fcf84002000       4       0       0 r--s- L0-00000001.tsl
00007fcf84003000       4       0       0 r--s- L0-00000001.tsl
.
.
00007fcfc48f7000    1060       0       0 r--s- L0-00000002.tsl
00007fcfc4a00000  262144   35444       0 r--s- 0046
00007fcfd4a00000    2048    1988    1988 rw---   [ anon ]
00007fcfd4c00000  262144   35948       0 r--s- 0045
.
.
00007fd055a00000       4       0       0 r--s- L0-00000001.tsl
00007fd055a01000       4       0       0 r--s- L0-00000001.tsl
00007fd055a02000       4       0       0 r--s- L0-00000001.tsl
.
.
00007fd065c0f000     960     924     924 rw---   [ anon ]
00007fd065cff000    1028       0       0 r--s- L0-00000005.tsl
00007fd065e00000  262144   31952       0 r--s- 003c
.
.
00007fda27fee000    8192       8       8 rw---   [ anon ]
00007fda287ee000       4       0       0 -----   [ anon ]
00007fda287ef000   43076    1164    1164 rw---   [ anon ]
00007fda2b200000     160     160       0 r---- libc.so.6
00007fda2b228000    1620     780       0 r-x-- libc.so.6
00007fda2b3bd000     352      64       0 r---- libc.so.6
00007fda2b415000      16       0       0 r---- libc.so.6
00007fda2b419000       8       0       0 rw--- libc.so.6
00007fda2b41b000      52       0       0 rw---   [ anon ]
00007fda2b428000       4       0       0 r--s- L0-00000001.tsl
00007fda2b429000       4       0       0 r--s- L0-00000001.tsl
00007fda2b42a000       4       0       0 r--s- L0-00000001.tsl
00007fda2b42b000       4       0       0 r--s- L0-00000001.tsl
00007fda2b42c000       4       0       0 r--s- L0-00000001.tsl
00007fda2b42d000       4       0       0 r--s- L0-00000001.tsl
00007fda2b42e000     452     452     452 rw---   [ anon ]
00007fda2b49f000      16       0       0 r--s- L0-00000018.tsl
00007fda2b4af000     268     112     112 rw---   [ anon ]
00007fda2b4f2000       4       0       0 r---- libpthread.so.0
00007fda2b4f3000       4       0       0 r-x-- libpthread.so.0
00007fda2b4f4000       4       0       0 r---- libpthread.so.0
00007fda2b4f5000       4       0       0 r---- libpthread.so.0
00007fda2b4f6000       4       0       0 rw--- libpthread.so.0
00007fda2b4f7000       4       0       0 r--s- L0-00000001.tsl
00007fda2b4f8000       8       0       0 r--s- L0-00000001.tsl
00007fda2b4fa000       4       0       0 r--s- L0-00000001.tsl
00007fda2b4fb000       8       0       0 rw---   [ anon ]
00007fda2b4fd000       8       8       0 r---- ld-linux-x86-64.so.2
00007fda2b4ff000     168     168       0 r-x-- ld-linux-x86-64.so.2
00007fda2b529000      44      40       0 r---- ld-linux-x86-64.so.2
00007fda2b534000       4       0       0 r--s- L0-00000001.tsl
00007fda2b535000       8       0       0 r---- ld-linux-x86-64.so.2
00007fda2b537000       8       0       0 rw--- ld-linux-x86-64.so.2
00007fff74913000     132      12      12 rw---   [ stack ]
00007fff7499b000      16       0       0 r----   [ anon ]
00007fff7499f000       8       4       0 r-x--   [ anon ]
ffffffffff600000       4       0       0 --x--   [ anon ]
---------------- ------- ------- -------
total kB         534464172 112696540 74590512

The series cardinality is 252,390,866. (So is the VM size inadequate?)
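
For reference, the series cardinality can be checked from the 1.x CLI (the ON metrics part matches our database from the SHOW QUERIES output above):

influx -execute 'SHOW SERIES CARDINALITY ON metrics'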

VM details:

  • InfluxDB: 1.8.10
  • CPU count: 32
  • Memory: 128 GB
  • Disk: 1 TB (only 50% used)
  • AWS instance type: m6a.8xlarge (32 vCPU, 128 GB memory); EBS bandwidth is up to 10 Gbps per https://aws.amazon.com/ec2/instance-types/m6a/
  • Linux version: Linux 6.2.0-39-generic #40~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 16 10:53:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

The vm.swappiness of the VM is 60 (the default). (What does this mean? Initially I thought it was a percentage, but apparently it's an absolute number?)

How do we debug this disk usage and determine whether the IOPS limit has been reached? And what is causing so many reads rather than writes?

Update: the VM memory was increased 2x (from 128 GB to 256 GB).

Observations

meminfo:
MemFree:         9436328 kB
MemAvailable:   246346788 kB
Buffers:          829708 kB
Cached:         171495864 kB
SwapCached:       124960 kB
Active:         78087852 kB
Inactive:       167324320 kB
Active(anon):    6396424 kB
Inactive(anon):  2389588 kB
Active(file):   71691428 kB
Inactive(file): 164934732 kB

vmstat

vmstat
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0 2379520 10251664 835112 172756112    1    2   196   596    7    4  2  0 93  5  0

Disk busy in atop has dropped significantly, to 20%:

DSK |       nvme2n1 |  busy     20% |  read      51 |  write   2103 |  discrd     0 |  KiB/r     18 |  KiB/w    165  |               | KiB/d      0  | MBr/s    0.1  | MBw/s   34.0  | avq    13.95  | avio 0.94 ms

1 Answer

You don't have enough memory.

The OS has swapped out whatever unused memory pages it could; there is zero ongoing swap activity (si/so columns in vmstat), yet memory and I/O pressure remain high.

You can't rely on free output in your case, as InfluxDB memory-maps its data, and memory-mapped pages are counted as Cached/Available rather than Used. Under memory pressure, these memory-mapped pages are discarded, and InfluxDB has to read them back from disk when they are needed again.

Your data set is 409 GB but only 52 GB is available for memory-mapped files, so it is possible that your active data set is larger than the available 52 GB. InfluxDB then gets into a cycle similar to swap thrashing: it needs to access a memory-mapped page that isn't in memory, so it reads it back from disk, and at the same time discards another page because there is no room for the current one; this keeps read I/O high. But this doesn't explain high read I/O when no queries are running - you need to check whether you actually see high read I/O in that case.
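
One kernel-level way to verify such an evict-and-reread cycle is to watch the workingset refault counters in /proc/vmstat; these count pages that were evicted and then needed again (a sketch; the counters are present on recent kernels, including your 6.2):

grep -E 'workingset_(refault|restore)' /proc/vmstat

Sample it twice a few seconds apart: a fast-growing workingset_refault_file is the signature of page-cache thrashing.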

If my guess is correct, you should see a large value for Mapped in /proc/meminfo and large totals in the pmap output for the InfluxDB processes.
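
A quick way to check both (assuming a single influxd process, so pidof returns one PID):

grep -E '^(Mapped|Cached)' /proc/meminfo
pmap -x $(pidof influxd) | tail -n 1    # the summary 'total kB' line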

Possible mitigations:

  • tune InfluxDB to reduce its memory usage if possible
  • add memory
  • add swap and increase vm.swappiness up to 200 to avoid discarding memory-mapped pages, but watch the si/so columns in vmstat and keep them at zero

Note about vm.swappiness: it is a common misconception that vm.swappiness is the percentage of used memory at which swapping starts. Per the documentation, it is "the rough relative IO cost of swapping and filesystem paging, as a value between 0 and 200". With the default value of 60, if the kernel needs to free 200 pages, it will discard 140 file pages from the page-cache pool (Cached in free) and swap out 60 pages from the anon-pages pool (Used in free). With a value of 100 it discards/swaps equally from both pools. These proportions are ignored if one of the pools doesn't have enough pages or if free memory is too low.
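
To inspect and change it (shown here with a sysctl.d drop-in for persistence, the usual convention on Ubuntu):

sysctl vm.swappiness                   # read the current value
sudo sysctl -w vm.swappiness=100       # change at runtime, not persistent
echo 'vm.swappiness=100' | sudo tee /etc/sysctl.d/99-swappiness.conf   # persist across reboots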

P.S. I don't know anything about InfluxDB, so it is treated as a black box here. It could be something internal to InfluxDB that forces it to read all the data. You may find better answers on the InfluxDB support forums, but the fact remains that you are low on memory in the current configuration.

UPDATE: Additional info from /proc/meminfo shows what I expected: 43 GB of Mapped memory out of 59 GB Cached. At the same time, it shows a lot of Inactive memory.

Inactive:       75364416 kB
Active(anon):   27646572 kB
Inactive(anon): 28055976 kB
Active(file):   23672076 kB
Inactive(file): 47308440 kB

28 GB of Inactive(anon) is potentially swappable. I would add 5 GB of swap and check whether it fills up to 100%. If it does, and there is no significant swap activity (si/so), add another 5 GB of swap. If it doesn't fill up to 100%, increase vm.swappiness to 100, 150, 200 while checking si/so. As long as si/so stays close to zero, a swap increase should be a safe performance improvement, as it frees memory for a more useful page cache.
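
A minimal sketch of adding such a 5 GB swap file (assumes a filesystem where fallocate-created files are valid for swap, as ext4 is; otherwise fall back to dd; /swapfile2 is an arbitrary example path):

sudo fallocate -l 5G /swapfile2
sudo chmod 600 /swapfile2
sudo mkswap /swapfile2
sudo swapon /swapfile2
swapon --show    # verify
vmstat 1         # then keep an eye on si/so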

On the other hand, 47 GB of Inactive(file) doesn't look good. It means that 2/3 of the page cache is mostly missed and the queries are scattered too widely over the 400 GB data set. Saving 10-20 GB by increasing the swap probably won't reduce the I/O load significantly, but it is still worth a try.

AlexD