
We are having trouble with our NFS server. It holds about 30 TB of small files (mail storage served over NFS). After a few hours, the first or second client (where the mail services run) stops responding on its service ports, because it becomes impossible to open files from NFS. Directory listings still work.

The load average is huge (the screenshot shows 400, but after a few minutes it reaches 10,000), yet the CPU is idle.

I have tried everything. It does not appear to be related to max open files, max TCP connections, etc.

Here is the mount point; the client is Rocky Linux 8:

192.168.91.7:/iwdata/mail on /data type nfs4 (rw,noatime,vers=4.1,rsize=65536,wsize=65536,namlen=255,acregmin=1800,acregmax=1800,acdirmin=1800,acdirmax=1800,soft,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.91.12,local_lock=none,addr=192.168.91.7)

[screenshot: huge load, CPU idle]


1 Answer


Calling lsof -n should yield a listing of open files. From there you can pipe it through all sorts of filters to identify the culprit.

Expect that to take a while, however, so I'd run it once and redirect the output to a file under /tmp.
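
For example, a rough sketch along these lines gives a count of open files per process from that dump (assuming the default lsof column layout, where field 1 is the command name and field 2 is the PID):

lsof -n > /tmp/lsof.out
# count entries per "PID command" pair, biggest offenders first; NR>1 skips the header line
awk 'NR>1 {print $2, $1}' /tmp/lsof.out | sort | uniq -c | sort -rn | head -20

The top few lines of that output are the processes holding the most file descriptors.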

Alternatively, a quicker but less informative approach would be to run:

 find /proc -maxdepth 3 -path '/proc/*/fd/*' -type l > /tmp/results.txt

This dumps a list of every open file descriptor, per PID (which should be quicker), to /tmp/results.txt, and you can perform post-processing on it.

For example:

cut -d"/" -f3 /tmp/results.txt  | sort | uniq -c | sort -k1n

That gives you the number of open file descriptors per PID (biggest offenders at the bottom).

What this might be indicating, however, is that the NFS server is not keeping up with all of your open-file requests. In that case you may want to tone down the number of files permitted to be open per user (or at least per process), or globally on the system.

This isn't going to fix your demand problem (lots of open-file requests), but it will prevent the entire system from locking up and the NFS server from being overwhelmed by the volume of requests.
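
As a quick sanity check on the global side, something like this shows how close the box is to the system-wide ceiling (file-nr reports allocated, unused and maximum file handles):

cat /proc/sys/fs/file-nr
sysctl fs.file-max

If the first number in file-nr is approaching fs.file-max, you are hitting the global limit rather than a per-process one.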

If you find from the output that one process has thousands and thousands of files open, you should check cat /proc/<process_id>/limits and see what the max open files limit is set to.
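
For example, to pull out just that line (substitute the PID you identified above for the placeholder):

grep 'Max open files' /proc/<process_id>/limits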

If it's set to unlimited or a large number, reduce it for the offending process and see what happens. There are a lot of ways to do this, and the right approach depends on what the process is and how it was started.

  • Use the LimitNOFILE directive if the service is a systemd service.
  • Set a nofile limit in /etc/security/limits.conf for the affected user if an individual user is responsible (a sketch of both follows below).
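
A rough sketch of both approaches; the service name dovecot and the user mailuser here are only examples, substitute whatever actually owns the runaway file descriptors:

# systemd drop-in, e.g. /etc/systemd/system/dovecot.service.d/limits.conf
[Service]
LimitNOFILE=16384
# then: systemctl daemon-reload && systemctl restart dovecot

# /etc/security/limits.conf (takes effect on the user's next session)
mailuser  soft  nofile  8192
mailuser  hard  nofile  16384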

You can also alter the limit at runtime for a running process using prlimit --nofile=<soft>:<hard> -p <pid>, but your mileage may vary, and I would suggest avoiding that unless you understand what the outcome might be.
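
For instance, assuming the earlier counts pointed at PID 4321 (a made-up PID), this would cap it at 8192 open files without a restart:

prlimit --nofile=8192:8192 -p 4321

Existing descriptors stay open; new opens beyond the limit will fail with "Too many open files" (EMFILE).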

After making a change like this, expect programs that hit the new limit to error out with "Too many open files" errors.

Finally, just to stress, this isn't fixing the demand-side problem, just restricting the supply of open files. Effectively it's a band-aid to keep your system going, at the cost of denying resources to greedy (or greedier) processes.

The demand side issue has multiple solutions depending on the cause.

  • Raise the limit of parallel open file requests on the NFS server if possible.
  • Manually open a file on the NFS server and time how long it actually takes. Is it actually slow? (A quick timing sketch follows this list.)
  • Find out why the software is opening all these files and what can be done in configuration or software to alter the behaviour.
  • Spread the demand to other resources at your disposal.
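
For the timing check, a minimal sketch; the file path is just an example, pick a real mail file and compare the run from the client (over /data) with the same file accessed directly on the server (under /iwdata/mail):

time cat /data/example-user/Maildir/cur/some-message > /dev/null
nfsiostat 5 3 /data    # per-mount RPC latency, if nfs-utils' nfsiostat is installed

If opening and reading over NFS is orders of magnitude slower than doing the same on the server itself, the problem is on the wire or in the server, not in the clients.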