Allow-listing executed programs is among the most effective security features. Without it, a compromised user account can execute any arbitrary payload, and users can install programs to their home directories that they should not. It is an optional feature, though, and enabling it is your decision.
Inspecting every such file system call comes with a performance hit, although the overhead can be minimized by optimizing the rules and database.
Measure whether performance is acceptable from a user's perspective. A response-time-focused objective, something like "99.9% of application API calls complete in under 1 second", will detect real problems, not just trends in resource utilization.
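One quick way to sample such an objective is to time a representative API call before and after enabling fapolicyd. A minimal sketch; the URL is a placeholder for whatever your application actually serves:

    # Sample 100 requests and show the slowest five (URL is hypothetical)
    for i in $(seq 1 100); do
        curl -s -o /dev/null -w '%{time_total}\n' https://app.example.com/api/health
    done | sort -n | tail -n 5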
First, for some background on fapolicyd, note the performance introduction from its README:
PERFORMANCE
When a program opens a file or calls execve, that thread has to wait
for fapolicyd to make a decision. To make a decision, fapolicyd has to
lookup information about the process and the file being accessed. Each
system call fapolicyd has to make slows down the system.
To speed things up, fapolicyd caches everything it looks up so that
subsequent access uses the cache rather than looking things up from
scratch. But the cache is only so big. You are in control of it,
though. You can make both subject and object caches bigger. When the
program ends, it will output some performance statistics like this into
/var/log/fapolicyd-access.log or the screen:
Permissive: false
q_size: 640
Inter-thread max queue depth 7
Allowed accesses: 70397
Denied accesses: 4
Trust database max pages: 14848
Trust database pages in use: 10792 (72%)
Subject cache size: 1549
Subject slots in use: 369 (23%)
Subject hits: 70032
Subject misses: 455
Subject evictions: 86 (0%)
Object cache size: 8191
Object slots in use: 6936 (84%)
Object hits: 63465
Object misses: 17964
Object evictions: 11028 (17%)
In this report, you can see that the internal request queue maxed out
at 7. This means that the daemon had at most 7 threads/processes
waiting. This shows that it got a little backed up but was handling
requests pretty quick. If this number were big, like more than 200,
then increasing the q_size may be necessary. Note that if you go above
1015, then systemd might need to be told to allow more than 1024
descriptors. In the fapolicyd.service file, you will need to add
LimitNOFILE=16384 or some number bigger than your queue.
Another statistic worth looking at is the hits to evictions ratio.
When a request has nowhere to put information, it has to evict
something to make room. This is done by a LRU cache which naturally
determines what's not getting used and makes its memory available for
re-use.
In the above statistics, the subject hit ratio was 95%. The object
cache was not quite as lucky. For it, we get a hit ratio of 79%. This
is still good, but could be better. This would suggest that for the
workload on that system, the cache could be a little bigger. If the
number used for the cache size is a prime number, you will get less
cache churn due to collisions than if it had a common denominator.
Some primes you might consider for cache size are: 1021, 1549, 2039,
4099, 6143, 8191, 10243, 12281, 16381, 20483, 24571, 28669, 32687,
40961, 49157, 57347, 65353, etc.
Also, it should be mentioned that the more rules in the policy, the
more rules it will have to iterate over to make a decision. As for the
system performance impact, this is very workload dependent. For a
typical desktop scenario, you won't notice it's running. A system that
opens lots of random files for short periods of time will have more
impact.
Another configuration option that can affect performance is the
integrity setting. If this is set to sha256, then every miss in the
object cache will cause a hash to be calculated on the file being
accessed. One trade-off would be to use size checking rather than
sha256. This is not as secure, but it is an option if performance is
problematic.
Set do_stat_report = 1 in the config to enable the statistics report, then restart fapolicyd if it has not been restarted recently. Analyze /var/log/fapolicyd-access.log and note the patterns of which PIDs are opening which files.
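A sketch of that workflow, assuming the default config and log paths; the report is written when the daemon exits, and the access log field layout can vary by version:

    grep do_stat_report /etc/fapolicyd/fapolicyd.conf   # expect: do_stat_report = 1
    systemctl restart fapolicyd    # pick up the setting, then run your workload
    systemctl stop fapolicyd       # statistics report is written on exit
    tail -n 40 /var/log/fapolicyd-access.log
    systemctl start fapolicyd
    # Rough view of which executables show up most often in the log:
    grep -o 'exe=[^ ]*' /var/log/fapolicyd-access.log | sort | uniq -c | sort -rn | head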
Note the ratio of "hits" to "misses". A higher hit ratio is better: accessing the in-memory database is much faster than file system access and processing. Increase obj_cache_size in the config toward the number of files your system has open at once. A possible upper bound is the number of used inodes in the data file system, as reported by df -i. That might be excessive, but if you have the memory, why not cache a couple hundred thousand entries.
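To put a concrete number on it, something along these lines; the mount point is a placeholder, and the value shown is one of the primes from the README's list:

    df -i /srv    # IUsed column gives a rough upper bound of used inodes
    grep obj_cache_size /etc/fapolicyd/fapolicyd.conf
    # e.g. obj_cache_size = 24571  (a prime near your expected working set)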
Review the configuration in fapolicyd.conf. integrity values other than none or size compute a checksum and add overhead; especially if you have lots of misses from processing new files, this can cost a significant amount of CPU. q_size should be larger than the "max queue depth" in the access report, though I doubt the queue size usually needs to be increased.
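As an illustration, the relevant fapolicyd.conf lines might look like this; the values are examples, not recommendations:

    integrity = size   # cheaper than sha256: no hash computed on each object cache miss
    q_size = 640       # keep comfortably above the reported max queue depth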
Review the rules, in compiled.rules assembled from rules.d. The RHEL and Fedora defaults populate trusted files from rpm, deny execution of unknown files, deny the ld.so trick, and allow most opens. If you do modify the rules, think about the performance impact of doing more work while that open syscall is waiting.
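To inspect the rule set the daemon actually evaluates, and to rebuild it after editing rules.d, something like the following should work on RHEL and Fedora:

    fapolicyd-cli --list          # compiled rules in evaluation order
    fagenrules --load             # regenerate compiled.rules from rules.d
    systemctl restart fapolicyd   # make sure the daemon picks up the new rules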
And as always, you can profile what exactly is going on while troubleshooting. perf top will print which functions are on CPU, and is even better when debuginfo is installed. The bcc-tools package has some neat scripts: opensnoop and execsnoop list open and execve calls in real time.
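For instance, assuming fapolicyd is running and bcc-tools is installed (the scripts land under /usr/share/bcc/tools on Fedora and RHEL):

    perf top -p $(pidof fapolicyd)    # hottest functions in the daemon
    /usr/share/bcc/tools/opensnoop    # open() calls system-wide, live
    /usr/share/bcc/tools/execsnoop    # execve() calls system-wide, live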
Ultimately, it's your decision what controls to put in place to allow execution of only authorized programs. An allow list enforced right in the exec call, like fapolicyd, is of course very powerful. A less comprehensive alternative could be to restrict shell access: do not give people interactive shells, and lock down permissions on home directories. Or, if a data file system should contain no programs at all, consider mounting it noexec (see the sketch below). A good security audit would not treat the checklist as immutable; rather, it would list the alternative controls in place and why.
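A sketch of such a noexec mount as an alternative control; the device and mount point are placeholders:

    # /etc/fstab
    /dev/mapper/data  /srv/data  xfs  defaults,noexec,nosuid,nodev  0 0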