
So, I have this host group, which consists of 35k VMs, and I need to run a playbook over it.
In case it matters, the playbook is just a call to a community role that installs node_exporter.

But I'm having a hard time trying to run it against the whole group.
I know that running it as-is on such a huge host group will definitely cause an OOM, so I've made a bunch of attempts to make it both reliable (make sure it finishes and doesn't get killed) and fast (in reality it's neither).

So here's what I'm doing (a rough sketch of the resulting play follows the list):

  1. Using strategy: free
  2. Using serial: 350
  3. Collecting only facts I need:
  gather_facts: true
  gather_subset:
    - "default_ipv4"
    - "system"
    - "service_mgr"
    - "pkg_mgr"
    - "os_family"
    - "selinux"
    - "user"
    - "mounts"
    - "!all"
    - "!min"
  4. Using -f 350 when calling the playbook, to run it against 350 machines simultaneously.
  5. Using the persistent connection settings to keep ssh connections open; in ansible.cfg I have:

use_persistent_connections = True

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=1800s -o PreferredAuthentications=publickey -o ForwardAgent=yes

[connection]
ansible_pipelining = True

[persistent_connection]
connect_timeout = 1800
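Put together, the play looks roughly like this (group and role names are placeholders, the real ones don't matter here):

- hosts: all_vms                     # placeholder group name (~35k hosts)
  strategy: free
  serial: 350
  gather_facts: true
  gather_subset:                     # the subset list from point 3 above
    - "default_ipv4"
    - "system"
    - "!all"
    - "!min"
    # ... remaining subsets omitted here
  roles:
    - role: node_exporter_role       # placeholder for the community role

# called roughly like:
# ansible-playbook -i inventory node_exporter.yml -f 350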

And, well... it doesn't work. The biggest problem I see is that it isn't actually spawning 350 forks at a time. All I see is ~3-5 processes doing something on remote hosts (the most I've seen was maybe 20?), so it's painfully slow. Running it on 350 hosts takes ~1.5 h, which is insane, since running this playbook/role on 30 machines takes around 3-4 minutes to complete.

Plus, it's OOMing at some point anyway. I'm running it on a 32-core / 64 GB RAM VM dedicated to this one playbook only, and it still gets OOM-killed, which is insane.
From my understanding the serial setting should prevent that, since it would free up memory after every batch. But apparently it doesn't: memory usage just keeps growing.

Right now I'm working around it with a bash script that builds batches of machines and calls the playbook with -l "machine1:machine2:.....:machine350", but that's completely wrong.

So my questions are: why can't I run the role/playbook on the whole host group at once, why is it so slow, why is it OOMing, and how do I prevent that?

TIA for all the help!

2 Answers


Increase forks

You have significant memory and CPU, so a few hundred forks, aka worker processes, is reasonable even though they are heavy on resources. ansible.cfg:

[defaults]
forks = 350

serial is the batch size of the play: Ansible runs that many hosts at a time all the way to the end of the play before starting the next batch. Delete serial to get back to the default of 100% if you only intend to increase worker processes. serial has other effects, most notably that if all the hosts in a batch fail, the play stops.

Bad comparison: imagine you have a very large project to compile. serial is like dividing it up into smaller targets and chunks, but it's still running with make --jobs=5, so the parallelism is limited. Ansible's forks set the upper limit on worker processes.
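With forks = 350 already set as above, the play header can drop serial entirely; a minimal sketch (group and role names are placeholders):

- hosts: all_vms                 # placeholder group name
  strategy: free                 # no serial: the whole group is one batch,
  roles:                         # but at most 350 hosts are worked on at once
    - role: node_exporter_role   # placeholder role name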

Measure memory use

Find all the processes started on the Ansible controller, estimate their memory use, and find out how that is angering the virtual memory system. You didn't say what operating system the controller runs; detailed performance analysis is very platform specific.

For example, if you use a systemd-based Linux, systemd-cgtop -m will show all sessions and services. Find out the total memory use and whether it comes up against cgroup limits.
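A quick-and-dirty alternative, assuming a Linux controller with procps: total resident memory per command name for the ansible and ssh processes (RSS is reported in KiB, so treat the numbers as estimates):

ps -eo comm,rss --no-headers | awk '
  /ansible|ssh/ { rss[$1] += $2; n[$1]++ }
  END { for (c in rss) printf "%-20s %5d procs %9.1f MiB\n", c, n[c], rss[c]/1024 }'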

Ansible runs other programs, not just Python: most likely ssh for connections, at one per host per task, which is a lot. In theory these are short-lived, but connection lifecycle brings us to the next topic:

use_persistent_connections

Confusingly, use_persistent_connections is not intended for POSIX hosts; do not bother setting it to true. It is for the libssh-style persistent connections used for network gear, not the OpenSSH ssh connection plugin for Unix/Linux hosts.

In contrast, ssh_args is used by the ssh connection plugin. The ControlPersist option you added tells ssh to keep a master connection open, so subsequent low-level ssh connections to the same host skip connection setup and authentication. That normally speeds things up. However, it also adds to the number of ssh processes running, so if you quickly cycle through 35k hosts, that is a lot of ssh processes hanging around.

Consider altering ssh_args to remove the ControlMaster/ControlPersist options. You take a hit on per-connection overhead, but you don't have quite so many ssh processes running.
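A sketch of that change, keeping the other options from the question's config:

[ssh_connection]
pipelining = True
# ControlMaster/ControlPersist removed: each task pays the full connect+auth
# cost, but no long-lived ssh master processes pile up on the controller
ssh_args = -o PreferredAuthentications=publickey -o ForwardAgent=yes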

Check that your maximum number of processes or pids is quite large, maybe 60000.
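On Linux, a few places to look (the systemd unit name depends on how the playbook is launched, so treat it as an example):

ulimit -u                      # per-user process limit in the current shell
sysctl kernel.pid_max          # system-wide PID ceiling
systemctl show --property=TasksMax user-$(id -u).slice   # cgroup task limit, if running under systemd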

Smaller groups

35k hosts is not the largest inventory I've heard about, but it is big. Ansible is heavy in many ways, so you may struggle to get plays done fast enough just by scaling up.

Consider running playbooks on smaller sets of hosts at a time. --limit can target groups as well, which is much less tedious than providing thousands of hosts on the command line.

You could make your inventory smart enough to tag hosts in various ways and generate groups from that: data center region, availability zone, VM host, hardware generation. Or make up your own group names and slice the inventory into smaller groups.

With smaller groups, you can run multiple ansible-playbook --limit invocations in parallel, possibly with xargs or GNU parallel. Or split the runs between different controller hosts.
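For example, assuming GNU parallel is installed and the inventory has groups like the ones below (the names are placeholders):

# four ansible-playbook runs at a time, one per inventory group
parallel -j 4 ansible-playbook node_exporter.yml --limit {} ::: dc1_vms dc2_vms dc3_vms dc4_vms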

Push option

The default Ansible model is pushing to many remote hosts from a central controller. However, some managed hosts can have Python installed and run Ansible themselves. So you could install Ansible on every managed host and have each host run it on itself, from cron or whatever.

The ansible-pull script included with Ansible is an example of this: it downloads a playbook from version control and automatically limits the run to the local host.
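A minimal sketch of that, assuming a Git repo with a local.yml playbook at its root (the URL and schedule are placeholders):

# /etc/cron.d/ansible-pull on each managed host; splay the minute per host
17 * * * * root ansible-pull -U https://git.example.com/infra/playbooks.git -o local.yml >> /var/log/ansible-pull.log 2>&1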

This is a very different method of operation, and might not work with what you want to run on managed hosts. But it is an option.

John Mahowald

You should try using Mitogen for Ansible, which replaces the host communication layer in Ansible with a different approach and, according to its own figures, increases execution speed 1.25x - 7x and halves CPU usage.

I have used it for years in my projects and I haven't had any issues with it.
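If you try it, enabling it is typically just an ansible.cfg change; the path below is a placeholder for wherever the Mitogen release is unpacked:

[defaults]
strategy_plugins = /path/to/mitogen/ansible_mitogen/plugins/strategy
strategy = mitogen_linear    # mitogen_free also exists if you rely on strategy: free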

Tero Kilkanen