Ansible at scale: where it breaks for us
Ansible was the right tool for our team for a long time. Agentless, SSH-based, YAML we could reason about. We grew from ten hosts to a few hundred on it comfortably. Then we grew past a thousand, and the edges started to show. This is what actually hurts, and what we did about each one.
Pain point 1: run time
A full site playbook against 1200 hosts took 45 minutes on a warm workstation. Most of that was the gather-facts step. I verified:
time ansible -i inventory -m setup all > /dev/null
# real 14m12.002s
14 minutes to collect facts from 1200 hosts, with forks: 50 (the default is 5). Bumping forks to 200 helped but made my laptop fan audible:
ansible -i inventory -m setup --forks 200 all > /dev/null
# real 4m17.003s
4 minutes is tolerable, but the workstation becomes a CPU bonfire. We moved full-fleet runs to an AWS Fargate ansible runner and kept local laptops for small, scoped changes.
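On the runner we keep the higher fork count in ansible.cfg instead of passing --forks every time; roughly this (200 matches that runner's CPU budget, tune for yours):
# ansible.cfg on the Fargate runner
[defaults]
forks = 200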
Fact caching helped the most. Redis-backed fact cache meant the 14 minutes became seconds on repeat runs:
# ansible.cfg
[defaults]
gathering = smart
fact_caching = redis
fact_caching_connection = ansible-facts.internal:6379
fact_caching_timeout = 86400
gathering = smart skips fact collection if the cache is fresh. In practice facts change slowly, so 24h cache is fine.
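When we do need fresh facts right away (after a kernel or network change, say), ansible-playbook's --flush-cache clears the cache for that run:
ansible-playbook -i inventory site.yml --flush-cache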
Pain point 2: check mode lies
Ansible has a --check mode that’s supposed to show what would change without making changes. In a small role, this works. In a big role with dynamic filenames, templated commands, and command/shell modules that don’t support check mode, you get a mix of:
- Modules that honor check and report truthfully.
- Modules that run anyway.
- Modules that skip entirely and return “skipped”.
- Modules that run but always report “changed” regardless.
The net effect is that --check output is not trustworthy on real codebases. We spent some time annotating our roles with check_mode: false or changed_when: false where appropriate, but every new play adds potential for drift. We now treat check mode as a sanity check, not as a safety net. Real safety comes from a staging environment.
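For reference, the kind of annotation we add to read-only command tasks looks roughly like this; the script path is made up:
- name: Report the current schema version (read-only)
  command: /usr/local/bin/schema-version   # hypothetical script
  register: schema_version
  changed_when: false   # a read never changes anything, so never report "changed"
  check_mode: false     # safe to really run under --check because it only reads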
Pain point 3: the rolling update dance
Updating a cluster of 300 similar machines without taking everything down at once is something ansible can do with serial: 10% and handlers. It mostly works. What does not work well:
- If a handler fires and fails on host N, the play stops, leaving the cluster in a mixed state. Recovery is manual.
- serial: 10% rounds down; on a 12-host cluster it runs 1 host at a time, not 2.
- max_fail_percentage: 5 is a blunt instrument.
We wrote a small custom strategy plugin that knows about our service’s specific “can we drain this host” check, and only advances to the next batch when the drained hosts are back in service. This is 400 lines of Python we probably should not have had to write. It works.
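Before the plugin, the play looked roughly like the sketch below; the drain script, health endpoint, and group name are ours, not anything Ansible ships:
- hosts: app_cluster              # group name is ours
  serial: "10%"
  max_fail_percentage: 5
  pre_tasks:
    - name: Drain the host from the load balancer (our script)
      command: /usr/local/bin/drain-host {{ inventory_hostname }}
      delegate_to: localhost
  roles:
    - app
  post_tasks:
    - name: Wait until the host reports healthy before the next batch starts
      uri:
        url: "http://{{ inventory_hostname }}:8080/healthz"
        status_code: 200
      register: health
      until: health.status == 200
      retries: 30
      delay: 10
      delegate_to: localhost      # checked from the runner, not the target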
Pain point 4: inventory generation
Our inventory is dynamic, pulled from a CMDB. The dynamic inventory script takes 12 seconds to run cold. Every playbook invocation pays that 12 seconds. We cache the inventory output:
ansible-inventory --list > /tmp/inv.json
ansible-playbook -i /tmp/inv.json site.yml
Faster, but now we risk running against stale inventory. We refresh the cache every hour in a cron. It is fine. I just wish there were a first-class "inventory cache" setting in Ansible the way there is for facts.
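The hourly refresh is just cron on the runner; cmdb_inventory.py here stands in for our real CMDB script:
# crontab on the runner
0 * * * * ansible-inventory -i cmdb_inventory.py --list > /tmp/inv.json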
Pain point 5: Python on targets
Ansible modules run in Python on the target. When we onboarded some minimal Alpine-based VMs without Python, the raw module still worked, but most modules did not. We either install python3 on every target or use raw for the bootstrap. This is a minor thing at small scale and a constant mild irritation at large scale. Mitogen helps by reducing how many times Python needs to start up, but in practice Mitogen has compatibility breakage with newer ansible versions.
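The bootstrap itself is a one-task play using raw, which only needs SSH and a shell on the target; the group name is illustrative:
- hosts: new_alpine_hosts   # group name is illustrative
  gather_facts: false       # the setup module needs Python, so skip it
  tasks:
    - name: Install python3 using raw, which runs over plain SSH
      raw: apk add --no-cache python3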
Where ansible is still great
For most of our infrastructure, ansible is fine. The places it hurts are:
- Huge fleet sizes where run time dominates.
- Very dynamic environments where “apply and check” is not a good fit.
- Bootstrap of minimal systems where Python is not there yet.
For the first problem we are moving some of the ad-hoc tasks to SSH-based tooling and keeping ansible for declarative config. For the second we are shifting more state into immutable images (Packer and cloud-init do the heavy lifting, covered in my post on baking Hetzner images with Packer and cloud-init). For the third, we install python via a tiny kickstart/cloud-init step before ansible gets involved.
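The cloud-init side of that bootstrap is tiny; a minimal sketch of the user-data we attach before the first ansible run:
#cloud-config
packages:
  - python3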
What we almost switched to
We evaluated Salt and Chef. Both are capable. Both would have carried roughly the same switching cost. The deciding factor was that our team has 20 people who are comfortable with ansible, and we did not want to retrain. If we were starting from scratch in 2024 with a 1000-host fleet, I would seriously consider a mix of Packer for images and a thin push tool (ansible-like) for the ongoing drift.
Reflection
Ansible is “good enough” for a very wide range of teams. The places it breaks are all places where you grew larger than its assumptions. The fixes are incremental, not a rewrite. Know where the edges are, monitor your run times, and be willing to write a small custom plugin when the default strategies don’t match your rollout model.
Related: see my post on Terraform state locks and the S3 bucket that wouldn’t let go for our other infra-as-code pain story.