
Infrastructure as code tells us to use tools that automate our builds. Great. Tools like Ansible, Chef, Puppet, SaltStack, and others push us towards describing what the infrastructure should look like, while the tool resolves the differences.

In SaltStack those bits are called states. If a state does not match reality, the tool will resolve the difference for us. In other words, we are writing a test for our infrastructure, and if the test fails, the tool fixes it on its own. At least that's the idea.
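For illustration, a minimal Salt state might look something like this (nginx is just a placeholder here, not part of the original question):

    nginx:
      # desired state: the package is present...
      pkg.installed: []
      # ...and the service is running; Salt converges reality towards this
      service.running:
        - enable: True
        - require:
          - pkg: nginx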

XP teaches us to use TDD, and the question is whether it's applicable to infrastructure. The tooling suggests it is.

I can imagine a few types of tests that could be very useful.

We write smoke tests that are bundled with the deployed service to ensure that, end to end, the deployed service works and runs as expected. This would be an API call and/or a systemctl check to make sure that what we just deployed works. A lot of this functionality can be covered by the same states, since tools like Ansible have states to make sure a service is running.
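As a rough sketch of that overlap in Ansible (the service name and health endpoint are made up for the example):

    - name: Ensure the service is running and enabled
      ansible.builtin.systemd:
        name: myservice                       # hypothetical service name
        state: started
        enabled: true

    - name: Smoke-test the HTTP health endpoint
      ansible.builtin.uri:
        url: http://localhost:8080/health     # hypothetical endpoint
        status_code: 200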

There is the Molecule project, which allows running individual roles (as Ansible calls its states) against Docker or another temporary virtualisation engine. This forces you to decouple roles and allows executing them in isolation from the playbook while working on them. Its tests mostly allow mocking the variables that the role is supposed to work with. Other examples seem like a duplication of the Ansible engine, though (assert that a file belongs to a user...).
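A minimal molecule.yml scenario could look roughly like this (the Docker driver and the myrole_port variable are assumptions for the sketch):

    driver:
      name: docker
    platforms:
      - name: instance
        image: ubuntu:22.04        # throwaway container the role converges
    provisioner:
      name: ansible
      inventory:
        group_vars:
          all:
            myrole_port: 8080      # mock the variable the role expects
    verifier:
      name: ansible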

The ThoughtWorks Tech Radar currently praises tools like InSpec, Serverspec, or Goss for validating that the server meets the spec. But we are already writing a spec, aren't we?
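For illustration, a Goss spec (the sshd entries are placeholders) largely restates what the state already declares - which is exactly the overlap in question:

    # goss.yaml - checked with `goss validate`
    service:
      sshd:
        enabled: true
        running: true
    port:
      tcp:22:
        listening: true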

So is there a point in further testing infrastructure if we are describing the infrastructure in states/roles? I suspect this becomes more necessary in larger organisations, where one team provides the spec and another follows it, or where there is a large set of roles and you want to run only a subset of them and get a speed benefit from tests. I'm struggling to see why you would write a test if you could have a role/state answering the same question.

JackLeo

4 Answers


In short, I see two categories of tests for your infrastructure: 1) does it have everything you need to run your application, and 2) is it free of superfluous stuff.

First and foremost, you can treat the test suite of your actual software as a kind of "meta test" for your infrastructure. As long as you create the infrastructure from scratch for each test run, and the test suite runs completely on that infrastructure (i.e., doesn't use outside services), the fact that the whole suite is green means that your codified infrastructure is sufficient as well.

Second, especially from the security perspective, you can write tests against your infrastructure. I.e., if one part of your infrastructure is a VM running Linux, you might write a test that does a port scan against that VM, to make sure that there are no unintentional ports open, which may have appeared as an unintended side effect of an apt-get install. Or you could write tests that check whether any unexpected files were changed after your proper test suite has completed. Or you could check the ps output of your VMs or Docker containers for unexpected processes, build white-lists, etc., and thus get automatic notification if some 3rd-party package changed in an undocumented (or unnoticed) way in some upgrade.
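A sketch of such checks in Goss syntax (the ports, processes, and files below are illustrative assumptions, not from this answer):

    # negative assertions: fail when something unexpected shows up
    port:
      tcp:23:
        listening: false       # e.g. telnet must never be open
    process:
      telnetd:
        running: false         # unexpected daemon must not be running
    file:
      /etc/sudoers:
        exists: true
        owner: root
        mode: "0440"           # catch unexpected permission changes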

This second type of test is, in a way, similar to what you would do in a classical ops setting anyway, i.e., hardening your servers, checking for intrusions, avoiding resource exhaustion, and such.

AnoE

It looks like everybody here assumes an IaC tool always runs as expected, but I can tell you (from my own experience) that this is not always the case; otherwise unit tests would actually be useless.

I remember a picture captioned "Ansible playbook ran, everything is fine", with a building burning in the background...

Applying a declarative state and having the server actually be in that declared state are two different things, from my point of view and experience at least.

Consider a broad and heterogeneous environment, spread across multiple data centres, reachable through public networks, etc. There are multiple reasons why a state might not be applied, either fully or partially.

For all these reasons, there is room for unit tests that give you a snapshot of the actual server state, which, again, might differ from the intended state.

So I'd say yes, unit tests are useful even in an IaC-managed environment.

EDIT

What about the non-regression side of the dev branch of the IaC code base? Would you make changes to your code in the dev branch and merge it to the prod branch, hoping not to break everything? Unit tests are so valuable, and usually so simple to implement, that I do not see why one would code without them.
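As a sketch, such a gate could be a pipeline that converges and verifies the roles in a throwaway container before the merge (GitHub Actions and Molecule are just one possible combination; the branch names are assumptions):

    name: iac-regression-tests
    on:
      pull_request:
        branches: [prod]           # gate merges from dev into prod
    jobs:
      test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: pip install ansible-core molecule molecule-plugins[docker]
          - run: molecule test     # converge in a container, then verify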

Reference (in French, sorry about that): https://fr.slideshare.net/logilab/testinfra-pyconfr-2017

Pier

IMHO it's rather redundant to write TDD tests for items entirely covered by the IaC state specification. Doing so implies that the effectiveness of the IaC is questionable - why would you use it if so?

Looking at it from a different perspective: IaC itself (if/when done properly) incorporates capabilities that are already tested and considered to be functioning reliably. This is what makes it attractive, and what makes writing matching TDD tests redundant.

For example, an IaC configuration specifying a system with SSH installed already incorporates reliable checking for SSH being correctly installed and, if not, mechanisms for properly installing it. That makes a TDD test checking if SSH is installed redundant. If your IaC config also specifies that sshd should be started and listening on a specific port, then TDD tests for sshd running and listening on the respective port would also be redundant.
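To make that concrete, a sketch of such a configuration in Ansible (package name and modules as commonly used on Debian-like systems; details are assumptions):

    - name: Ensure OpenSSH server is installed
      ansible.builtin.apt:
        name: openssh-server
        state: present

    - name: Ensure sshd is running and enabled
      ansible.builtin.service:
        name: sshd
        state: started
        enabled: true

    # a separate "is SSH installed and running?" test would re-check
    # exactly what these two states already assert and enforce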

Note that my answer is not targeting TDD or any other type of testing that checks if your IaC configuration as a whole fits a certain purpose. That remains valid and can be used in TDD, CI, or similar checks during the development of that IaC specification - I believe @AnoE's answer is applicable in such a case.

Dan Cornilescu

In my experience, one of the main differences between Dev and Ops is "heavy runtime dependencies". Installing packages heavily depends on repositories, networks, or valid keys; or, let's say, instantiating a new cloud server - it depends on your provider's resources.

In terms of server provisioning, even if you did not change your provisioning code, your image will be valid most of the time - but sometimes not. So I think testing is really essential for producing working images.

If you go beyond single servers, things get even worse... how will you test reachability in whole network setups, including DNS resolution, routing, and firewalling? Even if your IaC provider's API works as expected (I've seen weird issues in this area), I really like TDD in this case.
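To illustrate the kind of declarative check meant here (shown in Goss syntax purely as an example; hostnames and ports are placeholders):

    dns:
      db.internal.example:
        resolvable: true           # DNS resolution works from this host
    addr:
      tcp://db.internal.example:5432:
        reachable: true            # routing and firewalling allow the connection
        timeout: 500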

Because I've not found any testing tools covering all of this area, we wrote one in our spare time: https://github.com/DomainDrivenArchitecture/dda-serverspec-crate

So I think TDD is really important in the DevOps world!

jerger