Test Driven Infrastructure

The more I talk to fellow ops from other companies, the more I get the impression that test-driven practices in Operations are somewhat fragmented. Infrastructure as Code is young, but it is maturing quickly, and the benefits of TDD are still there to collect. Why exactly would one need to apply TDD practices to infra code? It might not be obvious when the infra code is relatively small or you are just starting the transition to IaC. Let's try to paint a picture of an untested process of working with infrastructure configuration. For simplicity, let's concentrate on cloud resources only, leaving OS-level configuration aside:

An ops guy keeps his infrastructure code in Git and uses a tool like Terraform or CloudFormation. Whenever he needs to change that infrastructure, he creates a branch in the infrastructure repository (potentially in several repositories), then sets up environment variables for the tool, such as cloud access credentials, network IPs, and many others. Then he runs the tool to verify the new branch and occasionally creates the resources to check whether the tool can actually create them on the cloud provider. Now, without testing, this verification may be performed ‘by eye’, by looking at the tool's report or at the actual resources created.

This process is complex and has many points of failure, because it relies on user interaction, which can differ from one person or run to the next.

+-----+     +-----+     +-------+     +-----------------+     +------------+
| git | --> | dev | --> | tests | --> | deployment test | --> | deployment |
+-----+     +-----+     +-------+     +-----------------+     +------------+
               ^-----------/<-----------------/

Here are the weak points of such a process:

This leads us to the discussion of what can be done better in this process. Looking at software development practices and keeping in mind that infrastructure code is also code, we immediately notice that we are missing unit, integration, and end-to-end tests. Here is a list of benefits that come to mind when I think about these tests:

Let's try to imagine the same ops guy with a TDD practice in place. To apply an infrastructure change, he starts by describing the desired state in a test suite; let's say he wants to create a new Autoscaling Group with a new Launch Configuration and a new Security Group. Then he adds the necessary changes to the code and, after running local checks to verify syntax and consistency, starts the test suite. The test suite takes care of creating a full dev environment with all dependencies, such as a VPC with subnets and the like, adjusted to be more cost-efficient, since no big processing power is required. Then it checks that the actual state is consistent with the desired one; this step may take a few iterations. Once the desired state is in place, the test suite can tear down the development environment, and the CI/CD pipeline can take over to make a test run of the deployment on a copy of the Production environment.

+-----+     +-----+     +(auto)-+     +(auto)-----------+     +------------+
| git | --> | dev | --> | tests | --> | deployment test | --> | deployment |
+-----+     +-----+     +-------+     +-----------------+     +------------+
               ^-----------/<-----------------/
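
To make the "describe the desired state first" step more concrete, here is a minimal Python sketch. It is not tied to Terraform or CloudFormation: the resource names, attribute values, and the helpers diff_state and fetch_actual_state are all hypothetical. In a real suite the actual state would be read from the cloud provider's API after the tool has run; here it is a hard-coded snapshot so the example runs offline.

```python
# Desired state, written first (the "test" half of TDD).
# All names and values below are hypothetical.
DESIRED = {
    "security_group": {"name": "web-sg", "ingress_ports": [443]},
    "launch_configuration": {"name": "web-lc", "instance_type": "t2.micro"},
    "autoscaling_group": {"name": "web-asg", "min_size": 1, "max_size": 2},
}


def diff_state(desired, actual):
    """Return a sorted list of readable mismatches between desired and actual."""
    problems = []
    for resource, attrs in desired.items():
        if resource not in actual:
            problems.append(f"{resource}: missing")
            continue
        for key, want in attrs.items():
            got = actual[resource].get(key)
            if got != want:
                problems.append(f"{resource}.{key}: want {want!r}, got {got!r}")
    return sorted(problems)


def fetch_actual_state():
    """Stand-in for querying the real cloud API (assumption: offline snapshot)."""
    return {
        "security_group": {"name": "web-sg", "ingress_ports": [443]},
        "launch_configuration": {"name": "web-lc", "instance_type": "t2.small"},
    }


if __name__ == "__main__":
    # Each printed line is one reason the dev environment is not yet
    # in the desired state; an empty result means the iteration is done.
    for line in diff_state(DESIRED, fetch_actual_state()):
        print(line)
```

The loop of "change the code, run the tool, compare again" ends when diff_state returns an empty list; only then would the suite tear the dev environment down.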

It is important to mention that, when dealing with Infrastructure as Code, it is not possible to reliably mock a cloud provider: providers constantly evolve, throttle connections, have different workloads at different times, and so on. While we can check the code locally for consistency and syntax, it is nearly impossible to verify that it will actually run unless you hit the real API of a cloud provider. So far I have not seen a better way to test the code than to interact with a real cloud provider.
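
As a rough illustration of what a local consistency check can catch, here is a toy Python sketch. It is not how Terraform or CloudFormation validate their input; the resource map, the depends_on key, and find_dangling_references are made up for this example. It finds references to resources that are never defined, which is exactly the class of error that needs no cloud access.

```python
def find_dangling_references(resources):
    """Return sorted (resource, missing_dependency) pairs for undefined references."""
    defined = set(resources)
    dangling = []
    for name, spec in resources.items():
        for dep in spec.get("depends_on", []):
            if dep not in defined:
                dangling.append((name, dep))
    return sorted(dangling)


# Hypothetical resource map; "security_grp" is misspelled on purpose.
RESOURCES = {
    "vpc": {},
    "subnet": {"depends_on": ["vpc"]},
    "security_group": {"depends_on": ["vpc"]},
    "launch_configuration": {"depends_on": ["security_grp"]},
    "autoscaling_group": {"depends_on": ["launch_configuration", "subnet"]},
}

if __name__ == "__main__":
    print(find_dangling_references(RESOURCES))
```

A check like this passes or fails in milliseconds, but a clean result only says the code is internally consistent. Whether the provider accepts the request, has capacity, or throttles the call still has to be found out against the real API.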

Now, I do understand that there is a significant overhead in writing tests for infrastructure, because one basically has to code the desired state twice: first as part of the tests and then in the actual code. It might seem like a slowdown of Ops work, but it is not. Coding tests is fast once the process is set up, and it becomes only a matter of adding standardized blocks of tests. This is the price to pay for the efficiency boosts from standardized Dev environments, easier debugging, and migration testing.
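
One way to picture "standardized blocks of tests" is a small table-driven layout, sketched below in Python. The check functions, the CHECKS registry, and the table rows are hypothetical and only show the shape of the idea: once the reusable checks exist, a new test is a single row of data.

```python
# Reusable check "blocks": each takes the observed resource (or None)
# and an expected value, and returns True when the check passes.
def check_exists(resource, _expected):
    return resource is not None


def check_instance_type(resource, expected):
    return resource is not None and resource.get("instance_type") == expected


def check_open_ports(resource, expected):
    if resource is None:
        return False
    return sorted(resource.get("ingress_ports", [])) == sorted(expected)


CHECKS = {
    "exists": check_exists,
    "instance_type": check_instance_type,
    "open_ports": check_open_ports,
}

# Adding a test means adding one row: (resource name, check name, expected value).
TEST_TABLE = [
    ("security_group", "exists", None),
    ("security_group", "open_ports", [443]),
    ("launch_configuration", "instance_type", "t2.micro"),
]


def run_table(table, observed):
    """Run every row against the observed state and return the failed rows."""
    failures = []
    for name, check, expected in table:
        if not CHECKS[check](observed.get(name), expected):
            failures.append((name, check))
    return failures


if __name__ == "__main__":
    # Hypothetical observed state, standing in for data read from the cloud.
    observed = {
        "security_group": {"ingress_ports": [443]},
        "launch_configuration": {"instance_type": "t2.small"},
    }
    print(run_table(TEST_TABLE, observed))
```

The up-front cost sits in writing the check functions; after that, the "code it twice" overhead shrinks to one line per expectation.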

In other posts I would like to talk more about TDD practices in developing infrastructure and go into more detail.