Terraform does not cut the mustard

2023-Nov-30

I don't get why folks love Terraform so much.

I've worked with Terraform for about 1.5 years now and I'm not impressed. I think I was spoiled by Amazon's internal (and later, external) tools. Terraform unfortunately falls short in many (fixable!) ways. Let's get into it.

Open-loop control

Open-loop and closed-loop control are concepts from control systems. An open-loop control system makes changes without looking at the result. A closed-loop system makes changes and checks that the outcome is as intended.

Terraform checks the state file, queries the service for its current state, works out the diff, creates a plan, then executes it. Other than API errors, there's little that stops Terraform once it starts.

For comparison, CloudFormation has "rollback triggers." These are CloudWatch alarms that can stop the deployment and revert all changes. CloudFormation can also monitor your system for some time after the deployment just in case the alarm takes a few minutes to fire.
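To make the difference concrete, here's a minimal sketch of a closed-loop deployment in Python. It isn't how CloudFormation or Terraform are implemented. The apply, rollback and alarm_firing hooks are hypothetical callables you'd supply yourself, and the bake and poll intervals are made-up defaults.

```python
import time
from typing import Callable


def deploy_closed_loop(
    apply: Callable[[], None],
    rollback: Callable[[], None],
    alarm_firing: Callable[[], bool],
    bake_seconds: float = 300.0,
    poll_seconds: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Apply a change, watch an alarm for a bake period, undo it if needed.

    Returns True if the change survived the bake period and False if it
    was rolled back. All three hooks are hypothetical, caller-supplied.
    """
    try:
        apply()
    except Exception:
        # A failure part-way through the change still triggers the undo.
        rollback()
        return False

    waited = 0.0
    while True:
        # Closed loop: keep checking the outcome after the change is made.
        if alarm_firing():
            rollback()
            return False
        if waited >= bake_seconds:
            return True
        sleep(poll_seconds)
        waited += poll_seconds
```

An open-loop version of the same function would just call apply() and return. Everything after the try block is the part I'm missing.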

This leads into the next missing feature.

No rollback

Terraform stops when it gets an error, leaving the system in whatever state it got to.

This is useful when developing your changes because you can look at the state your resources are in and fix your code. This is not so great in production systems. If something goes wrong, then you want a reliable "oh no please undo" button.

It varies from team to team, but undoing a prod change with Terraform can be slow. Once you've noticed the problem you have to create a revert commit, get it PR'd, then wait for your deployment system to try again. This is slower than a rollback, which means more customer impact.

Terraform needs that "oh no" button.

Really nasty failures can break the state file, which, hey another segue!

State file surgery

If you hit a bad enough problem, then you can break the state file. I've only seen this happen when upgrading the Terraform version, so it's rare.

Rare or not, it sucks. It's possible to end up with ongoing customer impact and a deployment system that's broken itself. Even git revert doesn't save the day here.

The best fix I've seen for this is S3's versioning. Restore the last working state file version, get your git revert in, and hope nothing else accidentally recreates the known bad version while you're working. In practice that means no-one is allowed to run plan until you're done.
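Picking which version to restore is easy to get wrong mid-incident, so here's a rough sketch of that one step. It assumes you've already listed the state file's versions (for example with boto3's list_object_versions, whose "Versions" entries carry VersionId and LastModified keys) and that you know the ID of the bad version. The function name is made up, and "the version just before the bad one is the last working one" is my assumption, not a guarantee.

```python
from typing import Any, Mapping, Optional, Sequence


def last_version_before(
    versions: Sequence[Mapping[str, Any]],
    bad_version_id: str,
) -> Optional[str]:
    """Return the VersionId of the newest version older than the bad one.

    Each entry needs "VersionId" and "LastModified" keys. Returns None if
    the bad version isn't in the list or nothing older exists.
    """
    bad = next((v for v in versions if v["VersionId"] == bad_version_id), None)
    if bad is None:
        return None
    # Only consider versions written strictly before the bad one.
    older = [v for v in versions if v["LastModified"] < bad["LastModified"]]
    if not older:
        return None
    return max(older, key=lambda v: v["LastModified"])["VersionId"]
```

You'd still have to copy that version back over the current object yourself, and keep everyone else's hands off the state while you do it.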

Tangentially-related are locks. A crashed Terraform process can lock out future deployments. You can find yourself editing DynamoDB lock items by hand mid-incident just to get your revert deployment started.

I'm not sure what Terraform should do to fix this. I believe some folks have tried to remove the state file, but that didn't work.

Environment management

Many systems run in multiple places. They have beta stages, regions, and some teams even let their own devs have a copy of the system running in a personal account.

The Terraform way to handle this is ... copy-paste. I've seen more than one system with Terraform laid out as:

my-great-project
|-- beta/
|   |-- main.tf
|   `-- variables.tf
|-- prod-us/
|   |-- main.tf
|   `-- variables.tf
`-- prod-eu/
    |-- main.tf
    `-- variables.tf

This style makes errors more likely. Each directory is its own copy of the config, so beta can be quite different to prod-eu. When you deploy to beta you aren't necessarily confident the same changes will work in prod-eu.

You also have to manually manage multi-region roll-outs. You commit your change for beta, push it through code review and kick off a deployment. Rinse and repeat for every region. You're forced to have a human in the loop, which is not ideal. Machines always wait for the bake period, they always check alarms, and they always check the test suites. Humans get bored and skip steps that worked in every other region. That's setting folks up for failure.
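Here's roughly what I mean by taking the human out of the loop, as a small Python sketch. The deploy and healthy hooks are hypothetical. healthy stands in for "wait out the bake period, check the alarms, run the test suite", and the stage names are just the ones from the layout above.

```python
from typing import Callable, List, Sequence


def roll_out(
    stages: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy to each stage in order and stop at the first unhealthy one.

    Returns the stages that were deployed and passed their checks.
    """
    done: List[str] = []
    for stage in stages:
        deploy(stage)
        # The machine runs this check for every stage, every time.
        if not healthy(stage):
            break
        done.append(stage)
    return done
```

Called as roll_out(["beta", "prod-us", "prod-eu"], deploy, healthy), a bad change that fails its checks in prod-us never reaches prod-eu. A real pipeline would also want to roll the failed stage back.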

Terraform needs to disconnect the infrastructure config files from the environment they are deployed to. A command line argument for "please deploy to beta please and thank you" would be a great start.

What's good about Terraform?

Terraform's "killer feature" is that it's cross-platform. I really appreciate that I can wrangle my Datadog dashboards with the same tool as my AWS resources.

The docs, especially the examples, are really good.

Conclusion

Do not use Terraform for new projects. Check if your cloud provider has built-in deployment tools, and check those tools have closed-loop control and automatic rollbacks.

Terraform is not beyond rescue though. Many of these problems are fixable. With the exception of the state file, I think most are missing features, not fundamental design flaws.