This keeps happening: TLS expiry
2024-Jan-11
TLS certificate expiry keeps taking out big sites, but it doesn't have to
On 2023-Dec-31, Keybase had an outage because a TLS certificate expired. Expired TLS outages have happened at GitHub, Oculus/Meta, Microsoft Teams, and Vim.
It doesn't seem to matter whether you're a large organisation or an open-source project: expired TLS certs keep causing outages. I've even seen automated renewal systems fail from time to time.
Failure modes
Getting a TLS cert is cheap and easy with Let's Encrypt. Even better, they offer a battle-tested automatic renewal tool, certbot. Unfortunately, battle-tested does not equal infallible. Customers might still see an expired cert in a few scenarios.
Renewal itself can fail. Your scheduler could miss the renewal time, through either a bug or a misconfiguration. For example, if you write your cron expression to run on the 31st day of the month, then June (which only has 30 days) might be a bad month for you. Or the host running the scheduler might be powered off on that day for unrelated reasons.
Even if invoked at the right time, the renewer process may fail. Problems range from "disk full" to "shared library broken by apt update." Even third-party services like AWS Certificate Manager (ACM) can fail if the required DNS records have gone missing. (It's always DNS.)
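If your renewer runs unattended, it's worth wrapping the run so errors surface as tickets instead of silent gaps. Here's a minimal sketch, assuming certbot; the alert hook is a placeholder for your own ticketing integration.

```python
# A minimal sketch: run the renewal and surface any failure, rather than
# letting cron swallow the error. Assumes certbot; alert() is a placeholder.
import subprocess


def alert(message: str) -> None:
    print(message)  # stand-in: wire this to your ticketing or paging system


result = subprocess.run(
    ["certbot", "renew", "--non-interactive"],
    capture_output=True,
    text=True,
)
if result.returncode != 0:
    alert(f"certificate renewal failed:\n{result.stderr}")
```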
If renewal works, we still need to distribute the certificate. This might mean scp-ing it to a fleet of back-end services, or sending it to a load balancer. Unfortunately, this step can also fail! People change firewall rules, audit (and remove) "unused" credentials, and so on.
Finally, even if you renew your certificate and correctly distribute it, your services need to use it. Some services need a restart, SIGHUP, or other poke to pick up the new certificate. You need something to issue the poke, and the certificate needs to be in the right directory, with the right permissions, and so on.
With a fleet, you can also see partial failures where some but not all hosts get the new certificate.
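To make the distribute-and-poke step concrete, here's a sketch of what it might look like for a small fleet. The hostnames, paths, and the choice of nginx are assumptions about your setup, not a recommendation:

```python
# A sketch of distributing a renewed certificate and reloading the
# TLS-terminating process on each host. Hostnames, paths, and nginx
# are assumptions about your environment.
import subprocess

HOSTS = ["web-1.internal", "web-2.internal"]
FILES = [
    "/etc/letsencrypt/live/example.com/fullchain.pem",
    "/etc/letsencrypt/live/example.com/privkey.pem",
]

failures = []
for host in HOSTS:
    try:
        for path in FILES:
            subprocess.run(["scp", path, f"{host}:{path}"], check=True, timeout=60)
        # Reload rather than restart so in-flight connections survive.
        subprocess.run(
            ["ssh", host, "sudo", "systemctl", "reload", "nginx"],
            check=True,
            timeout=60,
        )
    except subprocess.SubprocessError as exc:
        failures.append((host, exc))

# Partial failure is the dangerous case: some hosts serve the new cert
# while others serve the old one. Make sure it's loud.
if failures:
    raise SystemExit(f"certificate distribution failed on: {failures}")
```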
Suffice it to say, automatic certificate renewal reduces the risk, but, as always, the risk is never zero. As with all risk controls, automatic renewal may make the risk acceptable for some organisations. Others will want to reduce it further.
Additional controls
I like to think about risk controls in three categories: Prevent, detect, and mitigate. Automatic renewal is a prevention, but you can't do any mitigation without detection. That said, you get some detection "for free" via customer complaints. Unfortunately, that "free" detection costs a lot of customer trust.
Monitoring
Services like ACM send a DaysToExpiry metric, which you can set alarms on. I don't know what metrics other services or tools send; you might need to build this yourself.
You probably want a process that periodically sends a DaysToExpiry metric to your monitoring-system-of-choice. Your TLS-terminating process might be a good place for this, but you can also use a probe. If you use a probe, you have to make sure the probing process itself keeps running too.
That metric shouldn't rely only on the leaf certificate's notAfter value. It's rarer, but intermediate and root certificates can also expire. In theory, you can catch this when you mint the certificate, but I don't know of any tools that do this check. Your probe should walk the chain and use the soonest notAfter date as its DaysToExpiry value.
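Here's a minimal sketch of such a probe. It checks the leaf certificate over a live connection with Python's standard library, then walks a local PEM bundle with the third-party cryptography package for the rest of the chain; the hostname and file path are placeholders.

```python
# A minimal DaysToExpiry probe sketch. The hostname and chain path are
# placeholders; requires the third-party "cryptography" package.
import datetime
import socket
import ssl
import time

from cryptography import x509


def leaf_days_to_expiry(host: str, port: int = 443) -> float:
    """Days until the certificate served by a live endpoint expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return (ssl.cert_time_to_seconds(cert["notAfter"]) - time.time()) / 86400


def chain_days_to_expiry(pem_path: str) -> float:
    """Days until the soonest expiry anywhere in a local PEM bundle."""
    with open(pem_path, "rb") as f:
        chain = x509.load_pem_x509_certificates(f.read())  # cryptography 39+
    now = datetime.datetime.now(datetime.timezone.utc)
    soonest = min(c.not_valid_after_utc for c in chain)  # cryptography 42+
    return (soonest - now).total_seconds() / 86400


# Report the soonest expiry across the leaf and the rest of the chain.
days = min(
    leaf_days_to_expiry("example.com"),
    chain_days_to_expiry("/etc/letsencrypt/live/example.com/fullchain.pem"),
)
print(f"DaysToExpiry: {days:.1f}")  # send this to your monitoring system instead
```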
Set a couple of alarms on that metric: one that cuts a ticket if the certificate will expire "soon", past the point where renewal should already have happened, and a second that pages if expiry is "imminent", in case the staff missed the ticket. Both alarms should treat missing data as breaching.
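If your monitoring happens to live in CloudWatch, the two alarms might look like this sketch. The TLS namespace, the thresholds (which assume 90-day certs renewed 30 days out), and the SNS topic ARNs are illustrative assumptions:

```python
# A sketch of the two-alarm scheme in CloudWatch via boto3. The namespace,
# thresholds, and SNS topic ARNs are illustrative, not a convention.
import boto3

cloudwatch = boto3.client("cloudwatch")


def expiry_alarm(name: str, threshold_days: float, actions: list[str]) -> None:
    cloudwatch.put_metric_alarm(
        AlarmName=name,
        Namespace="TLS",  # assumed custom namespace for the probe's metric
        MetricName="DaysToExpiry",
        Statistic="Minimum",
        Period=3600,
        EvaluationPeriods=1,
        Threshold=threshold_days,
        ComparisonOperator="LessThanOrEqualToThreshold",
        TreatMissingData="breaching",  # no data is as bad as a breaching value
        AlarmActions=actions,
    )


# Ticket once renewal should already have happened; page when it's imminent.
expiry_alarm("tls-expiry-soon", 25, ["arn:aws:sns:region:account:ticket-topic"])
expiry_alarm("tls-expiry-imminent", 5, ["arn:aws:sns:region:account:page-topic"])
```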
You can also put an alarm on your incoming request rate. A drop here could mean your TLS cert has expired, you've pushed broken code to the front-end, botched your DNS config, or anything else Bad that doesn't show up as errors.
Finally, put this metric on your dashboard, including horizontal lines for when you expect renewal, when you expect a ticket, and when you expect a page. I'm assuming your org already has a regular dashboard review. That's your final chance to notice the problem even if your alarms are misconfigured.
Synthetics
Synthetics are similar to end-to-end tests. They regularly check a single use-case on your system; if you have a lot of use-cases, choose only the most important. You set them up so you get an alarm if a bad change makes it all the way out to production. They are, in many ways, the last line of defence for detecting problems before a customer notices and calls you.
A synthetic should fail and raise an alarm if it can't successfully complete its use-case. That includes DNS not resolving, invalid or expired TLS certs, or any other major problem.
Your synthetic might be able to send metrics on the certificates it sees. This might ease your burden, but not every synthetic provider can check inside your infrastructure. They can usually get your publicly-facing services, but internal or private services might go unmonitored.
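A synthetic doesn't have to be elaborate. Here's a bare-bones sketch using only Python's standard library; the URL and the alert hook are placeholders for your own service and paging integration:

```python
# A bare-bones synthetic: exercise one use-case and alarm on any failure,
# including DNS and TLS errors. The URL and alert() hook are placeholders.
import urllib.request


def alert(message: str) -> None:
    print(message)  # stand-in: wire this to your paging system


def run_synthetic(url: str) -> None:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            if resp.status != 200:
                alert(f"synthetic got status {resp.status} from {url}")
    except Exception as exc:
        # urllib wraps DNS failures and certificate-verification errors
        # (including expired certs) in URLError, so one handler catches both.
        alert(f"synthetic failed for {url}: {exc}")


run_synthetic("https://example.com/login")
```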
Jitter
Mint your certificates with slightly different expiry dates. For example, if you launch a multi-region service, give yourself a 5 to 10 day spread on the expiry dates. Keep adding to that spread every time you renew.
If all your other controls fail, at least you won't have a global outage on the expiry date. You'll have one or two regions go down, which sucks. On the other hand, you now have advance warning that your other regions are about to take an outage within the next few days.
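Here's a sketch of what minting with jitter might look like, assuming you control certificate lifetimes at minting time; the regions and base lifetime are illustrative:

```python
# A minimal sketch of expiry jitter: give each region a slightly different
# lifetime so one failure can't take every region down on the same day.
import random

BASE_LIFETIME_DAYS = 90


def jittered_lifetime(spread_days: int = 10) -> int:
    # Fresh randomness on every renewal keeps the expiry dates drifting apart.
    return BASE_LIFETIME_DAYS - random.randint(0, spread_days)


for region in ["us-east-1", "eu-west-1", "ap-southeast-2"]:
    print(f"{region}: mint cert valid for {jittered_lifetime()} days")
```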
Runbooks
You need to know what to do during an outage. Wrangling openssl req under duress sucks. Write down how to get fresh certificates out of your tools. Write down how to get them onto your services, what file permissions they need, how to safely bump the processes, and so on.
This reduces both the duration of the outage (hurrah, mitigation) and how much time folks spend minting certificates under normal circumstances (assuming, that is, you're still doing it by hand).
Launch checklists
Put these controls in your service launch checklist. If you don't have a mechanism to make sure everyone's doing The Right Thing, a few won't. Those few can damage your whole company's reputation. Launch checklists help make sure folks are making risk trade-offs that senior leaders are comfortable with. They also make sure folks don't simply forget in the rush of launch.
Conclusion
This keeps happening, but it doesn't need to. TLS certificates have an unavoidably wide blast radius, so it makes sense to use extra safeguards on them.
Those safeguards generally fall into one of three categories: prevent (autorenew, calendar reminders), detect (alarms on DaysToExpiry, synthetics), and mitigate (jitter, runbooks). You can use this framework to find controls for other threats.
You don't need to use all of the controls in this article. You should use those that fit your budget, risk level, and risk tolerance. For example, I rely only on calendar reminders and auto-renewal for this site because I don't earn my living through it. Business-critical services get a more complete treatment.
Acknowledgements
I didn't invent any of these techniques. I learned the monitoring, jitter, and synthetics tricks from Amazon; Fowler introduced me to launch checklists with "Production-Ready Microservices"; and I found out about runbooks from Beyer et al. in "Site Reliability Engineering".
2024-Jan-13 Addendum
Thank you to Sam/Cass (They/Them) on the Overthinking Everything Discord for the useful feedback on some controls I didn't mention:
something that you sort of touched on but is worth calling out is that you can rotate the cert well ahead of time (e.g. a month before expiry) to give plenty of recovery time. You can also use short expiry times (like letsencrypt) as a forcing function to make sure your automation works reliably (and alert on the automation having errors like any service)
Another one is making sure you have a good understanding of all the different certs that might be being used (e.g. edge certs on your CDN and origin certs on your internal APIs).. which if you miss can lead to some exciting problems :p
I entirely agree!
Renew early
Renewing early gives you time to respond. If you renew 45 days before the expiry date, you have weeks to recover from a failed renewal. This works well with the alarm scheme above: a TLS certificate with only 43 days to expiry should cut a ticket, and one with, say, 5 days should page someone. You lose all of that slack if you renew with only a few days to spare.
There are limits to this: some tools refuse to renew certificates very early. (certbot, for example, only renews once a certificate is within 30 days of expiry by default.) Those limits are usually a reasonable balance, but you'll want to know them so you can set your alarms accordingly and check you're comfortable with the time you'll have to respond.
Renew often
On the other side of the balance sheet, renewing often is also a control. If you renew every 90 days, then you have at most 90 days between a breaking change and that change causing problems. Combined with renewing early, you can find and fix breaking changes in your systems before they cause customer impact.
Renew early and often applies equally well to other service-to-service credentials.
Know what you've got
You can't use any of these controls (except maybe runbooks) if you don't know what TLS certificates you have. You'll often find there are many more certificates than you expect.
If you don't have such a list, you can start with your DNS records. That includes DNS records for private or internal systems. For a large organisation, I think this will miss some certificates. You may need to go through every service and build a list of every dependency, then check those dependencies.
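Once you have a candidate host list, a first-pass inventory can reuse the same stdlib probing trick from earlier. This sketch assumes a hosts.txt with one DNS name per line, exported from your zones (public and internal alike):

```python
# A sketch of a first-pass certificate inventory driven by a DNS-derived
# host list. hosts.txt (one hostname per line) is an assumed input.
import socket
import ssl


def peer_cert_expiry(host: str, port: int = 443) -> str:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


with open("hosts.txt") as f:
    for host in (line.strip() for line in f if line.strip()):
        try:
            print(f"{host}: certificate expires {peer_cert_expiry(host)}")
        except OSError as exc:  # covers DNS, connection, and TLS errors
            print(f"{host}: no TLS on 443 or unreachable ({exc})")
```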
You may even have to look into the abyss, to fight the Great Evil that is mutual TLS (mTLS). In mTLS, both the service and client check each others' certificates when starting a connection. These certificates don't have corresponding DNS records, they don't show up in lists of dependencies, and there is no endpoint you can poll for metrics. Worse, the server can reject a valid client certificate if the server's root certificates are out of date. Making mTLS reliable and safe is well beyond the scope of this article. mTLS: Not even once.