
How I learned patching sucks

You should still probably apply your patches. Eat your veggies, kids.

Back in the day, at $Job-N, I was "in charge" of security things. This was after the Heartbleed fiasco, so we were trying to get our ducks in a row.

We decided to regularly patch our Debian "fleet." We had 3 or 4 servers at that time, one of which was our test server. Each of the live servers housed 2 to 5 customers, depending on customer size. Each server also ran a database cluster, and each customer had a database in that cluster.

This was in the days before Server Name Indication (SNI) was common, so each customer got an IP address, which nginx listened on. The IP address determined which TLS certificate the customer got.
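For anyone who never ran a pre-SNI setup, it looked roughly like the sketch below. The IPs, domains, and certificate paths are invented for illustration; the real config had more going on.

```nginx
# One server block per customer, each bound to that customer's dedicated IP.
# With no SNI, the only way to pick the right certificate is by which IP
# the client connected to.
server {
    listen 203.0.113.10:443 ssl;              # customer A's IP (made up)
    server_name customer-a.example.com;
    ssl_certificate     /etc/ssl/certs/customer-a.crt;
    ssl_certificate_key /etc/ssl/private/customer-a.key;
}

server {
    listen 203.0.113.11:443 ssl;              # customer B's IP (made up)
    server_name customer-b.example.com;
    ssl_certificate     /etc/ssl/certs/customer-b.crt;
    ssl_certificate_key /etc/ssl/private/customer-b.key;
}
```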

We applied the patches on our test server. No issues there, so we arranged downtime with our customers. We'd need to reboot the servers because the updates included kernel patches. Of course, we started with the first server, which housed some of our earliest (and therefore largest) customers.

The maintenance window arrived and I started. There were more updates than I expected, but I made a list of the affected packages, just in case I needed to roll back. Then I installed the roughly 50 updates and rebooted the server.

Everything looked fine. The database came up, customer systems booted, and logs started flowing. Then I tried to go to a customer's domain. Nothing. The server was just not on the network. But it was on the network: I was ssh'd into it!

I started trying to debug what I was seeing. Maybe nginx was down? But I could get HTTP responses using curl while on the server, so it was up. lsof looked right for nginx. I got my colleagues involved, but they were equally stumped. At some point I found myself fruitlessly poking around with ifconfig and disabling/re-enabling the networking stack.

I decided I wanted to know if there was something "wrong" with the incoming packets, so I eventually turned on tcpdump and started watching. This was how I noticed the packets from my curl commands were different to the packets from my browser.

Eventually, I realised that packets addressed to our virtual IPs weren't making it to nginx. I can't recall the exact mechanism, but I went into our networking config and made a few tweaks. One more networking reload and the site sprang to life as if nothing had happened.

I think we spent the afternoon writing sad emails to our customers, and then I got started on the incident report. I learned some things that I rationally knew, but hadn't internalised.

The first was that testing, no matter how thorough, is never perfect. I ran the exact same commands on our test server as on our live server. But I didn't realise the networking config was different, so I was testing a different system.

The second was that applying many changes at once invites disaster. If every change independently has a chance p of causing a Big Problem, then applying n changes has a 1 - (1 - p)^n chance of causing a Big Problem. That ^n sucks.
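To put rough numbers on that ^n, here's a two-line sketch; the 1% per-change failure rate is invented purely for illustration:

```python
def chance_of_big_problem(p: float, n: int) -> float:
    """Chance that at least one of n independent changes causes a Big Problem."""
    return 1 - (1 - p) ** n

print(chance_of_big_problem(0.01, 1))   # 0.01   -> 1% for a single change
print(chance_of_big_problem(0.01, 50))  # ~0.395 -> roughly 40% for 50 changes
```

Even a tiny per-change risk adds up fast once you batch 50 of them together.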

Third, rollback plans don't matter unless you've practiced them and you actually use them. That's why Amazon puts so much emphasis on auto-rollback. Partly, it's speed: auto-rollback is faster than people. It's also consistency: people view a rollback as a big, scary action, so they avoid it. It should be the safe option!

Fourth, weird configs are not well tested. Stick to common configs if possible. I'm unsure how many other folks hit the same bug, but I don't think Debian would have released a patch that took out bits of folks' networking if they'd known about it. Our test server was unaffected because it used the more common configuration.

Fifth, maybe don't apply changes to your biggest customers first. I'm not sure if the later servers shared the same config, so starting elsewhere might not have helped with this specific incident. If possible, don't build your service so that a single server having a Bad Day causes an outage. We spent some time talking about load balancers after that.

Finally, maintenance windows have a weird "do or die" psychological effect. If I hadn't been under time pressure, I might've stopped when I saw how different the update list was from the one on our test server. No guarantees.

Today, I recommend a few practices to mitigate some or all of these. On AWS, avoid yum update. Apply your patches by launching a fresh instance from the most up-to-date AMI and redeploying onto it. You can stand up and test this new server before destroying the old one. Use a rolling deployment strategy with automatic rollback on customer health alarms.
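As one concrete (and AWS-specific) way to do that, EC2 Auto Scaling's instance refresh can handle the rolling replacement and the automatic rollback for you. This is only a sketch: the group name and thresholds are invented, it assumes the group already points at a launch template baked with the patched AMI, and you'd wire health checks to whatever your customers actually depend on.

```python
# Sketch: replace instances in an Auto Scaling group with freshly patched
# ones, a few at a time, rolling back automatically if the refresh fails.
import boto3

autoscaling = boto3.client("autoscaling")

response = autoscaling.start_instance_refresh(
    AutoScalingGroupName="customer-web-asg",  # hypothetical group name
    Strategy="Rolling",
    Preferences={
        "MinHealthyPercentage": 90,  # keep most of the fleet in service
        "InstanceWarmup": 300,       # seconds before a new instance counts as healthy
        "AutoRollback": True,        # undo the refresh if replacements go unhealthy
    },
)

print("Started instance refresh:", response["InstanceRefreshId"])
```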

Finally, frequently release small changesets, ideally just one commit. Notice that I did not recommend slowing down or avoiding security patches!