Tuning alarms
2021-Jan-01

You can tune alarms using Laplace's Law of Succession.
When the on-call's pager goes off, they should be able to react quickly. If the system has a lot of false alarms, they may waste time checking whether there is a real problem. If there are too many false alarms, the on-call may need to prioritize some pages and ignore others. This lowers system availability as the time-to-repair increases.
There's also the human side of false alarms. On-calls make lower-quality decisions when their sleep is disrupted, which again increases the time-to-repair. If they feel unable to fix the alarm problem, they'll gradually become demotivated and leave.
On the other hand, we need to detect problems quickly. If the alarm takes a long time to fire, the time-to-detection increases, and customers may find the problem before our on-call does. We need to balance the alarm's sensitivity with its false alarm rate. I explain how we can do this quickly and easily, with comparatively little data.
The main idea is to use Laplace's Law of Succession to quickly estimate the chance of a real problem given an alarm. Finally, I explain how to include more data in the estimate.
Existing approaches
I do not intend to replace these methods. They are mostly ways to choose an alarm threshold, window, or metric. They don't tell you how likely a problem is given an alarm.
Alert on burn rate
Alerting on burn rate (also known as SLO-burn) lets you choose alarm thresholds based on your error budget. I think it's best explained with a few examples.
Say the error budget is 30,000 errors per month. You would then have a handful of rules to check this:
- Spending the budget in one day: 21 errors/minute
- Spending the budget in a week: 179 errors/hour
- Spending the budget in the month: 250 errors/6 hours
You choose your alarm window accordingly. You might require three 1-minute data points for the one-day rule, but two 1-hour data points for the one-week rule.
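As a rough sketch of the arithmetic, assuming a 30-day month and the 30,000-error budget above (the rule names are just labels for illustration):

```python
# Sketch: derive the burn-rate thresholds above from a monthly error budget.
# Assumes a 30-day month; the windows are the ones from the list above.

MONTHLY_BUDGET = 30_000          # errors per month
MINUTES_PER_DAY = 24 * 60
HOURS_PER_WEEK = 7 * 24
SIX_HOUR_PERIODS_PER_MONTH = 30 * 4

rules = {
    "budget spent in a day (errors/minute)": MONTHLY_BUDGET / MINUTES_PER_DAY,
    "budget spent in a week (errors/hour)": MONTHLY_BUDGET / HOURS_PER_WEEK,
    "budget spent in the month (errors/6 hours)": MONTHLY_BUDGET / SIX_HOUR_PERIODS_PER_MONTH,
}

for rule, threshold in rules.items():
    print(f"{rule}: alarm above {threshold:.0f}")
```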
This links your risk-tolerance to your alarms. In theory, on-calls are only paged when the error budget is being spent too quickly. You should never get a nasty surprise at the end of the month when you find you missed a small but consistent drop.
This is a decent method to find the alarm threshold, but it doesn't say much about false positives. False positives are still possible, as the network might have a blip lasting a few minutes. On-call probably can't do anything about it, but we'll still page them. This could be a serious problem for low-traffic systems as the noise swamps the measurement. You can work around this by making the alerting window longer, but there's no quantitative method to know when you've got the right alerting window.
Symptom-based alerting
Symptom-based alerting is more about what to alert on, rather than choosing the alarm level. For this, you use a metric the customer cares about, such as latency or error rate. The main idea is the difference between "what" and "why". Customers might see a high error rate (what), but it's caused by a bad database fail-over (why). Similarly, customers don't care that a disk has failed if their data is still available.
This helps guide you towards alarms with fewer false positives, but it's not a perfect solution. Datadog lists an exception to the rule:
"It is sometimes necessary to call human attention to a small handful of metrics even when the system is performing adequately. Early warning metrics reflect an unacceptably high probability that serious symptoms will soon develop and require immediate intervention."
They list disk space as a classic metric, but how do you differentiate an "early warning sign" from something that customers don't care about? You need some measurement.
Overalert and tune
You can deliberately over-alert, then tune the alarms until they're acceptable. It's easy to accidentally fall into this pattern.
Over time, this creates reasonable alarms, and it is very quick to set up: just set your alarms to be over-aggressive and leave it to the on-calls to tune them. However, this policy has some side-effects.
You need to clearly communicate this to the on-calls and give them time to tune the alarms. Without it, your on-calls are just stuck with over-aggressive alarms.
It's hard to know when you've gone too far. You may over-tune an alarm and start missing actual outages. To avoid this, you need to move the alarm in small increments, which causes more false alarms.
These false alarms "spend" the on-call's good will. When the system is new there may be many false alarms. On-calls need to triage all of these, then tune the alarms. This is a lot of effort, and it is borne by the on-calls -- usually in their off-hours.
Almost all teams suffer some turnover. Gradually, the rationale for setting the alarms at a given level disappears. New on-calls won't have the context to know whether they can tune an alarm or whether it's a "real" problem. This can be alleviated with clear Service Level Objectives (SLOs).
Finally, this approach rarely removes alarms. If you're simply measuring something customers don't care about, then almost every alarm on that metric will be a false alarm. These alarms should be removed, but it's hard to argue for this without clear SLOs and supporting data.
A new approach: Laplace's Law of Succession
If we have historical data for the metric before we deploy the alarm, we can measure the alarm's performance before it ever annoys the on-call!
We use two measurements:
- The number of genuine outages this alarm would have fired for, and,
- The number of times the alarm would have wrongly fired.
Now we can use Laplace's Law of Succession to estimate the probability that an alarm represents an actual outage:

P(real outage | alarm fired) = (genuine alarms + 1) / (genuine alarms + false alarms + 2)
This assumes that you didn't know anything about the alarm's performance before looking at the data, so you expected it to give you an equal number of false and genuine alarms.
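A minimal sketch of the calculation in Python (the function name is mine, not from any library):

```python
def p_real_problem(genuine_alarms: int, false_alarms: int) -> float:
    """Laplace's Law of Succession: the estimated probability that the next
    alarm corresponds to a real outage, given the historical counts."""
    return (genuine_alarms + 1) / (genuine_alarms + false_alarms + 2)

# The example below: 5 genuine alarms and 2 false alarms in the history.
print(p_real_problem(5, 2))  # ~0.67
```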
You can use this to work out how good your existing alarm is. You could use this calculation to help argue that an alarm is doing more harm than good, but remember that different people in different roles have different tolerances for false positives.
This works with alerting on burn rate, as it helps you choose your alerting window. It also works with symptom-based alerting, as it shows you which alarms represent genuine problems and which are "red herrings."
I think this is better than over-alerting and tuning. It has roughly the same effect with less effort and less cost. You can still be slightly aggressive if you're not confident, but you can tell your on-calls and make more reasoned trade-offs. For example, you can make the alarms for high-impact problems more aggressive and make the rest less aggressive.
Example
You choose a new alarm for your system and look at the historical data. You note that, were the alarm running at the time, you would have seen 5 genuine alarms and 2 false alarms. You conclude that your best estimate of the probability of an outage, given that this alarm has fired, is (5 + 1) / (5 + 2 + 2) ≈ 67%.
You might decide that this is too many false alarms and choose a different threshold. After looking at a few thresholds you might decide that the metric is too noisy and try a different metric. Maybe you look for more data (e.g. another region) and re-calculate this to get more confidence.
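Here is a sketch of that threshold comparison. The candidate thresholds and the alarm counts they produce are hypothetical; in practice you would get the counts by replaying the metric's history against each threshold:

```python
def p_real_problem(genuine_alarms: int, false_alarms: int) -> float:
    return (genuine_alarms + 1) / (genuine_alarms + false_alarms + 2)

# Hypothetical candidate thresholds and the (genuine, false) alarm counts
# each would have produced against the historical data.
candidates = {
    100: (5, 9),
    200: (5, 2),
    300: (3, 0),
}

for threshold, (genuine, false) in candidates.items():
    print(f"threshold={threshold}: P(outage | alarm) = "
          f"{p_real_problem(genuine, false):.2f}")

# The highest threshold looks best on precision (0.80), but it would have
# missed two genuine outages, so it trades time-to-detection for quiet.
```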
Using prior knowledge
If you have been running the system for a while, then you don't need to use Laplace's Law of Succession: you can update a Beta distribution directly. Look at your existing alarms -- even for other systems -- and use them as the initial Beta distribution (a, b). Then update it with the measurements you have for the new alarm. The more data you have for the new alarm, the stronger the update will be.
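A sketch of that update; the prior parameters are placeholders that you would base on counts from your existing, comparable alarms:

```python
# Sketch: Bayesian update of a Beta prior with the new alarm's measurements.
# The prior parameters below are placeholders.

prior_a = 8   # roughly: genuine alarms seen on comparable existing alarms
prior_b = 4   # roughly: false alarms seen on comparable existing alarms

new_genuine = 5   # genuine outages the new alarm would have caught
new_false = 2     # times the new alarm would have fired wrongly

# The posterior is Beta(prior_a + new_genuine, prior_b + new_false);
# its mean is the estimated probability that an alarm is a real outage.
a = prior_a + new_genuine
b = prior_b + new_false
print(f"P(outage | alarm) = {a / (a + b):.2f}")  # ~0.68 with these numbers
```

Laplace's Law of Succession is just the special case where the prior is Beta(1, 1).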
Limitations
You can't do this without historical metrics, like when you launch a new system. In these cases you have to estimate or look at how other systems have set their alarm thresholds.
It takes time to do this, and it's worse if you don't have good records. Most of your time will be spent with your monitoring system, trying to work out "is this a false positive?" Fortunately, this improves with time as you get familiar with where the records (if any) are kept and build up a repertoire of queries for your monitoring system.
Conclusion
You can use Laplace's Law of Succession to estimate the probability of an actual outage given an alarm. You can do this before you turn on the alarm, which saves your on-calls getting woken up for nothing.
False alarms break down your on-calls' trust in the alarms. People make worse-quality decisions when sleep-deprived. A high false positive rate can make your on-calls ignore some alarms and can cause people to leave the team. This leads to worse outages and higher turnover.
Personally, I think the best way to use this is to supplement alerting on burn rate and symptom-based alerting. You can leap-frog the tuning part of setting up a new alarm and go directly to an alarm that rarely fires when there's no problem.