We Had 400 Alerts and Missed the One That Mattered
The on-call engineer was receiving more than 400 alerts per week. When a real incident began (a slow memory leak that would eventually get the primary database OOM-killed), the alert was buried in the noise. It fired 90 minutes before the database went down, and the resulting outage lasted 4 hours.
- Primary database heading toward OOM failure with customer data at risk
- On-call engineer averaging 400+ alerts per week, most non-actionable
- Alert fired 90 minutes before the outage but was never seen
The monitoring system worked. It detected the problem with 90 minutes of lead time — more than enough to prevent the outage. But the alerting system had trained the on-call team to ignore it. The failure wasn't detection. It was attention.
The Scenario
You're the engineering manager responsible for platform reliability. After a 4-hour database outage, the post-mortem reveals that the alert fired 90 minutes before the OOM kill, at which point the on-call engineer had 47 unread alerts. Average weekly alert volume is over 400. Leadership wants a plan to make sure this doesn't happen again. What do you propose?
A better alerting platform can group and deduplicate alerts, which reduces the volume the on-call engineer sees. But if the underlying alerts aren't actionable, you've organized the noise without improving the signal: grouped non-actionable alerts are still non-actionable. The investment in tooling feels productive, but it doesn't change the fundamental ratio of signal to noise.
- Every post-mortem adds alerts — build a process that also removes them
- Alert fatigue is a system design problem, not a willpower problem
- The metric that matters is not alert volume but signal-to-noise ratio
- Monitoring that doesn't require immediate action should be a dashboard, not a page
- On-call engineers will always optimize for their own sanity — design the system so that optimization aligns with reliability
- Weekly alert volume reduced by over 90%
- Mean time to acknowledge real incidents dropped from 47 minutes to under 5 minutes
- Zero missed real alerts in the quarter following the audit
- Quarterly alert review process adopted by three additional teams
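To make the signal-to-noise lesson concrete, here is a minimal sketch of what an alert audit could compute. It is not the tooling used in this incident. It assumes you can export alert history as records tagged with whether the responder actually had to act; the `AlertEvent` type, the rule names, the 50% threshold, and the three verdict labels are all hypothetical choices for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertEvent:
    rule: str         # name of the alert rule that fired (hypothetical)
    actionable: bool  # True if the responder had to do something


def signal_to_noise(events):
    """Fraction of all fired alerts that required action."""
    if not events:
        return 0.0
    return sum(e.actionable for e in events) / len(events)


def audit(events, min_actionable_rate=0.5):
    """Return {rule: (times_fired, actionable_rate, verdict)}.

    The verdicts are an assumed policy, not a standard:
      page             -> usually actionable, keep paging on-call
      tune             -> sometimes actionable, fix the threshold or scope
      demote_or_delete -> never actionable, move to a dashboard or remove
    """
    fired = Counter(e.rule for e in events)
    acted = Counter(e.rule for e in events if e.actionable)
    report = {}
    for rule, count in fired.items():
        rate = acted[rule] / count
        if rate >= min_actionable_rate:
            verdict = "page"
        elif rate > 0:
            verdict = "tune"
        else:
            verdict = "demote_or_delete"
        report[rule] = (count, rate, verdict)
    return report


# Made-up history for illustration only.
sample = (
    [AlertEvent("db_memory_growth", True)] * 2
    + [AlertEvent("disk_70_percent", False)] * 30
    + [AlertEvent("cpu_spike", True)] * 1
    + [AlertEvent("cpu_spike", False)] * 9
)

for rule, (count, rate, verdict) in sorted(audit(sample).items()):
    print(f"{rule}: fired {count}x, {rate:.0%} actionable -> {verdict}")
print(f"overall signal-to-noise: {signal_to_noise(sample):.1%}")
```

Run on a recurring schedule, a report like this gives the review process something to remove as well as add. A rule that fires rarely but is always actionable, like the hypothetical `db_memory_growth` above, stays a page, while high-volume rules that never lead to action become candidates for a dashboard.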