Architectureobservabilityalertingincident-response
We Had 400 Alerts and Missed the One That Mattered
A memory leak alert fired 90 minutes before the database OOM'd, but was buried in 400 weekly alerts.
Situation
You're the engineering manager responsible for platform reliability. After a 4-hour database outage, the post-mortem reveals the alert fired 90 minutes before the OOM, but your on-call engineer had 47 unread alerts at the time. Leadership wants a plan.
Stakes
- Primary database heading toward OOM failure with customer data at risk
- On-call engineer averaging 400+ alerts per week, most non-actionable
- Alert fired 90 minutes before the outage but was never seen
What's your immediate response to fix the alerting problem?