Tags: Architecture, observability, alerting, incident-response

We Had 400 Alerts and Missed the One That Mattered

A memory-leak alert fired 90 minutes before the database ran out of memory (OOM), but it was buried among 400+ alerts a week.

Situation

You're the engineering manager responsible for platform reliability. After a 4-hour database outage, the post-mortem reveals that a memory-leak alert fired 90 minutes before the OOM, but your on-call engineer had 47 unread alerts at the time and never saw it. Leadership wants a plan.

Stakes

  • Primary database heading toward OOM failure with customer data at risk
  • On-call engineer averaging 400+ alerts per week, most non-actionable
  • Alert fired 90 minutes before the outage but was never seen

What's your immediate response to fix the alerting problem?