A Database Migration Took Down the Entire Platform
A routine schema migration brought down a multi-tenant SaaS platform for 47 minutes during business hours. The migration itself was correct. The deployment strategy was the failure.
- Multi-tenant platform serving 200+ enterprise customers
- Schema change required on the largest table in the system
- SLA breach threshold was 30 minutes of downtime per quarter
Enterprise customers had contractual SLAs. A 47-minute outage didn't just break the product — it triggered penalty clauses, eroded trust with accounts in the middle of renewal negotiations, and forced an executive-level post-mortem.
The Scenario
You're the senior engineer at a B2B SaaS company. A new feature requires adding a non-nullable column with a default value to the largest table in your Postgres database — 80 million rows. The migration works perfectly in staging. Production deployment is scheduled for Tuesday morning. How do you deploy it?
No hints. Just judgment.
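To make the failure mode concrete, here's roughly what that migration looks like in Postgres. The events table and tenant_region column are hypothetical stand-ins; the real schema isn't part of this story.

```sql
-- The migration as written: one statement, functionally correct.
-- ALTER TABLE ... ADD COLUMN takes an ACCESS EXCLUSIVE lock, so
-- every read and write against the table queues behind it. On
-- Postgres 10 and earlier, adding a column with a default also
-- rewrites all 80 million rows while that lock is held.
ALTER TABLE events
    ADD COLUMN tenant_region text NOT NULL DEFAULT 'us-east-1';
```

Postgres 11 and later make a constant default metadata-only, but that removes the rewrite, not the lock: if the ALTER queues behind one long-running query, every other query on the table queues behind the ALTER.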
Maintenance windows feel responsible — you're acknowledging the risk and containing it. But they don't solve the problem; they just move it to a time when fewer people are watching. For global platforms, there is no quiet window. And the pattern doesn't scale: every future schema change requires another negotiated outage.
The Lessons
- Staging environments lie about performance if they don't match production data volume
- ORM-generated migrations optimize for correctness, not for operational safety
- A migration that works is not the same as a migration that's safe to deploy
- Maintenance windows don't scale — invest in zero-downtime patterns instead
- Lock behavior is the first thing to check on any migration touching a large table (see the sketch below)
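Checking lock behavior doesn't require special tooling; a couple of queries before the deploy tell you what the DDL will contend with. A minimal sketch, assuming direct psql access to the production database and the hypothetical table from above:

```sql
-- What is running right now, and could it hold a conflicting lock?
-- Any long-running query here will block the ALTER's ACCESS
-- EXCLUSIVE lock, and the ALTER will then block everyone else.
SELECT pid, now() - query_start AS runtime, state, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start;

-- Fail fast instead of queueing: if the migration can't acquire
-- its lock within 2 seconds, abort and retry later.
SET lock_timeout = '2s';
ALTER TABLE events ADD COLUMN tenant_region text;
```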
The Aftermath
- 47-minute outage triggered SLA penalty clauses for multiple enterprise customers
- Phased migration pattern adopted as the team standard (sketched below) — zero migration downtime since
- CI check implemented to flag unsafe migrations before deployment
- Migration review added as a required step in the deployment checklist
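The post doesn't spell out the phases the team standardized on, so what follows is a sketch of the common zero-downtime recipe for Postgres, reusing the hypothetical events/tenant_region names from above and assuming an id primary key.

```sql
-- Phase 1: add the column nullable with no default. This is a
-- metadata-only change on every Postgres version; the lock is
-- held for milliseconds.
ALTER TABLE events ADD COLUMN tenant_region text;

-- Phase 2: give new rows a value. SET DEFAULT never rewrites
-- existing rows.
ALTER TABLE events ALTER COLUMN tenant_region SET DEFAULT 'us-east-1';

-- Phase 3: backfill old rows in small batches (run repeatedly,
-- e.g. from a script, until it reports 0 rows updated) so no
-- single transaction locks a large slice of the table.
UPDATE events
SET tenant_region = 'us-east-1'
WHERE id IN (
    SELECT id FROM events
    WHERE tenant_region IS NULL
    LIMIT 5000
);

-- Phase 4: enforce NOT NULL without a blocking full-table scan.
-- NOT VALID makes the ADD CONSTRAINT instant; VALIDATE scans the
-- table but takes a lock that still permits reads and writes.
ALTER TABLE events
    ADD CONSTRAINT events_tenant_region_not_null
    CHECK (tenant_region IS NOT NULL) NOT VALID;
ALTER TABLE events VALIDATE CONSTRAINT events_tenant_region_not_null;

-- Phase 5 (Postgres 12+): SET NOT NULL sees the validated check
-- constraint and skips its own scan; the check is then redundant.
ALTER TABLE events ALTER COLUMN tenant_region SET NOT NULL;
ALTER TABLE events DROP CONSTRAINT events_tenant_region_not_null;
```

Each phase is separately deployable and separately revertible, which is also what makes the approach enforceable: the CI check mentioned above can be as blunt as flagging any single-statement ADD COLUMN ... NOT NULL DEFAULT against a table over a size threshold.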