A Database Migration Took Down the Entire Platform
A routine schema migration brought down a multi-tenant SaaS platform for 47 minutes during business hours. The migration itself was correct. The deployment strategy was the failure.
- Multi-tenant platform serving 200+ enterprise customers
- Schema change required on the largest table in the system
- SLA breach threshold was 30 minutes of downtime per quarter
Enterprise customers had contractual SLAs. A 47-minute outage didn't just break the product — it triggered penalty clauses, eroded trust with accounts in the middle of renewal negotiations, and forced an executive-level post-mortem.
The Scenario
You're the senior engineer at a B2B SaaS company. A new feature requires adding a non-nullable column with a default value to the largest table in your Postgres database — 80 million rows. The migration works perfectly in staging. Production deployment is scheduled for Tuesday morning. How do you deploy it?
No hints. Just judgment.
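To make the failure mode concrete, here's roughly what that migration looks like in Postgres. The events table and tenant_region column are hypothetical stand-ins; the real schema isn't part of this story.

```sql
-- The migration as written: one statement, functionally correct.
-- ALTER TABLE ... ADD COLUMN takes an ACCESS EXCLUSIVE lock, so
-- every read and write against the table queues behind it. On
-- Postgres 10 and earlier, adding a column with a default also
-- rewrites all 80 million rows while that lock is held.
ALTER TABLE events
    ADD COLUMN tenant_region text NOT NULL DEFAULT 'us-east-1';
```

Postgres 11 and later make a constant default metadata-only, but that removes the rewrite, not the lock: if the ALTER queues behind one long-running query, every other query on the table queues behind the ALTER.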
Maintenance windows feel responsible — you're acknowledging the risk and containing it. But they don't solve the problem; they just move it to a time when fewer people are watching. For global platforms, there is no quiet window. And the pattern doesn't scale: every future schema change requires another negotiated outage.
The Lessons
- Staging environments lie about performance if they don't match production data volume
- ORM-generated migrations optimize for correctness, not for operational safety
- A migration that works is not the same as a migration that's safe to deploy
- Maintenance windows don't scale — invest in zero-downtime patterns instead
- Lock behavior is the first thing to check on any migration touching a large table (see the sketch below)
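Checking lock behavior doesn't require special tooling; a couple of queries before the deploy tell you what the DDL will contend with. A minimal sketch, assuming direct psql access to the production database and the hypothetical table from above:

```sql
-- What is running right now, and could it hold a conflicting lock?
-- Any long-running query here will block the ALTER's ACCESS
-- EXCLUSIVE lock, and the ALTER will then block everyone else.
SELECT pid, now() - query_start AS runtime, state, query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start;

-- Fail fast instead of queueing: if the migration can't acquire
-- its lock within 2 seconds, abort and retry later.
SET lock_timeout = '2s';
ALTER TABLE events ADD COLUMN tenant_region text;
```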
The Aftermath
- 47-minute outage triggered SLA penalty clauses for multiple enterprise customers
- Phased migration pattern adopted as the team standard (sketched below) — zero migration downtime since
- CI check implemented to flag unsafe migrations before deployment
- Migration review added as a required step in the deployment checklist
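The post doesn't spell out the phases the team standardized on, so what follows is a sketch of the common zero-downtime recipe for Postgres, reusing the hypothetical events/tenant_region names from above and assuming an id primary key.

```sql
-- Phase 1: add the column nullable with no default. This is a
-- metadata-only change on every Postgres version; the lock is
-- held for milliseconds.
ALTER TABLE events ADD COLUMN tenant_region text;

-- Phase 2: give new rows a value. SET DEFAULT never rewrites
-- existing rows.
ALTER TABLE events ALTER COLUMN tenant_region SET DEFAULT 'us-east-1';

-- Phase 3: backfill old rows in small batches (run repeatedly,
-- e.g. from a script, until it reports 0 rows updated) so no
-- single transaction locks a large slice of the table.
UPDATE events
SET tenant_region = 'us-east-1'
WHERE id IN (
    SELECT id FROM events
    WHERE tenant_region IS NULL
    LIMIT 5000
);

-- Phase 4: enforce NOT NULL without a blocking full-table scan.
-- NOT VALID makes the ADD CONSTRAINT instant; VALIDATE scans the
-- table but takes a lock that still permits reads and writes.
ALTER TABLE events
    ADD CONSTRAINT events_tenant_region_not_null
    CHECK (tenant_region IS NOT NULL) NOT VALID;
ALTER TABLE events VALIDATE CONSTRAINT events_tenant_region_not_null;

-- Phase 5 (Postgres 12+): SET NOT NULL sees the validated check
-- constraint and skips its own scan; the check is then redundant.
ALTER TABLE events ALTER COLUMN tenant_region SET NOT NULL;
ALTER TABLE events DROP CONSTRAINT events_tenant_region_not_null;
```

Each phase is separately deployable and separately revertible, which is also what makes the approach enforceable: the CI check mentioned above can be as blunt as flagging any single-statement ADD COLUMN ... NOT NULL DEFAULT against a table over a size threshold.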