A Database Migration Took Down the Entire Platform
A routine schema migration brought down a multi-tenant SaaS platform for 47 minutes during business hours.
Situation
You're the senior engineer at a B2B SaaS company. A new feature requires adding a non-nullable column with a default value to your largest Postgres table, which holds 80 million rows. The migration runs cleanly in staging, where the same table holds only 50,000 rows.
Stakes
- Multi-tenant platform serving 200+ enterprise customers
- Schema change required on the largest table in the system
- SLA breach threshold is 30 minutes of downtime per quarter
Production deployment is scheduled for Tuesday morning. What's your deployment strategy?
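One possible answer, sketched under assumptions rather than taken from the incident itself, is to split the change into small steps so that no single statement holds a long lock on the 80-million-row table. The Python below only builds the SQL statements as strings; it does not connect to a database. The table name `events`, the column name `region`, the integer primary key `id`, the 5-second `lock_timeout`, and the batch size are all hypothetical. It also assumes PostgreSQL 12 or later, where `SET NOT NULL` can skip its full-table scan when a validated `CHECK (... IS NOT NULL)` constraint already exists.

```python
from typing import Iterator, List, Tuple


def batch_ranges(min_id: int, max_id: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (start, end) primary-key ranges covering [min_id, max_id]."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    start = min_id
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield start, end
        start = end + 1


def staged_migration_sql(
    table: str,
    column: str,
    sql_type: str,
    default_literal: str,
    min_id: int,
    max_id: int,
    batch_size: int,
) -> List[str]:
    """Build an illustrative staged plan for adding a NOT NULL column.

    Identifiers are interpolated directly, so this is only suitable for
    trusted, hard-coded names (all names here are hypothetical).
    """
    check_name = f"{table}_{column}_not_null"
    plan = [
        # Fail fast instead of queueing behind long transactions and
        # blocking every other query on the table.
        "SET lock_timeout = '5s';",
        # Step 1: add the column as nullable with no default.
        f"ALTER TABLE {table} ADD COLUMN {column} {sql_type};",
        # Step 2: the default applies to rows inserted from now on.
        f"ALTER TABLE {table} ALTER COLUMN {column} SET DEFAULT {default_literal};",
    ]
    # Step 3: backfill existing rows in small batches. Each statement is
    # meant to run in its own short transaction.
    for start, end in batch_ranges(min_id, max_id, batch_size):
        plan.append(
            f"UPDATE {table} SET {column} = {default_literal} "
            f"WHERE id BETWEEN {start} AND {end} AND {column} IS NULL;"
        )
    plan += [
        # Step 4: add the constraint without checking existing rows,
        # then validate it in a separate statement.
        f"ALTER TABLE {table} ADD CONSTRAINT {check_name} "
        f"CHECK ({column} IS NOT NULL) NOT VALID;",
        f"ALTER TABLE {table} VALIDATE CONSTRAINT {check_name};",
        # Step 5: assumes PostgreSQL 12+, which can use the validated
        # CHECK constraint instead of rescanning the table.
        f"ALTER TABLE {table} ALTER COLUMN {column} SET NOT NULL;",
        f"ALTER TABLE {table} DROP CONSTRAINT {check_name};",
    ]
    return plan


if __name__ == "__main__":
    # Tiny id range so the whole plan fits on screen.
    for statement in staged_migration_sql("events", "region", "text", "'unknown'", 1, 10, 4):
        print(statement)
```

With a batch size of 10,000 and contiguous ids, 80 million rows would produce 8,000 backfill statements. The point of the sketch is that each one touches a small slice of the table, which is behavior a 50,000-row staging table cannot exercise. Whether this plan fits depends on the Postgres version, the write load, and how the application handles a column that is temporarily nullable.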