A Feature Flag We Forgot About Caused a Production Incident
A feature flag created 18 months ago was still in the codebase. When the flag provider timed out, the flag evaluated to its default value, which no longer matched the state of the system. The result was a data corruption bug that took three days to fully remediate.
- Stale feature flag controlling a write path to a financial data pipeline
- Flag provider outage caused the flag to evaluate to a default nobody remembered setting
- Data inconsistency propagated for 6 hours before detection
Feature flags are treated as temporary by design but permanent by neglect. This flag was supposed to be removed after a two-week rollout 18 months ago. Instead, it became invisible load-bearing infrastructure — and when it failed, it failed silently into a state that corrupted downstream data.
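The failure mode is easy to sketch. A minimal example, with all names hypothetical (this is not the actual flag SDK involved): a flag client that asks a remote provider, and on timeout silently falls back to a default that was set when the flag was created.

```python
class FlagProviderTimeout(Exception):
    """Raised when the remote flag provider does not respond in time."""

class FlagClient:
    """Minimal flag client: ask the provider, fall back to a default on timeout."""

    def __init__(self, provider, defaults):
        self.provider = provider    # remote evaluation service
        self.defaults = defaults    # per-flag fallback values, set at flag creation

    def is_enabled(self, flag_name):
        try:
            return self.provider.evaluate(flag_name, timeout=0.2)
        except FlagProviderTimeout:
            # Silent fallback: no alert fires, and the default may be
            # 18 months out of date.
            return self.defaults[flag_name]

class DownProvider:
    """Simulates a provider outage: every evaluation times out."""
    def evaluate(self, flag_name, timeout):
        raise FlagProviderTimeout()

# The default was valid during the original rollout, not today.
client = FlagClient(DownProvider(), {"use_new_transaction_writer": False})
assert client.is_enabled("use_new_transaction_writer") is False
```

Nothing in this path distinguishes "the provider said False" from "the provider is down and we guessed False", which is exactly why the failure was silent.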
The Scenario
You're the tech lead at a fintech company. An alert fires: financial transaction records don't match between two systems. Investigation reveals a feature flag that's been in the codebase for 18 months is suddenly evaluating to its default value because the flag provider is timing out. The default routes transactions through a deprecated code path that writes data in an old format. It's been running this way for 6 hours. What do you do?
No hints. Just judgment.
Hardcoding the flag and deploying immediately is the right instinct: stop the bleeding first. But deploying before establishing the corruption boundary leaves the remediation team with no clean scope for recovery. They end up diffing entire datasets instead of querying a precise time window. Two minutes of scoping before the fix saves days of recovery work after.
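Establishing the boundary can be as simple as finding the first and last record written in the old format. A sketch, assuming records carry a write timestamp and a format-version field (both names hypothetical):

```python
from datetime import datetime

def corruption_window(records):
    """Return (start, end) timestamps of records written in the old format.

    Scoping the window first gives the remediation team a precise query
    range instead of a full-dataset diff.
    """
    bad = [r["written_at"] for r in records if r["format_version"] == "v1"]
    if not bad:
        return None
    return min(bad), max(bad)

records = [
    {"written_at": datetime(2024, 5, 1, 9, 0),  "format_version": "v2"},
    {"written_at": datetime(2024, 5, 1, 10, 0), "format_version": "v1"},
    {"written_at": datetime(2024, 5, 1, 14, 30), "format_version": "v1"},
]
start, end = corruption_window(records)
```

In practice this would be a query against the pipeline's datastore rather than an in-memory scan, but the principle is the same: bound the damage before you change the system.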
- Feature flags are temporary by design but permanent by neglect — enforce expiration
- A flag's default value is a silent contract with the future — it must remain valid or the flag must be removed
- Establish the corruption boundary before deploying the fix — two minutes of scoping saves days of recovery
- Flag provider failures should fail safe, not fail to a default that may no longer be correct
- Stale flags are invisible technical debt until they become visible incidents
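Enforcing expiration does not require anything elaborate. One approach, sketched here with a hypothetical registry format: every flag declares a review date at creation, and a CI job fails the build when that date has passed.

```python
from datetime import date

# Hypothetical flag registry: every flag declares a review date at creation.
FLAG_REGISTRY = {
    "use_new_transaction_writer": date(2023, 6, 1),   # long past review
    "enable_dark_mode": date(2026, 1, 15),
}

def expired_flags(registry, today):
    """Flags whose review date has passed; a CI job can fail the build on these."""
    return sorted(name for name, review in registry.items() if review < today)
```

A failing build forces the conversation the two-week rollout never had: remove the flag, or consciously extend its review date.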
- Data remediation completed in 8 hours instead of an estimated 3 days
- 23 stale feature flags identified and removed across the codebase
- Flag expiration policy implemented — all new flags require a review date
- Flag provider SDK configured with fallback cache to prevent timeout-driven defaults
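The fallback-cache idea can be sketched as a thin wrapper: on timeout, serve the last value the provider actually returned rather than a hardcoded default. The names are hypothetical, not a real SDK's API.

```python
class FlagProviderTimeout(Exception):
    """Raised when the remote flag provider does not respond in time."""

class CachedFlagClient:
    """On provider timeout, serve the last successfully fetched value
    instead of a possibly-stale hardcoded default."""

    def __init__(self, provider):
        self.provider = provider
        self.cache = {}    # flag_name -> last known value

    def is_enabled(self, flag_name, default=False):
        try:
            value = self.provider.evaluate(flag_name, timeout=0.2)
            self.cache[flag_name] = value   # refresh on every success
            return value
        except FlagProviderTimeout:
            # Prefer the last value actually served in production; the
            # hardcoded default is only the last resort (e.g. cold start).
            return self.cache.get(flag_name, default)
```

This keeps an outage from flipping a flag's effective value, at the cost of needing a persistence story for the cache across restarts.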
Related
A Minor Dependency Update Broke Production for 12 Hours
A routine patch update to a date-formatting library changed its locale handling. The change was semver-compliant. Tests passed. The bug shipped to production and silently corrupted date-sensitive financial reports for 12 hours.
Production Pods Were Restarting Randomly
A production incident involving intermittent connection failures and unstable recovery behavior. The fix was surgical. The real lesson was about what not to do first.