How one computer file accidentally took down 20% of the internet yesterday – in plain English
Summary
Yesterday's major internet outage at Cloudflare, whose network carries roughly 20% of web traffic, was caused by a single, accidental configuration error. A database permissions update caused the system that builds the bot-detection file to pull duplicate information, inflating the file past a hard limit of 200 items (it normally contains about sixty).
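To make that concrete, here is a small, purely illustrative Python sketch of how duplicate database rows can push a generated file past a hard item limit. This is not Cloudflare's actual code; apart from the 200-item cap and the roughly sixty normal entries mentioned above, every name and number is an assumption for demonstration.

```python
# Illustrative only: duplicate rows inflating a generated config file past a cap.
MAX_FEATURES = 200          # hard limit the loader enforces
NORMAL_FEATURE_COUNT = 60   # roughly what the file usually holds

def build_feature_file(rows):
    """Build the list of bot-detection features from database rows."""
    features = [row["feature_name"] for row in rows]
    if len(features) > MAX_FEATURES:
        raise ValueError(
            f"feature file has {len(features)} entries, limit is {MAX_FEATURES}"
        )
    return features

# Normally: ~60 unique rows, comfortably under the limit.
clean_rows = [{"feature_name": f"feature_{i}"} for i in range(NORMAL_FEATURE_COUNT)]
print(len(build_feature_file(clean_rows)))  # 60

# After the permissions change: the same rows come back several times over,
# the count jumps past 200, and the build fails.
duplicated_rows = clean_rows * 4  # 240 entries
try:
    build_feature_file(duplicated_rows)
except ValueError as err:
    print(f"rebuild failed: {err}")
```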
When Cloudflare's servers tried to load this oversized file, the bot-detection component failed to start, and many websites behind Cloudflare began returning HTTP 5xx errors. The problem was compounded by a five-minute rebuild cycle: as different parts of the database picked up the permissions change, each rebuild could produce either a good or a bad version of the file, so the bad file kept being reintroduced. This on-off pattern made diagnosis difficult at first, because it resembled a potential DDoS attack.
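A simplified sketch of why the outage flapped on and off: the file was rebuilt every five minutes, and the result depended on whether that rebuild happened to read from a database node that had already received the permissions update. The node names and counts below are assumptions, not Cloudflare's actual topology.

```python
# Illustrative only: alternating good/bad rebuilds as the database update rolls out.
import itertools

# Pretend half the nodes have the permissions change applied so far.
nodes = ["updated", "not-updated", "updated", "not-updated"]

for cycle, node in zip(range(6), itertools.cycle(nodes)):
    bad_file = (node == "updated")  # updated nodes return duplicate rows
    status = "5xx errors" if bad_file else "traffic flows normally"
    print(f"rebuild {cycle} (every 5 min) read from a {node} node -> {status}")
```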
The resolution involved stopping the generation of new bot files, pushing a known-good file, and restarting the core servers. Cloudflare applied a bypass for certain services around 13:05 UTC, core traffic began flowing again by 14:30 UTC, and downstream services fully recovered by 17:06 UTC. Cloudflare described the failure as highlighting a design tradeoff: strict limits intended to protect performance produced a hard stop, rather than a graceful fallback, when an internal file was malformed. The company plans to harden validation of internally generated configuration files and add more kill switches.
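The difference between the hard stop that happened and the graceful fallback Cloudflare points toward can be sketched in a few lines. Everything here is an assumption for illustration (function names, the validation rule, the fallback mechanism), not Cloudflare's internals.

```python
# Illustrative only: keep serving the last known-good file when a new one fails validation.
MAX_FEATURES = 200

def validate(features):
    if len(features) > MAX_FEATURES:
        raise ValueError(f"{len(features)} entries exceeds the {MAX_FEATURES} limit")

def load_with_fallback(candidate, last_known_good):
    """Return (config to serve, updated last-known-good)."""
    try:
        validate(candidate)
    except ValueError as err:
        # The hard-stop design would crash here; the fallback design logs the
        # problem and keeps the old file so traffic stays up while operators
        # (or a kill switch) deal with the bad rebuild.
        print(f"rejected new file: {err}")
        return last_known_good, last_known_good
    return candidate, candidate

good = [f"feature_{i}" for i in range(60)]
bad = good * 4                                            # duplicated rows, 240 entries
serving, last_good = load_with_fallback(good, None)       # accepted
serving, last_good = load_with_fallback(bad, last_good)   # rejected, old file kept
print(f"still serving {len(serving)} features")
```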
(Source: CryptoSlate)