On 18 November 2025, services representing a substantial portion of global web traffic went dark for several hours. If your team tried accessing X (formerly Twitter), OpenAI, ChatGPT, or thousands of other websites that morning, you probably saw HTTP 5xx errors. The cause wasn’t a cyberattack or DDoS incident. It was something more mundane but equally disruptive.
A routine database permissions change at Cloudflare triggered a bug in their bot management system, causing widespread service disruption. A permissions tweak. That’s all it took. The outage started around 11:30 UTC, with services gradually restored from 14:30 UTC onwards and full recovery by 17:06 UTC.
A database tweak nobody expected to matter
The problem started with a database permissions update within Cloudflare’s infrastructure. Their bot management system relies on what’s called a “feature file”, a configuration file listing the traffic characteristics (the “features”) its machine learning model uses to decide whether a request is a bot. This file is generated from a database query and then distributed to servers across Cloudflare’s global network.
The database permissions changed, and suddenly an unfiltered query had access to more data than intended. The query pulled in duplicate entries and unnecessary information, inflating the feature file from around 60 features to more than 200. The bot management software had a hard limit of 200 features built in to prevent excessive memory usage, and under normal circumstances nobody expected that limit to ever be reached.
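To make that limit concrete, here’s a minimal Rust sketch of the pattern described above. The names, numbers, and structure are invented for illustration; this is not Cloudflare’s code.

```rust
// Illustrative only: a loader with a hard cap on how many features it will
// accept, to bound memory use. Normal usage sits far below the cap.
const MAX_FEATURES: usize = 200;

fn load_features(entries: Vec<String>) -> Result<Vec<String>, String> {
    if entries.len() > MAX_FEATURES {
        // The branch nobody expected to hit in practice.
        return Err(format!(
            "{} entries exceeds the limit of {MAX_FEATURES}",
            entries.len()
        ));
    }
    Ok(entries)
}

fn main() {
    // Simulate the inflated file: duplicate rows push ~60 distinct features
    // well past the cap. The duplication factor here is made up.
    let inflated: Vec<String> = (0..60)
        .flat_map(|i| std::iter::repeat(format!("feature_{i}")).take(4))
        .collect();
    match load_features(inflated) {
        Ok(features) => println!("loaded {} features", features.len()),
        Err(e) => eprintln!("refusing to load feature file: {e}"), // this branch fires
    }
}
```

The cap itself is sensible. The trouble starts when the input that was never supposed to cross it suddenly does.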
How one oversized file took down the internet
As the oversized file propagated across Cloudflare’s network, servers started trying to load it into memory. The bot management software hit a feature count it wasn’t designed to handle and crashed. Engineers call this a “panic”: the software detects an unrecoverable error and shuts down rather than carrying on in an unstable state.
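Cloudflare’s newer proxy is reportedly written in Rust, and in that language a panic typically surfaces as a call that assumes success on a value that turned out to be an error. The sketch below is a generic illustration of that pattern, not Cloudflare’s code:

```rust
// Generic illustration of a panic: the caller assumes the configuration can
// always be loaded, so an unexpected Err aborts the thread instead of being
// handled gracefully.
fn parse_feature_file(entries: usize, limit: usize) -> Result<usize, String> {
    if entries > limit {
        return Err(format!("{entries} entries exceeds the limit of {limit}"));
    }
    Ok(entries)
}

fn main() {
    // The "this can never fail" assumption, encoded as .expect(): when the
    // inflated file arrives, this line panics and the thread stops serving.
    let loaded = parse_feature_file(240, 200).expect("feature file within limits");
    println!("loaded {loaded} features"); // never reached
}
```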
Here’s the thing: the software wasn’t malfunctioning. It behaved exactly as programmed when faced with unexpected input. The real issue was that a routine database change could produce a file large enough to trigger this safety mechanism, and no check in the deployment pipeline caught the oversized file before it reached production systems.
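One common defence, sketched below with invented names and limits, is to validate a generated artifact before it is ever distributed, so a bad file fails once at generation time rather than on every server that loads it:

```rust
use std::collections::HashSet;

// Illustrative pre-publish gate for a generated configuration artifact.
// Names and limits are assumptions, not Cloudflare's pipeline.
fn safe_to_publish(entries: &[String], max_entries: usize) -> Result<(), String> {
    let distinct: HashSet<&String> = entries.iter().collect();
    if distinct.len() != entries.len() {
        return Err("generated file contains duplicate entries".into());
    }
    if entries.len() > max_entries {
        return Err(format!(
            "generated file has {} entries, above the allowed {max_entries}",
            entries.len()
        ));
    }
    Ok(())
}

fn main() {
    let generated = vec![
        "feature_1".to_string(),
        "feature_1".to_string(), // duplicate row from an unfiltered query
        "feature_2".to_string(),
    ];
    if let Err(reason) = safe_to_publish(&generated, 200) {
        eprintln!("refusing to publish: {reason}");
    }
}
```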
The faulty file reached Cloudflare’s edge servers, the machines that sit between users and websites, and the impact spread fast. These servers filter traffic and protect against attacks, but when the bot management component crashed, they couldn’t handle incoming requests properly. Websites protected by Cloudflare started returning HTTP 500 and 502 errors.
The scale of disruption reflected just how much of the internet runs through Cloudflare. X, OpenAI, and ChatGPT all went offline simultaneously. For end users, it didn’t matter which site they tried to access. They all showed the same error pages.
Customers on Cloudflare’s newer proxy engine saw HTTP 5xx errors. Those still on the older system didn’t get error messages, but bot protection stopped working correctly. All traffic got assigned a bot score of zero, which meant businesses with rules to block bots would have seen false positives everywhere.
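The mechanics of those false positives are easy to picture. Here’s a minimal sketch, assuming a hypothetical “block likely bots” rule and an illustrative scoring scale (not Cloudflare’s API):

```rust
// Illustrative firewall rule: block requests the model scores as likely bots.
// The scale and threshold are made up for the example.
struct Request {
    bot_score: u8, // here: 0 = certainly a bot, higher = more likely human
}

fn should_block(req: &Request, threshold: u8) -> bool {
    req.bot_score < threshold
}

fn main() {
    // During the incident, every request reportedly carried a score of zero,
    // so any "block low scores" rule blocks everyone, including real users.
    let legitimate_visitor = Request { bot_score: 0 };
    println!("blocked: {}", should_block(&legitimate_visitor, 30)); // true
}
```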
The damage spread beyond the core proxy. Workers KV, Cloudflare’s key-value storage service, stopped functioning. So did anything built on top of it, including Access (their zero-trust product) and even Cloudflare’s own dashboard. The dashboard problem was particularly awkward: it uses Workers KV internally and includes Turnstile, Cloudflare’s CAPTCHA alternative, in the login flow, so if you didn’t already have an active session, you couldn’t log in to see what was happening.
How they tracked it down
With thousands of services suddenly failing, pinpointing the exact cause wasn’t straightforward. Cloudflare’s response team initially thought they were dealing with a major DDoS attack. The pattern of network-wide failures looked like what happens when attackers flood systems with malicious traffic. Engineers started checking for signs of coordinated attacks.
What really threw them off was that even Cloudflare’s status page went down. That page is hosted completely separately from Cloudflare’s infrastructure with no dependencies on their network. When your own monitoring and communication tools fail alongside everything else, it looks less like an internal bug and more like someone’s attacking you.
The breakthrough came when engineers correlated the timing of system failures with the deployment of the updated feature file. Once they stopped distributing new versions and rolled back to the previous configuration, affected machines started recovering.
Why this matters for businesses
Cloudflare protects millions of websites and handles a significant slice of global internet traffic. When a provider at that scale has problems, the effects ripple across seemingly unrelated services and companies.
Your business can be doing everything right, but if your infrastructure provider experiences an outage, you’re going offline too.
The irony? Database permission changes happen constantly. This one just exposed an assumption nobody questioned. What made this change risky was the lack of filtering on the query that generated the feature file. The pipeline assumed the query would always produce acceptable output, and that assumption became invalid the moment the permissions changed.
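The fix Cloudflare describes, adding filtering to the queries, amounts to stating those assumptions explicitly. A hypothetical sketch, where the table, column, and schema names are invented:

```rust
// Hypothetical query-building code, not Cloudflare's. The point is to state
// the query's assumptions explicitly instead of relying on whatever the
// account's permissions happen to expose.
fn feature_query() -> String {
    // Before: effectively "give me every row this account can now see".
    // After: constrain to one schema and deduplicate, so a permissions change
    // can widen access without changing the query's output.
    "SELECT DISTINCT feature_name \
     FROM feature_metadata \
     WHERE schema_name = 'default'"
        .to_string()
}

fn main() {
    println!("{}", feature_query());
}
```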
Software limits exist for good reasons, but they also create points of failure. Systems need monitoring that flags when operations are approaching those limits, not just when they’re exceeded. There should be buffer space between normal operation and hard limits.
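In practice that can be as simple as a soft threshold that alerts well before the hard cap. A sketch with made-up numbers:

```rust
// Illustrative soft-threshold check: alert well before the hard limit is hit,
// so a growing configuration file shows up as a warning on a dashboard
// instead of as an outage. Both values are assumptions.
const HARD_LIMIT: usize = 200;
const WARN_RATIO: f64 = 0.8; // warn at 80% of the limit

fn check_utilisation(entries: usize) {
    let utilisation = entries as f64 / HARD_LIMIT as f64;
    if entries > HARD_LIMIT {
        eprintln!("CRITICAL: {entries} entries exceed the hard limit of {HARD_LIMIT}");
    } else if utilisation >= WARN_RATIO {
        eprintln!(
            "WARNING: feature file at {:.0}% of its hard limit",
            utilisation * 100.0
        );
    }
}

fn main() {
    check_utilisation(60);  // normal operation: silent
    check_utilisation(170); // approaching the cap: warning fires
    check_utilisation(240); // over the cap: critical
}
```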
The recovery process
Once Cloudflare identified the cause, they stopped distributing the problematic file and either rolled back affected machines or restarted them with corrected files. The phased recovery meant different parts of their network came back at different times. Services that received the faulty update later in the cycle recovered more quickly.
By 17:06 UTC, all systems were back to normal, though some customers saw slow load times and intermittent connection errors as traffic patterns stabilised.
When consolidation becomes vulnerability
Internet infrastructure keeps consolidating around a handful of major providers. Cloudflare, AWS CloudFront, Fastly. These services handle massive amounts of traffic for everyone from individual bloggers to multinational corporations. The efficiency and cost savings are compelling, but the concentration creates systemic risk.
We’ve noticed that major outages increasingly stem from the interaction of multiple systems rather than single component failures. A database change that’s perfectly safe on its own combines with software assumptions that are reasonable in isolation, and the result is widespread service disruption. This pattern shows up across the industry.
When you’re evaluating infrastructure providers or reviewing your disaster recovery plans, start here:
Single points of failure matter. Even with the best providers, outages happen. How long can your business operate if a key service goes down? Start by documenting your RTO (Recovery Time Objective) for each critical service. Can you tolerate a five-hour outage? Thirty minutes? The answer shapes everything else.
Redundancy has costs. You can use multiple CDN providers, maintain fallback options, or simply accept some temporary downtime as a known risk. Each approach carries trade-offs in complexity, cost, and effort.
Testing catches some things, not everything. Cloudflare’s testing procedures didn’t catch the interaction between a database change and bot management software limits. Complex systems can fail in ways that aren’t obvious until they happen. The challenge isn’t just having redundancy and monitoring in place. It’s anticipating how different parts of your infrastructure might interact under unusual conditions.
Cloudflare has published a detailed post-mortem on their blog that’s worth reading if you’re responsible for infrastructure decisions. They’re implementing fixes: adding filtering to database queries, adjusting how size limits are handled, and improving monitoring. The harder work is reviewing processes to understand how a routine change reached production without anyone spotting the downstream effects.
For businesses relying on cloud infrastructure and third-party services, the November outage proved something important. Preparation isn’t about preventing failure. It’s about surviving it when your infrastructure provider can’t. And yes, that includes the providers who promise you five nines of reliability.
