Bvoxro Stack

Cloudflare's 'Fail Small' Initiative: A Stronger, More Resilient Network for Customers

Cloudflare completes 'Code Orange: Fail Small' project, introducing safer config changes, health-mediated deployment via Snapstone, and improved incident management to prevent global outages.

Bvoxro Stack · 2026-05-05 00:23:30 · Education & Careers

Introduction: Building a More Resilient Cloudflare

After months of intensive engineering work, Cloudflare has completed a major project internally known as “Code Orange: Fail Small”. This initiative was born from the need to address the global outages that occurred on November 18, 2025, and December 5, 2025. While achieving perfect reliability is an ongoing journey, this effort has delivered concrete improvements that will make the network more resilient for every customer. The project focused on safer configuration changes, reducing the blast radius of failures, revamping incident management procedures, and ensuring long-term stability through prevention of drift and regressions. Let’s dive into the key changes and what they mean for your traffic.

Source: blog.cloudflare.com

Key Areas of Improvement

Safer Configuration Changes

One of the most significant changes is how Cloudflare handles internal configuration updates. Previously, a configuration change could propagate across the entire network almost instantly, so a single bad change could have widespread impact. Cloudflare has now identified its high-risk configuration pipelines and added tooling so that changes to them no longer go live everywhere at once. Instead, they are deployed progressively using health-mediated deployment, the same approach used for software releases. As a configuration change rolls out, real-time health monitoring watches for anomalies and can automatically revert the change before it spreads further. For customers, this translates to fewer disruptions during updates.
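To make the idea of health-mediated deployment concrete, here is a minimal Python sketch of a staged rollout that reverts at the first unhealthy signal. It is an illustration under stated assumptions, not Cloudflare's implementation: the stage fractions, the error-rate threshold, and every function name are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical rollout stages: fraction of the fleet that has the change.
STAGES: Tuple[float, ...] = (0.01, 0.05, 0.25, 1.0)


def progressive_rollout(
    stages: Sequence[float],
    apply_to_fraction: Callable[[float], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Widen the rollout one stage at a time; revert on the first bad health signal."""
    for fraction in stages:
        apply_to_fraction(fraction)
        if not is_healthy():
            rollback()
            return False
    return True


def simulate(error_rate_at: Callable[[float], float],
             threshold: float = 0.02) -> Tuple[bool, List[str]]:
    """Drive the rollout against a fake fleet whose error rate depends on exposure."""
    state = {"fraction": 0.0}
    log: List[str] = []

    def apply_to_fraction(fraction: float) -> None:
        state["fraction"] = fraction
        log.append(f"apply {fraction}")

    def is_healthy() -> bool:
        # Assumed health signal: observed error rate stays under the threshold.
        return error_rate_at(state["fraction"]) <= threshold

    def rollback() -> None:
        state["fraction"] = 0.0
        log.append("rollback")

    ok = progressive_rollout(STAGES, apply_to_fraction, is_healthy, rollback)
    return ok, log
```

In this sketch, a change that only starts misbehaving at 5% exposure is rolled back at that stage and never reaches the 25% or 100% stages, which is the property that bounds the impact of a bad change.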

Reducing the Impact of Failure

Even with the best precautions, failures can still occur. To minimize their effect, Cloudflare has implemented strategies to limit the blast radius. This includes better isolation of different network components and ensuring that a problem in one area doesn’t cascade across the entire infrastructure. By designing systems to fail small—rather than allowing individual issues to bring down large portions of the network—the team has significantly improved overall reliability.
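One common way to make a single component fail small is to keep serving the last known-good configuration when a new one fails validation, rather than crashing. The article does not say which specific techniques Cloudflare used, so the Python sketch below is only an illustration of the general principle; the class, the validator, and the `MAX_FEATURES` limit are all hypothetical.

```python
from typing import Any, Callable, Dict

MAX_FEATURES = 200  # hypothetical size limit, for illustration only


def validate_feature_list(config: Dict[str, Any]) -> None:
    """Raise ValueError if the candidate config is malformed or too large."""
    features = config.get("features")
    if not isinstance(features, list):
        raise ValueError("'features' must be a list")
    if len(features) > MAX_FEATURES:
        raise ValueError("too many features")


class ConfigHolder:
    """Holds the active config and refuses updates that fail validation."""

    def __init__(self, initial: Dict[str, Any],
                 validate: Callable[[Dict[str, Any]], None]) -> None:
        validate(initial)  # the starting config must itself be valid
        self._validate = validate
        self._active = initial
        self.rejected = 0  # counter an alerting system could watch

    @property
    def active(self) -> Dict[str, Any]:
        return self._active

    def try_update(self, candidate: Dict[str, Any]) -> bool:
        """Swap in the candidate if valid; otherwise keep the last known-good config."""
        try:
            self._validate(candidate)
        except ValueError:
            self.rejected += 1
            return False
        self._active = candidate
        return True
```

The design choice here is that a bad input degrades to stale configuration plus a visible rejection counter, instead of taking the process down with it.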

Revised Incident Management and “Break Glass” Procedures

The way Cloudflare responds to emergencies has also been overhauled. The recent outages highlighted the need for clearer protocols, especially when it comes to emergency access (often called “break glass” procedures). The updated incident management framework includes faster escalation paths, better-defined roles, and improved coordination between teams. These changes mean that when an incident does occur, the response is quicker and more effective, ultimately reducing downtime.

Introducing Snapstone: Health-Mediated Configuration Deployment

A centerpiece of this initiative is a new internal component called Snapstone. Before Snapstone, applying health-mediated deployment to configuration changes was possible but required significant per-team effort and was inconsistent across the network. Snapstone provides a unified system that bundles configuration changes into packages and gradually releases them with health mediation built in. It allows teams to dynamically define any unit of configuration that needs monitoring—whether it’s a data file like the one that caused the November outage, or a control flag like the one involved in the December incident. Snapstone’s flexibility means it can adapt to different types of configuration changes, not just specific past failures, making the network more resilient against future unknowns.
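Snapstone is an internal system and the article does not describe its interface, so the following Python sketch is purely hypothetical. It only illustrates the shape of the idea described above: different kinds of configuration units (a data file, a control flag) each declare their own health probe, are bundled into one package, and the package is released in stages and reverted as a whole if any unit's probe fails. All class, field, and function names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Sequence, Tuple


@dataclass
class ConfigUnit:
    """One monitored unit of configuration (hypothetical model)."""
    name: str
    kind: str                            # e.g. "data_file" or "control_flag"
    value: Any
    probe: Callable[[Any, float], bool]  # stand-in for a real health signal


@dataclass
class Package:
    """A bundle of units that is released, or reverted, together."""
    units: List[ConfigUnit] = field(default_factory=list)

    def failing_units(self, fraction: float) -> List[str]:
        return [u.name for u in self.units if not u.probe(u.value, fraction)]


def release(package: Package, live: Dict[str, Any],
            stages: Sequence[float] = (0.01, 0.1, 1.0)) -> Tuple[bool, List[str]]:
    """Apply the package stage by stage; restore the previous state on any failure."""
    previous = dict(live)
    for fraction in stages:
        for unit in package.units:
            live[unit.name] = unit.value
        failing = package.failing_units(fraction)
        if failing:
            live.clear()
            live.update(previous)  # revert the whole package, not just one unit
            return False, failing
    return True, []
```

In a real system the probes would read live health metrics from the machines that received the change; here they are plain functions of the value and the rollout fraction so the sketch stays self-contained.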

Long-Term Resilience: Preventing Drift and Regressions

Improvements are only valuable if they last. To prevent drift and regressions over time, Cloudflare has introduced automated checks and controls that verify new changes adhere to the new safety standards. This includes continuous monitoring of configuration health and periodic reviews of incident response procedures. By building resilience into the development lifecycle, the team aims to keep the hard lessons learned from past outages embedded in the system, even as the network evolves.
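An automated check of this kind could be as simple as a policy lint that runs in CI and flags any high-risk pipeline that has drifted away from the safety standard. The Python sketch below assumes a hypothetical pipeline description with `name`, `risk`, `progressive_rollout`, and `health_checks` fields; none of these come from the article.

```python
from typing import Any, Dict, List


def find_violations(pipelines: List[Dict[str, Any]]) -> List[str]:
    """Return one message per policy violation found in high-risk pipelines."""
    violations: List[str] = []
    for pipeline in pipelines:
        name = pipeline.get("name", "<unnamed>")
        if pipeline.get("risk") != "high":
            continue  # this hypothetical policy only covers high-risk pipelines
        if not pipeline.get("progressive_rollout", False):
            violations.append(f"{name}: progressive rollout disabled")
        if not pipeline.get("health_checks"):
            violations.append(f"{name}: no health checks declared")
    return violations
```

A CI job could fail the build whenever `find_violations` returns a non-empty list, so a regression is caught at review time rather than during an incident.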

Enhanced Communication During Outages

Another important lesson from previous incidents was the need for clearer communication when things go wrong. Cloudflare has strengthened how it keeps customers informed during an outage, providing more timely and transparent updates. This includes better status pages, more frequent incident reports, and clearer explanations of what went wrong and what is being done to fix it. The goal is to ensure that even during challenging events, customers are never left in the dark.

Conclusion: A Stronger Network for Everyone

The completion of “Code Orange: Fail Small” marks an important milestone, but Cloudflare acknowledges that resilience is never truly “done.” It remains a top priority across all development efforts. Customers can expect fewer disruptions from configuration changes, faster incident responses, and a fundamentally more robust network. As Cloudflare continues to evolve, the principles of failing small and learning fast will guide every engineering decision—ensuring that your traffic flows smoothly and securely.
