2026-05-01
Open Source

GitHub's Roadmap to Reliability: Addressing Availability and Scaling for the Future

GitHub details recent availability incidents, the strain of rapidly scaling agentic development, and a multi-pronged plan including a 30X capacity increase, service isolation, and short-term fixes.

In recent months, GitHub experienced two availability incidents that fell short of our standards. We sincerely apologize for the disruption. In this Q&A, we dive into what happened, why it happened, and the comprehensive plan we're executing to ensure GitHub remains reliable even as software development accelerates at an unprecedented pace.

What sparked the urgent need for a 30X capacity increase?

In October 2025, we launched a plan to boost GitHub's capacity by 10X, aiming for major reliability and failover improvements. However, by February 2026, it became clear that 10X wasn't enough. The catalyst was a dramatic shift in how software is being built: since late December 2025, agentic development workflows—automated, AI-driven coding processes—have skyrocketed. This has driven exponential growth in repository creation, pull request activity, API usage, automation, and workloads on large repositories. The pace of change forced us to redesign for a future requiring 30X today's scale. For context on how these workflows impact systems, see our explanation of the ripple effects.

What caused the recent availability incidents?

Two distinct incidents occurred, both unacceptable. They were not caused by a single failure but by the compounding effects of rapid scaling. Small inefficiencies—like deep queues, cache misses turning into database load, index lag, retry amplification, and slow dependencies—cascaded across multiple systems. For example, a pull request touches Git storage, merge checks, branch protection, Actions, search, notifications, permissions, webhooks, APIs, background jobs, caches, and databases. Under high load, these interdependencies magnified problems. We have since analyzed each root cause and are implementing fixes detailed in our short-term actions.
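
Retry amplification deserves a concrete illustration: when a dependency slows down and every caller retries immediately, traffic multiplies exactly when the system can least absorb it. The standard mitigation is capped exponential backoff with full jitter. Here is a minimal Go sketch of that pattern (illustrative only, not GitHub's actual code):

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// callWithBackoff retries fn up to maxAttempts times, sleeping a
// capped, exponentially growing, fully jittered interval between
// attempts so that many failing clients do not retry in lockstep.
func callWithBackoff(fn func() error, maxAttempts int) error {
	const base = 100 * time.Millisecond
	const maxBackoff = 5 * time.Second
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		// Full jitter: sleep a random duration in [0, min(maxBackoff, base*2^attempt)).
		backoff := base << attempt
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, err)
}

func main() {
	flaky := func() error { return errors.New("dependency timeout") }
	fmt.Println(callWithBackoff(flaky, 4))
}
```

Jitter is the important part: without it, a fleet of clients that failed together retries together, recreating the spike on every backoff cycle.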

How does the surge in agentic development affect GitHub's systems?

Agentic workflows generate a wave of automated operations—frequent commits, automated PR creation, heavy API calls, and large repository clones. These activities don’t just stress one part of GitHub; they trigger a chain reaction across dozens of subsystems. A single automated PR can exercise a dozen services simultaneously. At high scale, even minor inefficiencies become major: queues deepen, cache misses force database hits, indexes fall behind, and retries amplify traffic. One slow dependency can degrade several product experiences. This is why our focus is on reducing unnecessary work, improving caching, isolating critical services, and removing single points of failure.
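
The cache-miss-to-database cascade has a well-known countermeasure: request coalescing, where concurrent misses for the same key share a single backend read. Below is a minimal sketch using Go's golang.org/x/sync/singleflight package; the loadFromDB helper is a hypothetical stand-in for a real database query:

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// loadFromDB stands in for an expensive database read (hypothetical helper).
func loadFromDB(key string) (string, error) {
	fmt.Println("database hit for", key)
	return "row-for-" + key, nil
}

// get coalesces concurrent cache misses for the same key: only one
// caller executes the database read; the rest wait and share its result.
func get(key string) (string, error) {
	v, err, _ := group.Do(key, func() (interface{}, error) {
		return loadFromDB(key)
	})
	if err != nil {
		return "", err
	}
	return v.(string), nil
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ { // 50 concurrent misses collapse to roughly one database hit
		wg.Add(1)
		go func() {
			defer wg.Done()
			get("session:42")
		}()
	}
	wg.Wait()
}
```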

What are GitHub’s top priorities for reliability?

Our priorities are clear and in order: availability first, capacity second, new features third. We are committed to ensuring GitHub remains accessible and performant under any load. To achieve this, we are reducing unnecessary work, improving caching, isolating critical services like Git and Actions, removing single points of failure, and moving performance-sensitive paths into purpose-built systems. This is classic distributed systems work: reducing hidden coupling, limiting blast radius, and allowing GitHub to degrade gracefully when one subsystem is under pressure. We are making progress quickly, but incidents like the recent ones show there is still work to do.
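
Degrading gracefully usually means failing fast once a dependency is known to be unhealthy, instead of queueing more work behind it. A deliberately small circuit-breaker sketch follows, assuming a trip-after-N-consecutive-failures policy; the real machinery inside GitHub is certainly more involved:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var ErrOpen = errors.New("circuit open: failing fast")

// Breaker fails fast after maxFails consecutive errors, then lets a
// trial request through once the cooldown has elapsed.
type Breaker struct {
	mu       sync.Mutex
	fails    int
	maxFails int
	openedAt time.Time
	cooldown time.Duration
}

func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.fails >= b.maxFails && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return ErrOpen // shed load instead of piling onto a sick dependency
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.fails++
		if b.fails >= b.maxFails {
			b.openedAt = time.Now() // (re)open the circuit
		}
		return err
	}
	b.fails = 0 // a healthy response closes the circuit
	return nil
}

func main() {
	b := &Breaker{maxFails: 3, cooldown: 2 * time.Second}
	down := func() error { return errors.New("notifications backend timeout") }
	for i := 0; i < 5; i++ {
		fmt.Println(b.Call(down)) // calls 4 and 5 fail fast without touching the backend
	}
}
```

The payoff is limited blast radius: callers of a sick subsystem get a quick, explicit error they can handle, rather than tying up threads and queues that other products depend on.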

What short-term actions did you take to fix bottlenecks?

We moved quickly on several bottlenecks that surfaced faster than expected. Key actions included migrating webhooks from MySQL to a more scalable backend, redesigning the user session cache to reduce database load, and reworking authentication and authorization flows. We also leveraged our migration to Azure to stand up significantly more compute. These moves relieved immediate pressure and bought time for deeper architectural changes. For a look at the next phase, see how we are isolating critical services.
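
To make the session-cache idea concrete, here is a hedged cache-aside sketch: reads are served from an in-process TTL cache and only fall through to the database on a miss or expiry. The loader function is a hypothetical stand-in for a MySQL lookup, not GitHub's actual design:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	value   string
	expires time.Time
}

// SessionCache is a minimal cache-aside layer: reads hit the map
// first and only fall through to loader on a miss or expired entry.
type SessionCache struct {
	mu     sync.RWMutex
	data   map[string]entry
	ttl    time.Duration
	loader func(id string) (string, error) // e.g. a database lookup
}

func (c *SessionCache) Get(id string) (string, error) {
	c.mu.RLock()
	e, ok := c.data[id]
	c.mu.RUnlock()
	if ok && time.Now().Before(e.expires) {
		return e.value, nil // cache hit: no database work at all
	}
	v, err := c.loader(id) // miss: one database read, then repopulate
	if err != nil {
		return "", err
	}
	c.mu.Lock()
	c.data[id] = entry{value: v, expires: time.Now().Add(c.ttl)}
	c.mu.Unlock()
	return v, nil
}

func main() {
	cache := &SessionCache{
		data: map[string]entry{},
		ttl:  5 * time.Minute,
		loader: func(id string) (string, error) {
			fmt.Println("database read for session", id)
			return "user-1001", nil
		},
	}
	cache.Get("sess-abc") // first read hits the database
	cache.Get("sess-abc") // second read is served from the cache
}
```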

How are you isolating critical services to prevent cascading failures?

With short-term fixes in place, we focused on isolating services like Git and GitHub Actions from other workloads. We carefully mapped dependencies and traffic tiers to understand what needs to be separated. By minimizing single points of failure and reducing blast radius, we ensure that an issue in one area (like a heavy automation run) doesn't take down the entire platform. We also accelerated the migration of performance-sensitive code from the Ruby monolith to Go, which offers better performance and isolation. This work is ongoing, prioritized by the highest-risk dependencies.
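
Service-level isolation is often mirrored inside a process with bulkheads: each traffic tier gets its own bounded concurrency, so a burst in one tier cannot starve another. A small sketch using golang.org/x/sync/semaphore, with the tier names and limits invented for illustration:

```go
package main

import (
	"fmt"

	"golang.org/x/sync/semaphore"
)

// Independent concurrency limits per traffic tier: a flood of
// automation requests cannot exhaust interactive capacity.
var (
	interactive = semaphore.NewWeighted(100) // human-facing requests
	automation  = semaphore.NewWeighted(20)  // agent and bot traffic
)

func handle(tier *semaphore.Weighted, name string, work func()) {
	if !tier.TryAcquire(1) {
		fmt.Println(name, "rejected: bulkhead full") // shed load early
		return
	}
	defer tier.Release(1)
	work()
}

func main() {
	// Simulate a burst of bot traffic: the automation bulkhead fills
	// and rejects the overflow, while interactive capacity is untouched.
	for i := 0; i < 20; i++ {
		automation.TryAcquire(1) // hold all 20 automation slots
	}
	handle(automation, "bot-pr", func() { fmt.Println("bot served") })
	handle(interactive, "web-request", func() { fmt.Println("web served") })
}
```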

What is the long-term architectural vision?

We were already moving from smaller custom data centers to the public cloud (Azure). Now we are accelerating our path to a multi-cloud setup, which will further reduce single points of failure and increase resilience. In parallel, we are redesigning core systems for 30X scale: improving caching, moving to more scalable databases, and rethinking how we handle authentication and authorization. The goal is a GitHub that can absorb massive growth without sacrificing reliability. We also continue to monitor agentic workflow patterns to anticipate future strain.
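
A multi-cloud posture ultimately surfaces in code as the ability to fail over between independent endpoints. The sketch below is deliberately simplified and assumes placeholder URLs; real failover hinges on health checks and data replication far beyond this:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Placeholder URLs for independent deployments; a real multi-cloud
// setup also needs replicated state, not just duplicate frontends.
var endpoints = []string{
	"https://primary.example.com/healthz",
	"https://secondary.example.com/healthz",
}

// fetchWithFailover tries each deployment in order and returns the
// first healthy response, so losing one provider degrades the
// service rather than breaking it.
func fetchWithFailover(client *http.Client) (*http.Response, error) {
	var lastErr error
	for _, url := range endpoints {
		resp, err := client.Get(url)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil
		}
		if err == nil {
			resp.Body.Close()
			err = fmt.Errorf("%s returned %d", url, resp.StatusCode)
		}
		lastErr = err // remember the failure and try the next endpoint
	}
	return nil, fmt.Errorf("all endpoints failed, last error: %w", lastErr)
}

func main() {
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := fetchWithFailover(client)
	if err != nil {
		fmt.Println("degraded:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("served by", resp.Request.URL.Host)
}
```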

We are committed to transparency and continuous improvement. Stay tuned for further updates on our progress.