
Achieving High Availability: How GitHub Rebuilt Search for Enterprise Server

GitHub rebuilt Enterprise Server's search architecture to eliminate deadlocks, reduce admin overhead, and improve high availability by replacing clustered Elasticsearch with a dedicated search system.

Bvoxro Stack · 2026-05-08 01:06:24 · Technology

Search is the unsung hero of GitHub. It powers not just the search bar but also the issues page, releases, projects, and even counting pull requests. For GitHub Enterprise Server (GHES) administrators, keeping search indexes healthy was a constant challenge, especially in High Availability (HA) setups. The old Elasticsearch-based architecture could lock up, break during upgrades, or require meticulous maintenance. After years of work, GitHub's engineers overhauled the search architecture to make it more resilient. Here's what they did, why it mattered, and how it makes your life easier.

Why was search so critical to GitHub Enterprise Server performance?

Search is woven into nearly every GitHub interaction. You see it in the standard search bars, but it also drives the filtering on the Issues page, the Releases and Projects pages, and the counts for issues and pull requests. If search goes down, those core features degrade or break. That's why GitHub invested heavily in making search highly available. In GHES High Availability setups, a failure in the search cluster could cascade into further problems and require manual intervention. By rebuilding the search architecture, GitHub aimed to cut downtime and administrative overhead and keep enterprise customers productive.

Source: github.blog

What are High Availability setups in GHES and how do they normally work?

High Availability (HA) is a design pattern that keeps your GitHub instance running even when a component fails. You have one primary node that handles all write operations and traffic, plus one or more replica nodes that stay in sync with the primary. If the primary fails, a replica can take over with minimal interruption. This leader/follower pattern is fundamental to GHES operations. However, integrating Elasticsearch into this pattern proved extremely tricky because Elasticsearch didn't natively support a simple leader/follower model across separate servers.

What specific problems did the old Elasticsearch integration cause?

Earlier versions of Elasticsearch couldn't support the dedicated primary/replica split that GHES needed, so engineers instead created a single Elasticsearch cluster spanning both the primary and replica nodes. Initially, this made data replication straightforward and even boosted performance, since each node served its own searches locally. But the downsides mounted. Elasticsearch could automatically move a primary shard (the copy that handles writes) from the primary server to a replica. If that replica was then taken offline for maintenance, the entire system could lock up: the replica would wait for Elasticsearch to become healthy, but Elasticsearch couldn't recover until the replica came back. This deadlock was a major pain point.

How did the locked state occur and why was it so hard to fix?

The locked state happened because of the clustering behavior. Imagine this scenario: Elasticsearch reassigns a primary shard to a replica node. You then take that replica down for routine updates. The primary node's Elasticsearch now knows it's missing a shard, so it waits for the cluster to become healthy. But the replica won't start up until Elasticsearch reports health, which can't happen because the replica is down. This circular dependency left GHES in a frozen state. The previous workarounds could only mitigate, not eliminate, the risk. Any admin who had to recover from this would face significant downtime.
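The circular dependency described above can be sketched in a few lines. This is an illustrative model, not GHES code; the function names and shard IDs are invented for the example.

```python
# Illustrative model of the deadlock: the cluster is healthy only when
# every required shard is assigned, and the replica only boots once the
# cluster is healthy. Names and shard IDs are hypothetical.

def elasticsearch_healthy(shards_present: set, shards_required: set) -> bool:
    """The cluster reports healthy only when all required shards are assigned."""
    return shards_required <= shards_present

def replica_can_boot(cluster_healthy: bool) -> bool:
    """The replica's startup gate: it waits for a healthy cluster."""
    return cluster_healthy

# Scenario from the text: Elasticsearch moved primary shard "s1" onto the
# replica, and the replica was then taken offline for maintenance.
shards_on_primary = {"s0"}          # "s1" went down with the replica
required = {"s0", "s1"}

healthy = elasticsearch_healthy(shards_on_primary, required)  # False
boots = replica_can_boot(healthy)                             # False

# The cluster can't become healthy until the replica (holding "s1") returns,
# and the replica won't return until the cluster is healthy: a deadlock.
assert not healthy and not boots
```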


What previous attempts did GitHub make to stabilize the search architecture?

For several releases, GitHub engineers tried to make the clustered Elasticsearch mode more robust. They added health checks to ensure Elasticsearch was in a good state before allowing certain operations. They built processes to correct drifting states when shards became inconsistent. They even experimented with a search mirroring system that would bypass clustering entirely. Unfortunately, database replication is inherently complex, and these efforts all ran into consistency issues. The mirroring approach, for example, required keeping two separate search databases in perfect sync, which proved impractical at scale. Each fix solved some problems but introduced others, leading to the decision to completely rethink the architecture.
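A pre-flight health check of the kind mentioned above might gate maintenance on the cluster's reported state. The sketch below is an assumed policy, not GitHub's actual check, though `status` and `unassigned_shards` are real fields in the response from Elasticsearch's `/_cluster/health` API.

```python
# Hypothetical maintenance gate: only allow a node to go offline when the
# cluster is green and no shards are unassigned, so a write-handling shard
# can't be stranded on the node about to go down. The policy is assumed;
# the field names match Elasticsearch's /_cluster/health response.

def safe_to_take_node_offline(health: dict) -> bool:
    """Return True when the parsed /_cluster/health response looks safe."""
    return (
        health.get("status") == "green"
        and health.get("unassigned_shards", 0) == 0
    )

# Example responses (abbreviated):
ok = {"status": "green", "unassigned_shards": 0}
risky = {"status": "yellow", "unassigned_shards": 2}
```

In practice such a check would fetch the health document from the cluster before each operation and refuse to proceed on anything but a fully assigned, green state.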

What finally changed in the search architecture?

After years of incremental fixes, GitHub decided to break away from the clustered Elasticsearch model. They developed a new architecture that uses a dedicated Elasticsearch cluster on the primary node only, with separate search data on replicas that stays synchronized through a custom replication layer. This eliminates the cross-server shard movement that caused lockups. The new design also includes better upgrade paths—admins no longer need to follow a precise order of steps to avoid index damage. By removing the clustering dependency, GitHub achieved true high availability for search without the deadlocks. The result: less management overhead for administrators and more reliable search performance.
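The shape of the new design can be sketched abstractly: writes land only on the primary's dedicated cluster, a replication step copies search data to replicas, and every node serves reads locally. This is a simplified illustration under those assumptions, not GHES internals.

```python
# Simplified sketch of the routing change (illustrative, not GHES code):
# no shared cluster, no cross-server shard movement.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Node:
    role: str                              # "primary" or "replica"
    index: dict = field(default_factory=dict)

def write(primary: Node, doc_id: str, doc: dict) -> None:
    """Index writes are accepted only on the primary's dedicated cluster."""
    assert primary.role == "primary", "writes must target the primary"
    primary.index[doc_id] = doc

def replicate(primary: Node, replica: Node) -> None:
    """The replication layer copies the primary's search data to a replica."""
    replica.index = dict(primary.index)

def search(node: Node, doc_id: str) -> Optional[dict]:
    """Any node answers searches from its own local copy of the data."""
    return node.index.get(doc_id)
```

Because a replica holds an independent copy rather than cluster shards, taking it offline no longer blocks the primary's cluster health.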

What benefits do GHES administrators see from the new search architecture?

The biggest win is reduced downtime. Before, a simple maintenance window could turn into an emergency recovery if a shard moved to the wrong node. Now, replica nodes can be taken offline independently without locking the entire search system. Upgrades are smoother: you no longer have to worry about index corruption from out-of-order steps. The new architecture also improves the experience for end users, who get consistent search results even when a replica is temporarily down. Overall, GitHub's rebuild makes GHES more durable, letting administrators focus on supporting their users instead of wrestling with search internals.
