GITHUB 2026-03-03

How we rebuilt the search architecture for high availability in GitHub Enterprise Server

Summary

GitHub Engineering describes a year-long rewrite of the GitHub Enterprise Server (GHES) search-indexing substrate, shipping in 3.19.1 (opt-in via ghe-config app.elasticsearch.ccr true, default in ~2 years). The old HA design ran a single multi-node Elasticsearch cluster spanning the primary and replica GHES nodes — a topology-misaligned choice forced by the fact that classic Elasticsearch didn't support a leader/follower pattern at the cluster level, while GHES's app layer is strictly leader/follower (primary takes all writes, replicas are read-only). The cross-node cluster could silently migrate primary shards onto the replica node, where they'd become stuck if the replica was taken down for maintenance (the replica blocked on Elasticsearch health, and Elasticsearch blocked on the replica rejoining — a mutual deadlock). The fix: collapse to per-node single-node Elasticsearch clusters and wire them up with Cross Cluster Replication (CCR). Each GHES node now operates an independent single-node cluster; CCR replicates Lucene segments leader→follower at the Elasticsearch level, mirroring GHES's application-layer primary/replica semantics. GitHub built custom lifecycle workflows around CCR (failover, index deletion, upgrades, bootstrap for pre-existing indexes) since ES only handles the document-replication leg.

Key takeaways

  • The pre-rewrite topology was a forced workaround: classic Elasticsearch couldn't support primary/replica cluster semantics, so GHES ran one ES cluster spanning both GHES nodes. This was chosen for "straightforward replication + per-node search locality" but created a structural mismatch between ES's rebalancing freedom and GHES's read-only-replica invariant. (Source: article § pre-state)
  • The failure mode was a mutual-blockage deadlock: ES would occasionally migrate a primary shard onto the replica GHES node for rebalancing; if the replica was then taken down for maintenance, the replica waited for ES to be healthy before starting up, but ES couldn't become healthy until the replica rejoined — a locked state requiring manual repair. (Source: article § problem)
  • Multiple years of mitigation attempts were structural dead-ends: health-check gates, drift-correction processes, and an abandoned "search mirroring" in-house DB-replication effort all failed to fix the underlying topology mismatch. "Database replication is incredibly challenging and these efforts needed consistency." (Source: article § "attempting to build a 'search mirroring' system")
  • The resolution is Elasticsearch Cross Cluster Replication (CCR): an upstream ES feature for replication between clusters. Applied here by making each GHES instance its own single-node ES cluster, then pointing CCR from primary→replica. CCR replicates at the Lucene segment level — i.e. data that has already been durably persisted to disk in the leader. (Source: article § "What changed?")
  • CCR's auto-follow API only covers indexes created after the policy exists, so GHES has pre-existing indexes that need a bootstrap step: list managed indexes on primary + replica, filter out system indexes, call ensure_follower_index or ensure_following for each, then install the auto-follow policy for future indexes. (Source: article § "Under the hood", pseudocode)
  • Elasticsearch only handles document replication; the rest of the index lifecycle is GitHub's problem: GHES engineers had to build custom workflows for failover, index deletion, and upgrades on top of CCR. A recurring cost when the storage engine supplies only the data-copy primitive: lifecycle orchestration stays with the operator. (Source: article § "Under the hood")
  • Migration is operator-triggered and one-way within a release: set ghe-config app.elasticsearch.ccr true → run config-apply or upgrade to 3.19.1 → on restart, ES consolidates all data onto the primary, breaks the cross-node cluster, and restarts replication via CCR. Duration scales with instance size. The feature is opt-in for now and will become the default over "the next two years." (Source: article § "How to get started")

Architecture

Pre-rewrite (the failure-prone topology):

 ┌──────────────────┐          ┌──────────────────┐
 │   Primary Node   │          │   Replica Node   │
 │ ┌──────────────┐ │          │ ┌──────────────┐ │
 │ │  ES Instance │─┼──┐    ┌──┼─│  ES Instance │ │
 │ └──────────────┘ │  │    │  │ └──────────────┘ │
 └──────────────────┘  │    │  └──────────────────┘
                       ▼    ▼
            ┌───────────────────────────┐
            │ ONE Elasticsearch Cluster │   ← ES can move
            │   spans both GHES nodes   │     primary shards
            │   primary-shard ≠ pinned  │     to replica
            └───────────────────────────┘

Post-rewrite (CCR-based, matches GHES app-layer topology):

 ┌──────────────────┐            ┌──────────────────┐
 │   Primary Node   │            │   Replica Node   │
 │ ┌──────────────┐ │    CCR     │ ┌──────────────┐ │
 │ │ ES Cluster   │ │ (leader →  │ │ ES Cluster   │ │
 │ │ (single-node)│━┿━follower)━━┿▶│ (single-node)│ │
 │ └──────────────┘ │            │ └──────────────┘ │
 └──────────────────┘            └──────────────────┘

CCR replicates Lucene segments after leader-side durable persistence. Failover / index deletion / upgrades: GitHub-authored workflows around the CCR primitive.
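
The follow leg itself is a stock Elasticsearch API call. A rough sketch of the per-index request a replica-side workflow would issue — the path and body fields follow Elasticsearch's CCR follow API, while the remote-cluster alias "ghes-primary" and the index name are illustrative assumptions, not GHES internals:

```python
# Sketch: the CCR "follow" request issued on the replica-side cluster for
# one index. Path and body shape are Elasticsearch's CCR follow API; the
# remote-cluster alias "ghes-primary" is an illustrative assumption.
def ccr_follow_request(leader_index, remote_cluster="ghes-primary"):
    """Build (method, path, body) for creating a follower index on the replica."""
    return (
        "PUT",
        f"/{leader_index}/_ccr/follow",  # follower index mirrors the leader's name
        {
            "remote_cluster": remote_cluster,  # alias registered in the replica's cluster settings
            "leader_index": leader_index,
        },
    )
```

A workflow would send this request once per managed index, then poll follower stats before declaring the index replicated.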

Bootstrap pseudocode (from article)

function bootstrap_ccr(primary, replica):
  primary_indexes = list_indexes(primary)
  replica_indexes = list_indexes(replica)

  managed = filter(primary_indexes, is_managed_ghe_index)

  for index in managed:
    if index not in replica_indexes:
      ensure_follower_index(replica, leader=primary, index=index)
    else:
      ensure_following(replica, leader=primary, index=index)

  ensure_auto_follow_policy(
    replica,
    leader=primary,
    patterns=[managed_index_patterns],
    exclude=[system_index_patterns]
  )
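
The pseudocode above can be rendered as runnable logic by modeling the two index listings as plain inputs and returning the planned actions, which also makes the ordering visible. All names here are illustrative, not GHES internals:

```python
# Runnable rendering of the bootstrap pseudocode: compute the ordered list
# of CCR actions from the two clusters' index listings. Action names mirror
# the pseudocode; nothing here is a real GHES API.
def bootstrap_ccr_plan(primary_indexes, replica_indexes, is_managed):
    existing = set(replica_indexes)
    actions = []
    for index in filter(is_managed, primary_indexes):
        if index not in existing:
            actions.append(("ensure_follower_index", index))  # create follower from scratch
        else:
            actions.append(("ensure_following", index))       # exists on both; verify it follows
    # Installed last: auto-follow only covers indexes created after the policy.
    actions.append(("ensure_auto_follow_policy", "managed-*"))
    return actions
```

For example, with primary indexes ["issues", ".security", "code"], replica ["issues"], and a managed-filter that excludes dot-prefixed system indexes, the plan is: ensure_following for issues, ensure_follower_index for code, then the auto-follow policy.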

Systems / concepts / patterns extracted

  • systems/github-enterprise-server — the self-hosted distribution of GitHub; distinct from GHEC in that the customer operates the storage topology. 3.19.1 is the first release to ship CCR-mode HA.
  • systems/elasticsearch — the search substrate. Historically run as one cluster spanning both GHES nodes; the rewrite runs it as per-node single-node clusters joined by CCR.
  • systems/lucene — ES's underlying storage engine; CCR replicates at the Lucene-segment granularity (persisted immutable index segments), not at the request/document API layer.
  • concepts/cross-cluster-replication — the ES primitive (CCR): one-way leader→follower replication between clusters, with auto-follow for index-creation patterns.
  • concepts/primary-replica-topology-alignment — structural lesson: the replication topology of your storage layer should mirror the write-ownership topology of your application. The old GHES design's pain came from misalignment (ES rebalances freely, GHES app layer has strict primary/replica).
  • patterns/single-node-cluster-per-app-replica — deploy one single-node storage cluster per app-replica host, then link them with store-level replication, rather than running one storage cluster spanning multiple app replicas. Trades ES's internal rebalancing / horizontal-scale story for topology alignment and operational simplicity.
  • patterns/bootstrap-then-auto-follow — CCR's auto-follow policy only covers new indexes; pre-existing indexes need an imperative bootstrap step that enumerates current indexes and attaches followers, then installs the auto-follow policy for future creations. Generalizes beyond CCR to any policy-based replication/attachment system where the policy is new-only.
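
The second half of bootstrap-then-auto-follow — installing the policy for future index creations — can be sketched the same way. The path and field names follow Elasticsearch's auto-follow API; the policy name, cluster alias, and patterns are illustrative assumptions:

```python
# Sketch: installing the auto-follow policy (the "then" step of the
# pattern). Path and field names follow Elasticsearch's auto-follow API;
# the policy name and patterns are illustrative assumptions.
def auto_follow_policy_request(name, remote_cluster, patterns, exclude=()):
    """Build (method, path, body) for an auto-follow policy on the replica."""
    return (
        "PUT",
        f"/_ccr/auto_follow/{name}",
        {
            "remote_cluster": remote_cluster,
            "leader_index_patterns": list(patterns),           # new leader indexes matching these get followers
            "leader_index_exclusion_patterns": list(exclude),  # e.g. system indexes to skip
        },
    )
```

Because the policy only fires on index creation, it must come after the imperative bootstrap loop — installing it first would still leave every pre-existing index unreplicated.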

Operational numbers

  • GHES release shipping the feature: 3.19.1.
  • Rollout cadence: opt-in now via ghe-config app.elasticsearch.ccr true, default-on over "the next two years."
  • Migration duration: "may take some time depending on the size of your GitHub Enterprise Server instance" — not quantified; presumably scales with repo count, issue count, and code-index size.
  • Config surface: ghe-config app.elasticsearch.ccr true + config-apply or upgrade.

No QPS, latency, or storage-footprint numbers disclosed.

Caveats / out-of-scope

  • CCR isn't free on Elastic's side: classic CCR was a paid Elastic-Stack commercial feature. The post doesn't discuss licensing implications for GHES (whose customers already pay for a commercial GHES license), whether GHES bundles the CCR-capable Elasticsearch distribution, or how this interacts with any GHES open-source-component-distribution rules. (The post says GitHub Support will "set up your organization so that you can download the required license" — suggests a new license artifact is involved, details undisclosed.)
  • The post doesn't quantify the replication lag introduced by segment-level CCR vs the pre-rewrite in-cluster primary-shard replication. For search use cases (accept-a-few-seconds-of-stale) this is likely fine; not numerical.
  • Horizontal scaling story is unclear: single-node clusters can't shard an index across machines. For very large GHES installations (multi-TB search indexes), what replaces ES-cluster-level horizontal distribution? The post doesn't say. One plausible answer: the primary node's ES remains a single-node cluster that scales vertically, while operational HA is handled by CCR to the replica. Worth flagging as an open question.
  • Rollback story undisclosed: the config-apply / upgrade is described as one-way (consolidate to primary → break clustering → restart via CCR). No mention of downgrade / rollback if CCR mode causes a regression.
  • Bootstrap idempotency / re-run semantics not detailed: ensure_follower_index / ensure_following suggest idempotent, but behavior under partial-failure mid-bootstrap isn't spelled out.
  • This is a GHES-specific post (self-hosted; customer operates the HA pair). github.com itself runs a different, larger search infrastructure out of scope here.
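
On the bootstrap-idempotency caveat above: one way the ensure_* calls could be made safe to re-run is to derive each action from observed follower state, so a repeated bootstrap converges. A sketch under assumed state flags — the post does not specify this, and the flag names are not Elasticsearch's:

```python
# Sketch of how ensure_following could converge under re-runs after a
# partial failure: pick the single action implied by observed follower
# state. State flags here are illustrative assumptions.
def ensure_following_action(exists_on_replica, is_following):
    if not exists_on_replica:
        return "create_follower"  # never created, or bootstrap failed before creation
    if not is_following:
        return "resume_follow"    # follower exists but was paused or interrupted
    return "noop"                 # already converged; re-running is harmless
```

With this shape, re-running the whole bootstrap after a mid-loop crash repeats only no-ops for already-attached indexes and resumes where it stopped.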

Cross-refs

Raw

raw/github/2026-03-03-how-we-rebuilt-the-search-architecture-for-high-availability-4f98b255.md

Original: https://github.blog/engineering/architecture-optimization/how-we-rebuilt-the-search-architecture-for-high-availability-in-github-enterprise-server/
