GITHUB 2026-01-15

When protections outlive their purpose — a lesson on managing defense systems at scale

Summary

GitHub Engineering's public post-mortem on a quiet but sustained class of false positives: legitimate logged-out users hitting "Too many requests" errors during normal browsing. The root cause was a set of protection rules added during past abuse incidents and left in place after the threat pattern evolved; emergency mitigations had accumulated into permanent technical debt. The rules combined composite signals (industry-standard fingerprinting fused with platform-specific business logic) to distinguish abuse from legitimate traffic: 0.5–0.9 % of fingerprint-matching requests also matched the business-logic rules and were blocked 100 % of the time, producing a fleet-wide false-positive rate of ~0.003–0.004 % of total traffic (3–4 requests per 100 K). The investigation had to trace blocked requests across multiple infrastructure layers (edge → application → service → backend), each with a different log schema, before identifying which layer actually made the block decision.

Remediation: audit and remove stale mitigations, keep the ones that still match ongoing threats, and commit structurally to lifecycle management: treat incident mitigations as temporary by default, make permanence require an intentional decision, and evaluate emergency controls after incidents on a recurring cadence. The post gives no numbers for the remediation itself (how many rules were audited, how many were removed, the pre/post false-positive rate); it is a public acknowledgement plus a structural commitment, not a quantified-outcome retrospective.

Key takeaways

  • The failure class is not the original mitigation — it's the absence of a sunset. "Protection rules added during past abuse incidents had been left in place. These rules were based on patterns that had been strongly associated with abusive traffic when they were created. The problem is that those same patterns were also matching some logged-out requests from legitimate clients." Every individual mitigation was correct when added; the structural gap is that none of them had an expiration date, owner, or post-incident review trigger. (Source: article § What we found)

  • Composite signals reduce false positives but don't eliminate them. The blocks combined industry-standard fingerprinting with GitHub-specific business-logic rules. "Among requests that matched the suspicious fingerprints, only about 0.5–0.9 % were actually blocked; specifically, those that also triggered the business-logic rules. Requests that matched both criteria were blocked 100 % of the time." The composite design is why the FP rate is small (≈3–4 per 100 K of total traffic, not per 100 K fingerprint matches). It does not make FP management optional, though: at GitHub scale even rare FPs mean real users following a bookmarked URL hit a block during ordinary browsing. (Source: article § What we found + charts 2, 3, 4)

  • Emergency-response-time decisions are correct at the moment of the incident and wrong later. "During active incidents, you need to respond quickly, and you accept some tradeoffs to keep the service available. The mitigations are correct and necessary at that moment. Those emergency controls don't age well as threat patterns evolve and legitimate tools and usage change." The frame is explicitly time-bounded decision quality — a mitigation that was net-positive on day one becomes net-negative after the threat evolves and the legitimate-traffic population shifts to overlap with the fingerprint. (Source: article § introduction + § lifecycle diagram)

  • Multi-layer protection architecture is a prerequisite for defense but a cost-multiplier for investigation. GitHub's stack layers defenses at edge, application, service, backend — each with protection mechanisms (DDoS protection, rate limits, authentication, access controls) and different log schemas. Any of these layers can be the one that blocks a specific request; tracing which layer made the call requires correlating logs across all of them. "Maintaining comprehensive visibility into what's actually blocking requests and where is essential." Observability of the defense layers is as load-bearing as observability of the features. (Source: article § Tracing through the stack + infrastructure diagram)

  • The investigation workflow, not the rule audit, is the canonical contribution. The public investigation path was: user reports (external-social-media timestamps + behaviour patterns) → edge-tier logs (confirm request reached infrastructure) → application-tier logs (find the 429) → protection rule analysis (identify which rule matched). Each step narrows from "symptom" to "cause" across a different schema. Without the top-to-bottom correlation, every 429 looks identical to every other 429. (Source: article § Tracing through the stack)

  • Structural remediation is observability + default-temporary + post-incident-practice, not a one-time rule cleanup. GitHub's stated three-workstream remediation: (1) Better visibility across all protection layers to trace the source of rate limits and blocks (concepts/observability applied to the defense surface); (2) Treating incident mitigations as temporary by default. Making them permanent should require an intentional, documented decision (patterns/expiring-incident-mitigation); (3) Post-incident practices that evaluate emergency controls and evolve them into sustainable, targeted solutions (a recurring-cadence audit gate, not a one-shot cleanup). (Source: article § What we're building)

  • HAProxy is disclosed as a foundational layer of GitHub's custom protection infrastructure. "We've built a custom, multi-layered protection infrastructure tailored to GitHub's unique operational requirements and scale, building upon the flexibility and extensibility of open-source projects like HAProxy." Fits the GitHub pattern of building proprietary operational surfaces on open-source substrates (cf. custom pack construction on Git, Scientist on Rails). Specific HAProxy role/version/extension shape is deliberately undisclosed to avoid telegraphing defense mechanisms. (Source: article § Tracing through the stack)

  • Public apology + no quantified outcome is a deliberate posture choice. "We apologize for the disruption. We should have caught and removed these protections sooner." The post doesn't disclose how many rules were audited, how many were removed, what the pre/post FP rate looked like, how many affected users were contacted, or whether the rollout of lifecycle-management tooling is done or in-progress. The shape is public acknowledgement + structural commitment, not a quantified retrospective. Consistent with Cloudflare's 2025-07-14 / 2025-11-18 / 2025-12-05 incident framing: name the missing discipline, commit to structural fixes, defer numbers to the work landing. (Source: article § What we did + § What we're building)

Architectural shape disclosed

The four protection-layer stack

GitHub layers defenses at four ordered tiers (the post ships a simplified diagram that "avoids disclosing specific defense mechanisms and keeps concepts broadly applicable"):

| Layer       | Role                                            | Example mechanisms                    |
|-------------|-------------------------------------------------|---------------------------------------|
| Edge        | First-touch absorb + coarse filtering           | DDoS protection, IP-level blocks      |
| Application | Session- and feature-level rate limiting        | 429 responses, auth-aware rate limits |
| Service     | Per-service quotas + business-logic protections | Composite-signal abuse rules          |
| Backend     | Data-layer controls + access checks             | Access controls, tenant isolation     |

Each tier has legitimate reasons to block a given request. During incidents, mitigations are added "at any of these layers depending on where the abuse is best mitigated and what controls are fastest to deploy" — so investigation has to be prepared to find the block at any tier.
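
The cross-tier search this implies can be sketched as follows. Everything in the sketch is hypothetical: the field names (`req_id`, `request_id`, `trace`, `rid`), the per-tier log schemas and the `find_blocking_layer` helper are illustrative assumptions, not GitHub's actual log formats. The idea is only that each tier's records are normalized to a common (request ID, blocked?) pair and then searched in request order.

```python
from typing import Optional

# Request order through the stack, as described in the post.
LAYER_ORDER = ["edge", "application", "service", "backend"]

# Hypothetical per-layer schemas: each tier names its fields differently,
# so each gets its own normalizer returning (request_id, was_blocked).
NORMALIZERS = {
    "edge": lambda r: (r["req_id"], r["action"] == "deny"),
    "application": lambda r: (r["request_id"], r["status"] == 429),
    "service": lambda r: (r["trace"], r.get("rule_hit") is not None),
    "backend": lambda r: (r["rid"], not r["allowed"]),
}


def find_blocking_layer(request_id: str, logs: dict) -> Optional[str]:
    """Return the first layer (in request order) that blocked the request."""
    for layer in LAYER_ORDER:
        normalize = NORMALIZERS[layer]
        for record in logs.get(layer, []):
            rid, blocked = normalize(record)
            if rid == request_id and blocked:
                return layer
    return None  # no tier recorded a block for this request


# Illustrative log records only; not real data.
sample_logs = {
    "edge": [{"req_id": "abc", "action": "allow"}],
    "application": [{"request_id": "abc", "status": 429}],
    "service": [],
    "backend": [],
}
blocking_layer = find_blocking_layer("abc", sample_logs)
print(blocking_layer)  # application
```

A real investigation would also need a request identifier that survives every hop, which this sketch simply assumes exists.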

The composite-signal filter

The Service-layer rules that misfired combine two independent detection inputs:

  • Fingerprint match — industry-standard techniques (TLS fingerprints, request-shape patterns, header combinations) that identify a class of client.
  • Business-logic match — GitHub-specific rules about what a client of that class is doing — request paths, auth state, action sequences, etc.

A request is blocked only if both match. This is what keeps the FP rate low in aggregate (0.003–0.004 % of all traffic); the conditional FP rate within fingerprint-matched traffic is 0.5–0.9 %, and within both-matched traffic is 100 %. See concepts/composite-fingerprint-signal for the general shape.
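
A minimal sketch of the both-must-match shape described above. The post does not disclose the actual signals or how they are combined, so the `Request` fields, the fingerprint set and the path rule below are all invented for illustration; only the AND structure comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    fingerprint: str  # stand-in for a TLS/header-derived client fingerprint
    path: str
    logged_in: bool


# Hypothetical values, for illustration only.
SUSPICIOUS_FINGERPRINTS = {"fp-123"}


def fingerprint_match(req: Request) -> bool:
    """First input: does the client class look like past abusive clients?"""
    return req.fingerprint in SUSPICIOUS_FINGERPRINTS


def business_logic_match(req: Request) -> bool:
    """Second input (hypothetical rule): logged-out access to one path prefix."""
    return (not req.logged_in) and req.path.startswith("/search")


def should_block(req: Request) -> bool:
    """Block only when both independent inputs match."""
    return fingerprint_match(req) and business_logic_match(req)


both_match = Request("fp-123", "/search?q=x", logged_in=False)
fingerprint_only = Request("fp-123", "/search?q=x", logged_in=True)
print(should_block(both_match), should_block(fingerprint_only))  # True False
```

The failure mode in the post corresponds to a legitimate logged-out client that happens to satisfy both predicates: the rule cannot tell it apart from the abusive traffic it was written for.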

The mitigation lifecycle

From the post's own lifecycle diagram: "control added during incident → works initially → remains active over time without review → eventually blocks legitimate traffic." The four stages unfold over different time spans and tend to be noticed by different teams. Without an expiration or review gate, the only natural end state is "eventually blocks legitimate traffic", at which point the stale rule is discovered the way it was here: through reports from affected users.
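
One way to give this lifecycle an explicit exit is to attach an expiry to every mitigation and require a recorded reason to keep it. The sketch below is an assumption about how that could look, not a description of GitHub's tooling: the `Mitigation` class, the 14-day default and the rule name are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Mitigation:
    name: str
    created_at: datetime
    ttl: timedelta = timedelta(days=14)     # hypothetical default lifetime
    permanent_reason: Optional[str] = None  # set only by an explicit decision

    def make_permanent(self, reason: str) -> None:
        """Permanence requires a documented, non-empty justification."""
        if not reason.strip():
            raise ValueError("permanence requires a documented reason")
        self.permanent_reason = reason

    def is_active(self, now: datetime) -> bool:
        """Temporary by default: the rule lapses once its TTL has passed."""
        if self.permanent_reason is not None:
            return True
        return now < self.created_at + self.ttl


created = datetime(2025, 1, 1, tzinfo=timezone.utc)
rule = Mitigation("block-fp-123", created)
print(rule.is_active(created + timedelta(days=1)))   # True
print(rule.is_active(created + timedelta(days=30)))  # False
```

With this shape, doing nothing removes the rule; keeping it takes a deliberate call to `make_permanent`, which is the inversion of defaults the post commits to.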

Operational numbers disclosed

  • False-positive rate within fingerprint-matching traffic: 0.5–0.9 % (fluctuating over a 60-minute sample).
  • False-positive rate within both-matching traffic: 100 %.
  • False-positive rate across total traffic: ~0.003–0.004 % (3–4 requests per 100 000).
  • Sample window for the published charts: 60 minutes immediately before the cleanup.
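
The disclosed figures can be cross-checked with a little arithmetic. The per-100 K conversion follows directly from the numbers above. The implied share of total traffic that matched the fingerprints is a derived estimate, not a figure from the post, and it assumes the two rates come from the same window and the same class of rules.

```python
def per_100k(rate_percent: float) -> float:
    """Convert a percentage of total traffic into requests per 100,000."""
    return rate_percent / 100 * 100_000


def implied_fingerprint_share(total_pct: float, conditional_pct: float) -> float:
    """Percent of total traffic matching the fingerprints, derived as
    total block rate divided by block rate among fingerprint matches."""
    return total_pct / conditional_pct * 100


# Disclosed: 0.003-0.004 % of total traffic; 0.5-0.9 % of fingerprint matches.
low_per_100k = per_100k(0.003)
high_per_100k = per_100k(0.004)

# Derived bounds (assumption-laden): pair the extremes of each range.
share_low = implied_fingerprint_share(0.003, 0.9)
share_high = implied_fingerprint_share(0.004, 0.5)

print(round(low_per_100k, 2), round(high_per_100k, 2))
print(round(share_low, 2), round(share_high, 2))
```

Under those assumptions, roughly a third of a percent to just under one percent of all requests carried a suspicious fingerprint, which is why the business-logic condition did most of the filtering.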

Operational numbers not disclosed

  • Number of protection rules audited.
  • Number of rules removed vs kept.
  • Post-cleanup FP rate (the published 0.003–0.004 % figure comes from the 60-minute window before the cleanup; no after figure is given).
  • Duration the stale rules had been in place before the audit.
  • Number of distinct incidents the stale rules were originally added in response to.
  • User-impact numbers — how many unique users hit the block? Any repeat-affected users?
  • Timeline: when did social-media reports start? When did the audit begin? When was the cleanup complete? (The article was published 2026-01-15; the timing of the cleanup relative to publication is unstated.)
  • Rollout status of the new lifecycle-management tooling ("treating incident mitigations as temporary by default" — is the mechanism built, rolled out, or aspirational?).
  • Whether the three workstreams (observability, default-temporary, post-incident practice) are sequenced, parallel, or on distinct timelines.
  • HAProxy role specifics — the version, the custom Lua / Go / C modules, whether the composite rules live in HAProxy configuration or an upstream layer.

Caveats

  • Defense-in-depth post-mortem voice, not an incident timeline. There is no single detonation event, no minute-by-minute timeline, no pager-flooded-alert moment — the mode is chronic low-level user friction surfaced via social media and cleaned up deliberately.
  • Deliberate omission of mechanism detail. The layered diagram is simplified, the composite signals are described in the abstract, HAProxy's role is named but not expanded — all to avoid giving abusive actors information about the defense surface. This limits how much architectural content a third party can extract.
  • Asymmetric evidence: the article shows four charts of FP rate over 60 minutes before cleanup, but no charts of the post-cleanup state. The improvement is asserted but not quantified.
  • "Composite signals occasionally produce false positives" is named but the composite-signal design is not — how many signals are combined, what operators combine them, whether AND/OR/weighted, what revision cadence the rules have.
  • The lifecycle diagram is a four-step cartoon, not a formal state machine — no timer semantics, no escalation path, no integration point with the internal change-management system.
  • Structural commitments with unspecified rollout discipline: the three workstreams are named at the paragraph level. The post does not commit to a date, disclose whether they block on internal tooling work, or say which teams own them.
  • No disclosed feedback channel for currently-affected users. The post thanks social-media reporters but doesn't publish an "if you're still hitting this, here's what to do" contact point, so users who re-encounter FPs after publication have no stated path.
  • The fingerprint + business-logic composite is not uniquely GitHub-shaped. The same structure is common in bot-management systems (see concepts/fingerprinting-vector, systems/cloudflare-bot-management) — this post's contribution is the lifecycle framing, not the detection design.
