GITHUB 2026-04-16 Tier 2

GitHub Engineering — How GitHub uses eBPF to improve deployment safety

Summary

GitHub hosts its own source code on github.com (they are themselves their largest customer), so any new host-based deployment system has a built-in circular dependency: if github.com is down, the scripts that deploy fixes may not be able to run because they implicitly talk back to github.com. Historically GitHub mitigated this with a read-only code mirror and prebuilt rollback assets, but deploy scripts themselves could still reintroduce the bug — pulling a release, calling an internal service that pulls a release, or running a tool that background-checks for updates.

GitHub's new host-based deploy system uses eBPF to selectively block github.com traffic from the deploy-script process only, leaving customer-traffic-serving processes on the same host unaffected. The architecture places the deploy script in a dedicated Linux cGroup and attaches two eBPF program types to it (BPF_PROG_TYPE_CGROUP_SKB for egress packet control, BPF_PROG_TYPE_CGROUP_SOCK_ADDR for socket-connect rewrite). DNS traffic is redirected via the connect4 hook to a userspace DNS proxy that evaluates a hostname blocklist, and blocked queries are correlated back to the originating process via a DNS transaction-ID → PID eBPF map.

Rolled out over six months; live and catching new circular dependencies pre-incident.

Key takeaways

  1. Dogfooding + host-stateful services = baseline circular-dependency surface. GitHub deploys itself on GitHub; stateful hosts serve customer traffic during rolling deploys, drains, restarts. You can't just block github.com at the host level, because legitimate customer-traffic code paths on the same host need it. The filter has to be process-scoped, not host-scoped.

  2. Three types of circular dependency named and catalogued. Direct (deploy script pulls tool release from GitHub), hidden (tool already on disk but calls home for updates), transient (deploy script calls internal service X, which then pulls from GitHub). The taxonomy matters because each class needs different detection: tool-inventory audits find hidden dependencies, dependency-graph walks find transient ones, and static-source review only catches direct ones.

  3. Structural fix replaces tribal knowledge. Pre-eBPF, "the onus has been on every team who owns stateful hosts to review their deployment scripts and identify circular dependencies. In practice, however, many dependencies aren't identified until an incident occurs, which can delay recovery." Audit-at-review scaled with team count; a structural enforcement primitive was needed. (Source: sources/2026-04-16-github-ebpf-deployment-safety)

  4. Linux cGroups are the process-isolation primitive, not Docker. A cGroup is a Linux kernel primitive enforcing resource limits + isolation for sets of processes; Docker uses it heavily but is not required. GitHub creates a cGroup, places only the deploy script inside, and scopes the eBPF firewall to that cGroup. Bonus: the same cGroup enforces CPU + memory limits, preventing a runaway deploy script from starving customer workloads.

  5. Two eBPF program types compose into a DNS-name-aware firewall. BPF_PROG_TYPE_CGROUP_SKB hooks cGroup egress at the packet layer — it can filter/drop but only knows IPs. BPF_PROG_TYPE_CGROUP_SOCK_ADDR hooks the connect() syscall for IPv4 (the connect4 hook point) and can rewrite the destination IP+port — GitHub redirects every deploy-script DNS query (UDP/53) to 127.0.0.1:<dns_proxy_port>. The userspace proxy evaluates the queried hostname against a blocklist; the CGROUP_SKB program uses that decision (communicated via eBPF map) to allow or drop the eventual connection. Hostnames, not IPs, are the load-bearing pivot — "given the breadth of GitHub's systems and rate of change, keeping an up-to-date block IP list would be very hard."

  6. DNS-transaction-ID → PID correlation gives per-blocked-request attribution. Inside the CGROUP_SKB program, GitHub pulls the DNS transaction ID from the __sk_buff + the PID from bpf_get_current_pid_tgid(), stores the pair in an eBPF map. When the userspace proxy blocks a query, it looks up the PID, reads /proc/<pid>/cmdline, and logs "WARN DNS BLOCKED … domain=github.com. pid=266767 cmd=\"curl github.com\" firewallMethod=blocklist". The owning team sees which command in their deploy script triggered the block, not just that something did — an audit-log primitive for the whole deployment. Also yields an audit list of all domains contacted during a deployment as a side effect.

  7. Proof-of-concept → production shaped by cilium/ebpf. GitHub built the PoC in Go using the cilium/ebpf library: //go:generate go tool bpf2go auto-compiles the C eBPF code and generates the Go struct bindings, so shipping reduces to go build. A sample PoC is published at github.com/lawrencegripper/ebpf-cgroup-firewall — the production implementation has progressed further, but the PoC shows the core loop. Wire-up is via link.AttachCgroup with the AttachCGroupInet4Connect / AttachCGroupInetEgress attachment types.

  8. Six-month rollout; live and generating new findings. "Our new circular dependency detection process is live after a six-month rollout. Now, if a team accidentally adds a problematic dependency, or if an existing binary tool we use takes a new dependency, the tooling will detect that problem and flag it to the team." Cast as ongoing — GitHub expects to keep finding dependency classes the filter doesn't catch. Staged rollout discipline is implicit; not called out as a separate subsystem like the 2025 Issues-search rewrite's three-layer validation.

Architecture

Shape

  Deploy script (PID 266767)
        │ placed in dedicated cGroup at /sys/fs/cgroup/…
  ┌─────────────────────────────────────────────────────────┐
  │ cGroup                                                  │
  │                                                         │
  │   write(DNS query to :53) ──────┬──► CGROUP_SOCK_ADDR ──┼──► rewrite :53 → 127.0.0.1:<dns_proxy_port>
  │                                 │                       │
  │   connect(github.com:443) ──────┼──► CGROUP_SKB ────────┼──► allow/drop based on eBPF map
  │                                 │                       │
  └─────────────────────────────────┼───────────────────────┘
                         ┌──────────▼──────────┐
                         │ Userspace DNS proxy │
                         │ - reads blocklist   │
                         │ - evaluates domain  │
                         │ - updates eBPF maps │
                         │ - logs block + cmd  │
                         └─────────────────────┘

Control plane (userspace): blocklist + proxy, compiles policy into eBPF maps. Data plane (kernel): CGROUP_SKB + CGROUP_SOCK_ADDR eBPF programs consult maps per event. A canonical CP/DP split, here drawn across the kernel/userspace boundary of a single eBPF-based system.

Concrete Go + C sketches from the post

Go side — attach a CGROUP_SKB egress counter to a cGroup:

// Uses github.com/cilium/ebpf and github.com/cilium/ebpf/link; objs is the
// bpf2go-generated struct holding the loaded programs.
l, err := link.AttachCgroup(link.CgroupOptions{
    Path:    "/sys/fs/cgroup/system.slice", // cGroup to scope the hook to
    Attach:  ebpf.AttachCGroupInetEgress,   // CGROUP_SKB egress attach type
    Program: objs.CountEgressPackets,
})
if err != nil {
    log.Fatalf("attach CGROUP_SKB egress program: %v", err)
}
defer l.Close() // detach on exit

eBPF C side — connect4 rewrite sending all port-53 traffic to localhost (real impl uses a parameterised proxy port):

SEC("cgroup/connect4")
int connect4(struct bpf_sock_addr *ctx) {
  // Runs on every IPv4 connect() from the cGroup. Only DNS (port 53)
  // is rewritten; all other destinations pass through untouched.
  if (ctx->user_port == bpf_htons(53)) {
    ctx->user_ip4 = const_mitm_proxy_address;         // 127.0.0.1
    ctx->user_port = bpf_htons(const_dns_proxy_port); // userspace DNS proxy
  }
  return 1; // 1 = allow the (possibly rewritten) connect to proceed
}

eBPF C side — DNS TXID → PID correlation in CGROUP_SKB:

// Upper 32 bits of the helper's return value are the TGID (userspace PID).
__u32 pid = bpf_get_current_pid_tgid() >> 32;
// The DNS header starts right after the IPv4 and UDP headers.
__u16 skb_read_offset = sizeof(struct iphdr) + sizeof(struct udphdr);
__u16 dns_transaction_id =
    get_transaction_id_from_dns_header(skb, skb_read_offset);
if (pid && dns_transaction_id != 0) {
  // Key: DNS transaction ID; value: PID (both passed by pointer).
  bpf_map_update_elem(&dns_transaction_id_to_pid, &dns_transaction_id,
                      &pid, BPF_ANY);
}

Systems / concepts / patterns surfaced

  • systems/ebpf — GitHub is a new named production user on the cgroup-attached security-policy axis (distinct from systems/datadog-workload-protection's syscall-hook observability axis); introduces the two cGroup eBPF program types (CGROUP_SKB and CGROUP_SOCK_ADDR) to the wiki.
  • systems/github — new workload: deployment-safety firewall in the host-based deploy system.
  • systems/cilium — referenced tangentially (cilium/ebpf Go library is from the Cilium project) though the production system here is GitHub-built, not on top of Cilium CNI.
  • concepts/circular-dependency — new concept page; the GitHub post is the canonical source for the three-class taxonomy (direct / hidden / transient) in the deployment context.
  • concepts/linux-cgroup — new concept page; Linux cGroup as a first-class isolation primitive independent of Docker.
  • concepts/in-kernel-filtering — new production instance: the filter is cGroup-scoped rather than host-wide, which is the refinement that unlocks the pattern for dogfooded multi-tenant hosts.
  • concepts/ebpf-verifier — implicit substrate (verified at load time), not called out in this post.
  • concepts/control-plane-data-plane-separation — the userspace DNS proxy (CP: reads blocklist config) + eBPF maps + eBPF programs (DP: per-event enforcement) is a textbook instance.
  • concepts/egress-sni-filtering — sibling approach; GitHub picks DNS-hostname filtering rather than TLS-SNI filtering because blocking at DNS gives per-process attribution via TXID↔PID and doesn't require positioning a middlebox in the TLS path. Different point in the hostname-filtering design space than the Generali AWS Network Firewall setup.
  • patterns/cgroup-scoped-egress-firewall — new pattern page; GitHub's canonical shape for per-process conditional network filtering via eBPF cGroup programs.
  • patterns/dns-proxy-for-hostname-filtering — new pattern page; CGROUP_SOCK_ADDR syscall rewrite + userspace proxy + TXID↔PID map.
  • patterns/two-stage-evaluation — arguably a third instance: kernel-side packet path (cheap, per-packet drop decision) + userspace DNS proxy (rich, per-query hostname evaluation against a mutable blocklist). Similar shape to Datadog FIM's approver/discarder + Agent rule engine.
  • patterns/staged-rollout — six-month rollout explicitly referenced, mechanics not detailed.

Operational numbers

  • Rollout duration: 6 months from design to production.
  • Block event log format: WARN DNS BLOCKED reason=FromDNSRequest blocked=true blockedAt=dns domain=github.com. pid=266767 cmd="curl github.com" firewallMethod=blocklist
  • No fleet scale / throughput / false-positive-rate / block-rate numbers disclosed. Qualitative claims only — "a more stable GitHub and faster mean time to recovery during incidents."

Caveats / missing detail

  • Production implementation differs from the PoC; only the PoC is published open-source. Production block-list storage, update cadence, policy-authoring surface, emergency bypass mechanism (if any), and integration with deployment telemetry are all unspecified.
  • No explicit CI / verifier-matrix discussion — unlike Datadog's Workload Protection retrospective, GitHub doesn't catalogue kernel-version portability operational costs. May be simpler because the deploy-system fleet's kernel range is tighter.
  • No mention of a staged-rollout validation harness like patterns/dark-ship-for-behavior-parity; the six-month duration suggests staged-rollout discipline, but its shape (cohort percentages, shadow mode, observation window) isn't published.
  • TLS-session egress post-DNS isn't explicitly covered — the post focuses on DNS blocking. A compromised deploy script that already has an IP can attempt to connect without DNS lookup; the CGROUP_SKB + eBPF-maps primitive supports IP-level drop, but how dynamic / SNI-aware policy is enforced isn't detailed. Compare with concepts/egress-sni-filtering's TLS-layer approach.
  • Tangled-with-userspace-DNS-proxy blast radius: the proxy is itself a dependency on the deploy path. Its own failure mode relative to deploy-script behaviour (fail-open vs fail-closed, kill-switch, circuit-breaker) isn't discussed.
  • No quantification of the circular-dependency backlog found: "if a team accidentally adds a problematic dependency..." Numbers on how many circular dependencies the tool has caught pre-incident would make the impact case concrete; not shared.

Source
