GRAFANA

Post-incident review for TanStack npm supply chain ransom incident

Summary

Grafana Labs publishes a detailed post-incident review of a supply chain compromise originating from the TanStack npm "Mini Shai-Hulud" campaign on May 11, 2026. The attack exploited self-hosted GitHub Actions runners to leak credentials, which were then used to clone Grafana's entire repository collection (including private repos). The threat actor demanded ransom; Grafana refused to pay. The article documents the full incident timeline, remediation scope (1,500 security PRs, 280 GitHub app audits, 1,200 repo scans, 2,300 PR reviews in one critical repo), and the architectural hardening that followed — most notably the deployment of a token broker for short-lived, finely-scoped credentials, the compartmentalization of GitHub organizations, and the retirement of direct DockerHub pushes in favor of Google Cloud Artifact Registry.

Key takeaways

  1. Self-hosted runners as credential-leak vector. The initial compromise vector was malicious code from the TanStack supply chain attack (the "Mini Shai-Hulud" campaign) executing on Grafana's self-hosted GitHub Actions runners. Self-hosted runners have access to long-lived credentials that GitHub-hosted runners do not, which makes them a higher-value target.

  2. Incomplete credential rotation creates residual risk. Grafana believed it had rotated all compromised credentials on May 11 but missed one. Per the timeline, the overlooked credential was first used three days later (May 14) and enabled the clone of the entire repository collection; the security team was not alerted until May 16. The lesson: exhaustive credential inventorying is harder than rotation mechanics.

  3. Open source reduces ransom leverage. Because Grafana Labs is predominantly open-source, the exfiltrated code's ransom value was limited to private repos (internal tools, Grafana Cloud features). They followed FBI guidance and did not pay.

  4. Global code freeze as containment. Grafana froze all non-critical code and deployment changes on May 18. The freeze lasted about 8 days, until thawing began on May 26. Each repo could thaw only after being fully reviewed AND transitioned to the token-broker credential model.

  5. Token broker for short-lived, scoped credentials (primary architectural remediation). The central hardening measure: repositories must use a GitHub application token broker that issues short-lived, finely-scoped credentials per operation, replacing long-lived secrets stored in CI configuration. This follows the same pattern as short-lived OIDC credentials in CI (patterns/short-lived-oidc-credentials-in-ci), but is implemented via a broker service rather than direct OIDC federation.

  6. Scale of audit. Post-incident verification touched: 1,500 security-focused PR reviews, 280 GitHub applications audited (permissions stripped, several removed), 1,200 repositories scanned for tampering, 2,300 PR reviews in a single critical repo, infrastructure audits, legacy system retirements.

  7. Compartmentalized GitHub organizations. Post-incident, Grafana began compartmentalizing GitHub organizations and isolating all archived repos into a dedicated organization with Actions disabled — reducing the blast radius of any single credential compromise.

  8. Registry migration: DockerHub → Google Cloud Artifact Registry. On May 27, Grafana transitioned from repos pushing images directly to DockerHub to pushing to Google Cloud Artifact Registry — eliminating a trust boundary (DockerHub credentials) and gaining tighter access control within their cloud environment.

  9. Independent audit via Mandiant. Grafana engaged Mandiant for an independent investigation starting June 1. Mandiant confirmed "no evidence of code tampering or repository poisoning within public organizations or production repositories delivered to end users."

  10. No customer/production impact confirmed. Despite full repo exfiltration, the attack was limited to the GitHub environment. No unauthorized access to customer production systems; Grafana Cloud platform unaffected; codebase downloaded but not altered.
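Takeaway 2 turns on inventory rather than rotation mechanics: one credential reachable from the runners was never rotated. A minimal sketch of that check follows, assuming a hypothetical inventory in which each credential records whether the compromised runners could read it and when it was last rotated. The `Credential` type, the `unrotated_after` helper, and every credential name are illustrative; none of them come from the review.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Credential:
    name: str                         # hypothetical identifier
    reachable_from_runner: bool       # could the compromised runner read it?
    last_rotated: Optional[datetime]  # None means never rotated


def unrotated_after(inventory: List[Credential], exposure: datetime) -> List[Credential]:
    """Return exposed credentials whose last rotation predates the exposure."""
    return [
        c for c in inventory
        if c.reachable_from_runner
        and (c.last_rotated is None or c.last_rotated <= exposure)
    ]


# First malicious execution on the runners, per the timeline (UTC).
EXPOSURE = datetime(2026, 5, 11, 19, 21, tzinfo=timezone.utc)

# Hypothetical inventory; these names are placeholders, not Grafana's credentials.
inventory = [
    Credential("ci-deploy-key", True, datetime(2026, 5, 11, 22, 0, tzinfo=timezone.utc)),
    Credential("bot-app-private-key", True, datetime(2026, 3, 1, 0, 0, tzinfo=timezone.utc)),
    Credential("docs-site-token", False, None),
    Credential("legacy-registry-password", True, None),
]

stale_names = [c.name for c in unrotated_after(inventory, EXPOSURE)]
print(stale_names)
```

The check is only as good as the inventory: a credential that is missing from the list cannot be flagged, which is the failure mode the takeaway describes.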

Operational numbers

Metric | Value
Time from first exploit to detection | ~4.5 days (May 11 19:21 → May 16 08:30)
Time from detection to incident declaration | ~9 hours (08:30 → 17:39 May 16)
Time to suspend all GitHub apps | ~1.5 hours (19:33 → 21:10 May 16)
Code freeze duration | ~8 days (May 18 → May 26)
Security hardening week | May 25 → June 2
Security PRs reviewed | 1,500
GitHub applications audited | 280
Repositories scanned for tampering | 1,200
PR reviews in single critical repo | 2,300
External investigation (Mandiant) | June 1 → June 18

Incident timeline (UTC)

  • May 11 19:21 — First malicious code executed on self-hosted runners (Mini Shai-Hulud). Credentials leaked, then rotated, but one was missed.
  • May 14 07:21 — First malicious commit via grafana-delivery-bot using leaked credential.
  • May 14 13:28 — Data exfiltration begins.
  • May 15 20:57 — Extortion demand published.
  • May 16 08:30 — Security team alerted. Begins confirmation.
  • May 16 17:39 — Compromise confirmed; incident declared.
  • May 16 19:33 — All known affected credentials/apps suspended. Full rotation begins.
  • May 16 21:10 — All GitHub applications suspended.
  • May 17 16:40 — All threat-actor code changes identified and reverted.
  • May 17 16:52 — Root cause and attack chain identified.
  • May 17 23:23 — Last potentially accessible credential confirmed rotated.
  • May 18 03:08 — Global code freeze begins.
  • May 25 — All-engineering security hardening week commences.
  • May 26 10:58 — Commit review complete; thawing begins (per-repo, gated on token-broker transition).
  • May 27 10:54 — DockerHub → Google Cloud Artifact Registry migration.
  • May 27 — Internal investigation complete.
  • June 2 — Security hardening week concludes.
  • June 3 20:43 — Data loss repository review completed.
  • June 18 — Mandiant investigation complete, corroborating internal findings.
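The intervals between these events can be recomputed directly from the timestamps. The sketch below does that arithmetic with the standard library, taking the year 2026 from the summary; the variable names and the `utc` and `fmt` helpers are illustrative.

```python
from datetime import datetime, timedelta, timezone


def utc(month: int, day: int, hour: int, minute: int) -> datetime:
    """Build a 2026 UTC timestamp from a timeline entry."""
    return datetime(2026, month, day, hour, minute, tzinfo=timezone.utc)


def fmt(delta: timedelta) -> str:
    """Render a timedelta as days, hours, and minutes."""
    total_minutes = int(delta.total_seconds() // 60)
    days, rem = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(rem, 60)
    return f"{days}d {hours}h {minutes:02d}m"


first_exploit = utc(5, 11, 19, 21)
alerted = utc(5, 16, 8, 30)
declared = utc(5, 16, 17, 39)
creds_suspended = utc(5, 16, 19, 33)
apps_suspended = utc(5, 16, 21, 10)
freeze_start = utc(5, 18, 3, 8)
thaw_start = utc(5, 26, 10, 58)

durations = {
    "first exploit -> alert": fmt(alerted - first_exploit),
    "alert -> incident declared": fmt(declared - alerted),
    "credentials suspended -> all apps suspended": fmt(apps_suspended - creds_suspended),
    "code freeze": fmt(thaw_start - freeze_start),
}

for label, value in durations.items():
    print(f"{label}: {value}")
```

This gives roughly 4.5 days from first exploit to alert, just over 9 hours from alert to declaration, about 1.5 hours to suspend all apps once the first suspensions landed, and a freeze of a little over 8 days.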

Architectural hardening (post-incident)

  1. Token broker — centralized service issuing short-term, finely-scoped GitHub App credentials per-job. Replaces static secrets in CI.
  2. Fine-grained access controls — principle of least privilege across GitHub app permissions (280 apps audited, several removed).
  3. Compartmentalized organizations — isolating archived repos into a dedicated GitHub org with Actions disabled.
  4. Registry migration — DockerHub → Google Cloud Artifact Registry for container image pushes.
  5. Tightly scoped GitHub Actions — retired certain Actions; remaining ones use short-lived tokens.
  6. Static analysis gates — additional alerting and static analysis in CI pipeline.
  7. Legacy system retirement — infrastructure audits resulted in decommissioning legacy systems.
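The review gives no implementation details for the token broker, so the following is only a sketch of the general idea under stated assumptions: a broker checks a job's request against a per-workflow policy and, if allowed, prepares a call to GitHub's "create an installation access token" endpoint narrowed to one repository and the requested permissions. The `POLICY` table, the `build_token_request` function, the repository and workflow names, and the installation ID are all hypothetical. The request is built but not sent, since the real call must be authenticated with a JWT signed by the GitHub App's private key.

```python
from typing import Dict

GITHUB_API = "https://api.github.com"

# Hypothetical policy: (repository, workflow) -> maximum permissions a job may hold.
POLICY: Dict[tuple, Dict[str, str]] = {
    ("example-repo", "release.yml"): {"contents": "write", "pull_requests": "read"},
    ("example-repo", "lint.yml"): {"contents": "read"},
}

LEVELS = {"read": 1, "write": 2}


class Denied(Exception):
    """Raised when a job asks for more than its policy allows."""


def build_token_request(installation_id: int, repo: str, workflow: str,
                        requested: Dict[str, str]) -> dict:
    """Validate a request and describe the token call scoped to one repo."""
    allowed = POLICY.get((repo, workflow))
    if allowed is None:
        raise Denied(f"no policy for {repo}/{workflow}")
    for perm, level in requested.items():
        if level not in LEVELS:
            raise Denied(f"unknown access level {level!r}")
        # A permission absent from the policy maps to level 0, so any request fails.
        if LEVELS[level] > LEVELS.get(allowed.get(perm, ""), 0):
            raise Denied(f"{perm}:{level} exceeds policy for {repo}/{workflow}")
    return {
        "method": "POST",
        "url": f"{GITHUB_API}/app/installations/{installation_id}/access_tokens",
        "json": {"repositories": [repo], "permissions": dict(requested)},
    }


req = build_token_request(12345, "example-repo", "lint.yml", {"contents": "read"})
print(req["url"])

try:
    build_token_request(12345, "example-repo", "lint.yml", {"contents": "write"})
    escalation_blocked = False
except Denied:
    escalation_blocked = True
print(escalation_blocked)
```

Installation access tokens issued this way expire after one hour, so a leaked token has a short useful life compared with a static secret stored in CI configuration.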

Caveats

  • No disclosure of which specific credential was missed in the initial rotation.
  • No details on the token broker implementation (architecture, latency overhead, failover).
  • No quantitative latency/throughput impact numbers from the hardening measures.
  • The "Mini Shai-Hulud" campaign details are referenced but not explained (covered in the earlier security update post).
  • No disclosure of self-hosted runner isolation architecture pre-incident.
