Figma: How We Migrated onto K8s in Less Than 12 Months
Summary
Figma migrated its core compute platform from AWS ECS on EC2 to AWS EKS (Kubernetes) in under 12 months (Q1 2023 plan → January 2024 majority cutover) with only minor incidents and limited customer impact. By early 2023 all services already ran in containers on ECS; the choice was not whether to containerize but whether ECS's ceiling could be lived with. It couldn't: workarounds for missing Kubernetes primitives (StatefulSets for etcd), no Helm support blocking OSS adoption (Temporal), limited auto-scaling, poor graceful termination, and absence from the CNCF ecosystem were the accumulating tax. The post is a playbook for tight-scope migrations: change the substrate, keep the abstraction, defer improvements to fast-follows, and use a three-active-cluster topology as a standing blast-radius reducer.
Figma is not a microservices shop — they have a small number of "powerful core services" — which is what made migrating to Kubernetes "digestible." This is the enabling condition the post repeatedly returns to.
Key takeaways
- Migrate the substrate, keep the abstraction (tight-scope migration). The default rule: change only the core system being swapped; keep deploy tooling, service definitions, and developer experience identical. Everything else has second-order effects that blow up timelines. Two explicit exceptions to the rule: (a) when matching old behavior is more expensive than absorbing the second-order effect, and (b) one-way-door decisions where retrofitting later is expensive. (Source: patterns/scoped-migration-with-fast-follows, concepts/tight-migration-scope)
- Three active EKS clusters per environment, each receiving real traffic for every service. Bad operator actions (example: destroyed and recreated CoreDNS on one cluster) reduce to ~1/3 of requests affected rather than a full outage; with retries most downstream services see "minimal disruption." This is a standing-state pattern, not a migration artifact — it now carries forward as a reliability primitive. (Source: patterns/multi-cluster-active-active-redundancy, concepts/active-multi-cluster-blast-radius)
- Unified single-step service definition (Bazel config → generated YAML) replaced the two-step Terraform + ECS-deploy flow. ECS required developers to `terraform apply` a zero-instance "template" ECS task set, then deploy the real task set that cloned the template and substituted the image hash — any change (like adding an env var) required those two operations in exactly that order. Commonly forgotten → many bugs. On EKS, Figma defined services in a single Bazel configuration file; CI generated the Service / Ingress / etc. YAMLs; their in-house deploy system applied them. One step. This was one of the two explicit "break the rule" calls. (Source: patterns/single-source-service-definition)
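The single-source idea can be sketched as one declarative record expanding into the full manifest set in one step. Field names and structure below are invented for illustration; Figma's real pipeline is Bazel config → generated YAML applied by their in-house deploy system.

```python
# Hypothetical sketch of a single-source service definition expanding into
# Kubernetes manifests. Not Figma's Bazel schema — illustrative only.

def generate_manifests(name: str, image: str, port: int, replicas: int = 3):
    """Expand one service definition into Deployment + Service manifests."""
    labels = {"app": name}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [
                    {"name": name, "image": image,
                     "ports": [{"containerPort": port}]},
                ]},
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {"selector": labels,
                 "ports": [{"port": 80, "targetPort": port}]},
    }
    return [deployment, service]

manifests = generate_manifests("multiplayer", "registry.example/mp:abc123", 8080)
print([m["kind"] for m in manifests])  # ['Deployment', 'Service']
```

The point is the contrast with the ECS flow: there is no ordering to forget, because every derived object is regenerated from the one source on every change.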
- Node auto-scaling (via Karpenter) scoped in, pod auto-scaling (via Keda) scoped out. ECS-on-EC2 was over-provisioned flat to handle deploy surges and peak load (expensive). Karpenter came online as part of the migration because the cost savings justified the added scope for little extra work. Keda / HPA came later, as a fast-follow, because it would have added second-order risk to the critical-path migration. (Source: systems/karpenter, systems/keda)
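Karpenter's core economics — provision nodes just-in-time, sized to the pending pods, instead of keeping a flat over-provisioned fleet — can be illustrated with a toy bin-pack. This is a first-fit-decreasing sketch on CPU only; Karpenter's real scheduler considers memory, instance types, zones, and consolidation.

```python
# Toy model of just-in-time node provisioning: pack pending pod CPU
# requests onto the fewest fixed-size nodes. Not Karpenter's algorithm.

def provision(pod_cpus: list, node_cpu: float) -> list:
    nodes = []
    for cpu in sorted(pod_cpus, reverse=True):  # first-fit-decreasing
        for node in nodes:
            if sum(node) + cpu <= node_cpu:
                node.append(cpu)
                break
        else:
            nodes.append([cpu])  # no node fits: launch a new one
    return nodes

# A deploy surge needing 14 vCPU fits on two 8-vCPU nodes launched on
# demand, instead of a flat fleet permanently sized for worst case.
surge = [4, 4, 2, 2, 1, 1]
print(len(provision(surge, node_cpu=8)))  # 2
```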
- Log-forwarding redesign (Vector sidecar) also deferred to a fast-follow. ECS pipeline wrote to CloudWatch → Lambda transform (redaction, tagging) → Datadog + Snowflake. Vector as an EKS sidecar could replace the Lambda-based forwarder and skip CloudWatch intermediate cost. The team deliberately left this out of migration scope to avoid porting the forwarder's logic into Vector configuration while other systems were also moving. Completed post-migration as a fast-follow. (Source: systems/vector, patterns/scoped-migration-with-fast-follows)
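The logic that had to be ported from the Lambda forwarder into Vector configuration is roughly this shape — a hedged sketch; the actual redaction rules, field names, and sinks are Figma's and are not public.

```python
import re

# Illustrative sketch of a log-forwarder transform of the kind the
# Lambda performed (and the Vector sidecar had to replicate): redact
# sensitive values, attach routing tags. Patterns and tag names invented.

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def transform(event: dict, service: str, env: str) -> dict:
    out = dict(event)
    out["message"] = EMAIL.sub("[REDACTED]", out.get("message", ""))
    out["tags"] = {"service": service, "env": env}  # for Datadog/Snowflake routing
    return out

rec = transform({"message": "login by user@example.com"}, "monolith", "prod")
print(rec["message"])  # login by [REDACTED]
```

Re-expressing rules like this in a different engine's config language, while the substrate underneath is also moving, is exactly the second-order risk the tight-scope rule defers.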
- Load-test the new cluster at real scale before real workloads. Figma created a "Hello, World" service and scaled it to the same number of pods as their largest services. This surfaced required tuning for core compute services — systems/kyverno (cluster security policies) had to be resized or it slowed down new-pod startup. Same pattern used earlier by Dropbox, Robinhood, and others: prove the platform at scale before giving it real tenants. (Source: patterns/load-test-at-scale)
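Why an undersized Kyverno shows up only at scale can be modeled as an admission bottleneck: every pod creation in a burst passes through the policy webhook, so admission throughput directly stretches startup tail latency. A toy back-of-envelope sketch (all numbers invented):

```python
# Toy model: a burst of pod creations drains through an admission
# webhook with fixed throughput; the last pod waits burst/capacity
# seconds. Numbers are illustrative, not Figma's measurements.

def last_pod_admitted(burst_pods: int, webhook_rps: float) -> float:
    """Seconds until the final pod in the burst clears admission."""
    return burst_pods / webhook_rps

# A 'Hello, World' service scaled to the largest service's pod count
# exposes what a small staging test never would:
print(last_pod_admitted(3000, webhook_rps=50))   # 60.0  (undersized)
print(last_pod_admitted(3000, webhook_rps=200))  # 15.0  (after resizing)
```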
- Weighted DNS to incrementally shift traffic per-service from ECS to EKS, one service at a time, with fine-grained revert. Each service transitioned gradually with DNS weights; failing behavior was reverted by adjusting weights back. Small blast radius at every step, matched to the incremental schedule (Q4 2023 → Jan 2024 main cutovers). (Source: patterns/weighted-dns-traffic-shifting)
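The mechanism reduces to adjusting record weights per service, with revert meaning restoring the old weights. A sketch, idealizing the resolver as a weighted random choice (real cutovers would use Route 53-style weighted records):

```python
import random

# Sketch of weighted DNS routing between old and new platforms.
# Resolver behavior is idealized as a weighted random pick.

def route(weights: dict, rng: random.Random) -> str:
    targets, w = zip(*weights.items())
    return rng.choices(targets, weights=w, k=1)[0]

# Shift one service 10% -> 50% -> 100% onto EKS; revert = old weights back.
rng = random.Random(0)
weights = {"ecs": 90, "eks": 10}
sample = [route(weights, rng) for _ in range(10_000)]
share = sample.count("eks") / len(sample)
print(round(share, 2))  # close to 0.10
```

Because the weight change is per-service and continuous, a misbehaving service can be dialed back without touching any other service's cutover.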
- Migrate one service to production before staging is done. Running real traffic through the new platform surfaced end-to-end bottlenecks and bugs that staging couldn't. "Well worth it" in Figma's framing — intentionally violates the usual staging-first ordering because incremental revertability made it safe.
- Golden path with escape hatches, not raw YAML. Defining services directly in YAML was "confusing." Figma instead defined a golden path (Bazel config → generated YAMLs) with explicit customization surfaces for special cases. Enforces consistency by default without blocking legitimate complexity. (Source: patterns/golden-path-with-escapes)
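An escape hatch can be as small as a per-service patch deep-merged over the generated defaults: the golden path stays the default, and special cases override only the fields they need. A sketch — Figma's actual customization surfaces live at the Bazel level, and this structure is invented:

```python
# Sketch of a golden-path escape hatch: generated defaults plus an
# optional per-service override patch, deep-merged on top.

def deep_merge(base: dict, patch: dict) -> dict:
    out = dict(base)
    for key, val in patch.items():
        if isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], val)  # recurse into nested maps
        else:
            out[key] = val
    return out

golden = {"spec": {"replicas": 3, "strategy": {"type": "RollingUpdate"}}}
escape_hatch = {"spec": {"strategy": {"type": "Recreate"}}}  # special case

merged = deep_merge(golden, escape_hatch)
print(merged["spec"]["replicas"], merged["spec"]["strategy"]["type"])
# 3 Recreate
```

Untouched defaults (here, `replicas`) survive the override, so the special case diverges only where it declared it must.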
- Collaborate with service owners on monitoring/alerting updates, don't do it for them. Service teams know their health signals best. Platform team handled infrastructure; service owners owned alert parity across the cutover. Early buy-in — discussing options and tradeoffs before starting — helped.
Post-migration follow-ups (explicitly called out)
- Tooling UX regressions emerged post-cutover. Two pain points dominated: (a) three-cluster topology made users specify a cluster name for every command; (b) RBAC role granularity (co-designed with security team, principle of least privilege) meant users had to know which role to assume for which task. Fix: tooling auto-infers both the right cluster and the right role. Paused other work to address; key during middle-of-the-night incidents.
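The fix reduces to inference tables: look the target up across the three active clusters and derive the RBAC role from the operation, so the operator types neither. A hypothetical sketch — all names and mappings below are invented:

```python
# Hypothetical sketch of the post-cutover tooling fix: infer which
# cluster a command targets and which least-privilege role to assume,
# instead of asking the operator for both on every command.

CLUSTERS = {
    "prod-a": {"multiplayer-7f9c", "monolith-a1b2"},
    "prod-b": {"multiplayer-3d4e"},
    "prod-c": {"livegraph-9z8y"},
}
ROLE_FOR_VERB = {"logs": "read-only", "exec": "debugger", "rollout": "deployer"}

def infer(pod: str, verb: str) -> tuple:
    # Find the one active cluster currently hosting this pod.
    cluster = next(c for c, pods in CLUSTERS.items() if pod in pods)
    return cluster, ROLE_FOR_VERB[verb]

print(infer("multiplayer-3d4e", "logs"))  # ('prod-b', 'read-only')
```

The payoff the post emphasizes is operational: during a middle-of-the-night incident, nobody should be reverse-engineering cluster names and role grants before they can run a command.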
- Keda-based horizontal pod auto-scaling shipped post-migration.
- Vector-based log-forwarding simplification shipped post-migration.
- Graviton migration for the most expensive service (save money, open a path for future services to run on Graviton).
- In-flight / planned: service-mesh exploration (likely Envoy-based) for reliability + observability of internal networking; moving more resources out of Terraform into AWS Controllers for Kubernetes (ACK) to further unify the stack; dev-environment unification (run-in-dev matches run-in-prod).
Quantitative + architectural details
- Timeline: Q1 2023 plan → Q2 2023 staging + first service → Q3 2023 productionalization + load tests → Q4 2023 + early Jan 2024 majority cutover. Under 12 months plan-to-majority-cutover.
- Services migrated by January 2024 include: the core monolith (business logic), the multiplayer service (one of the most complex — handles live collaborative Figma file edits), the Livegraph suite of services (newer real-time-updates-to-clients infra). The post implicitly confirms Figma's "small number of powerful core services" framing.
- Three active EKS clusters per environment, all receiving traffic for every service. Operations happen cluster-by-cluster (e.g. rolling config change across the three).
- Incident referenced: operator destroyed + recreated CoreDNS on one cluster → 1/3 of requests affected → most downstream retries succeeded against the other 2/3. Pre-EKS, same incident would have been full outage.
- Why Kubernetes over staying on ECS (listed in the post): StatefulSets (stateful workloads like etcd clusters — ECS needed custom membership-update code, "fragile and hard to maintain"); Helm charts for OSS (Temporal example); graceful EC2 cordon-and-drain on EKS vs painful on ECS-on-EC2; CNCF auto-scaling ecosystem (Keda / Karpenter); likely future service-mesh (Istio); AWS's expected relative investment in EKS (vendor-agnostic user base) vs ECS (AWS-only); easier hire of pre-trained K8s engineers; reduced vendor lock-in relative to ECS.
- Explicit non-argument: "We're not a microservices company and we don't plan to become one." A small service count was the key feasibility factor.
- Explicit non-goals for migration scope: pod autoscaling, Vector log forwarding, Graviton migration — all deferred to fast-follows.
Caveats
- Small-service-count is load-bearing. A microservices-at-scale org cannot transcribe this playbook directly — the 12-month timeline and tight-scope discipline rely on there being a small number of services to cut over.
- Three-cluster active-active has real cost. Triple cluster overhead (control plane, system pods like Kyverno/CoreDNS) plus service replication overhead. Budgeted against reliability gain.
- The tooling-UX blowup was unplanned. Even with tight-scoping rigour, UX regressions emerged from (a) the three-cluster topology itself and (b) RBAC granularity added for security. The fix was straightforward (auto-infer cluster + role) but the incident reveals a latent gap: platform UX review should be part of migration dry-runs even when platform UX is nominally "unchanged."
- Post says "most services migrated" by Jan 2024, not all. Some tail of services was still being migrated at the time of writing.
- Tier 3-equivalent source (Figma is not in AGENTS.md's formal Tier lists). Judged on-topic: distributed-systems internals (orchestrator swap, multi-cluster blast radius, service mesh, auto-scaling, stateful workloads, incremental migration) with production incident content (CoreDNS destruction).