FIGMA 2026-04-21 Tier 3-equivalent

Figma — Server-side sandboxing — Virtual machines

Summary

Part 2 of Figma's three-part security-engineering series on server-side sandboxing (aka workload isolation) — the practice of accepting that vulnerabilities will exist and minimising blast radius rather than trying to prevent every exploit outright. This post focuses on the virtual-machine security model: what it buys you, what it costs, and how Figma uses it in production via AWS Lambda (Firecracker) for stateless link-metadata + image-fetch workloads. Complements Part 1 (intro, now ingested — see sources/2026-04-21-figma-server-side-sandboxing-an-introduction) and Part 3 (containers + seccomp, now ingested — see sources/2026-04-21-figma-server-side-sandboxing-containers-and-seccomp).

Key takeaways

  1. Two relevant security boundaries in the VM model — the hypervisor (separates host ↔ guest and guest ↔ guest) and the VM's own permissions (what a compromised workload inside the VM can reach). Both must be reasoned about; defence-in-depth assumes the first might fail. (Source: this article)

  2. Hypervisor attack surface is large and mostly unmodifiable by users. Hypervisors mediate many OS + hardware operations, so they expose broad attack surface that occasionally yields VM escapes (guest → host takeover or guest ↔ guest information leakage). Most IaaS tenants rely on the hypervisor boundary implicitly anyway — "If your system runs in this type of cloud provider and you don't limit yourself to bare-metal instances, then your security model has to rely on the hypervisor security boundary to some extent anyway." See concepts/vm-escape.

  3. VM permissions matter independently of escape risk. Even with a perfect hypervisor, a compromised job inside a VM can still make outbound network calls, exfiltrate data, and use the VM's credentials. The control you do have is restricting the VM itself: minimise network egress, minimise IAM permissions, bound credential and VM lifetime. See patterns/minimize-vm-permissions.

  4. VMs are the heavyweight sandboxing choice. Trade-off profile:
     • Compatibility — most workloads run unmodified; required when you need full OS features (e.g., running office suites / browsers to convert documents to PDF).
     • Performance cost — granular per-workload isolation means per-workload setup/teardown; cold-start latency is material. A warmed-up VM pool helps but adds orchestration complexity (concepts/cold-start).
     • Development cost — debugging is usually fine; cluster operations are hard (deep internals, per-VM state tracking, routing decisions). Specialised micro-VMs like Firecracker may require a custom runtime.

  5. Figma uses AWS Lambda (→ Firecracker) for its sandboxing-is-appropriate cases. Two concrete production workloads named:
     • Link metadata fetcher for link previews in FigJam — fetches third-party URLs, resizes/converts images (ImageMagick) for the Figma frontend.
     • Canvas image fetcher — fetches external images used in the Figma canvas.
     Both run with no special privileges, outside the Figma production VPC, so an ImageMagick or fetch-logic exploit grants no pivot into Figma's internal services. Exemplar of "minimise VM permissions so escape risk isn't the only barrier."

  6. Latency vs isolation trade-off is Figma-visible. Figma has "minimal control over routing at this level" — AWS reuses individual Lambda VMs for multiple requests from the same tenant because Firecracker boot times (fast as they are) are still too slow to pay on every synchronous core request. This VM-reuse-within-tenant is explicitly called out as a reasonable security trade-off for Figma's use case.

  7. Specific gotchas Figma hardened against in Lambda:
     • Localhost runtime API — the Lambda environment includes an HTTP listener on localhost that returns the triggering request contents (or lets any in-VM caller forge a response). An SSRF vulnerability in application code would expose this; Figma explicitly ensures application code cannot make localhost requests.
     • Not "raw" compute — Lambdas run inside a cloud environment and can be mis-configured to have special privileges (internal-network placement or IAM permissions to touch other cloud resources). Default-deny posture required.
     • Concurrency limit — reserved concurrency draws on a quota shared across all Lambdas in an AWS account + region; contention on it is a real operational concern.

  8. First-deploy latency reality. Initial Lambda call took up to 10 seconds; after warming, average latency dropped materially, but "we also had to invest direct engineering efforts into ensuring that we were minimizing startup and processing costs as much as possible."

  9. Broader sandboxing-choice frame (from Part 1, reaffirmed here): environment / security + performance / development cost & friction / maintenance & operational overhead are the four axes for choosing among sandboxing primitives. No single winner — pick the lightest option that gives you the boundary you actually need.
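The "minimise IAM permissions" half of the minimise-VM-permissions pattern can be made concrete as a least-privilege execution-role policy plus a cheap review guardrail. This is an illustrative sketch, not Figma's actual configuration: the policy contents and the `grants_only` helper are hypothetical, using standard AWS IAM policy JSON shape.

```python
# Hypothetical least-privilege execution-role policy for a sandboxed
# fetcher Lambda: CloudWatch Logs only — no VPC access, no other services.
MINIMAL_EXECUTION_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
            ],
            "Resource": "arn:aws:logs:*:*:*",
        }
    ],
}


def grants_only(policy: dict, allowed_prefixes: tuple) -> bool:
    """Return True if every Allow statement stays within the expected
    action prefixes — a cheap default-deny guardrail for policy review."""
    for stmt in policy["Statement"]:
        if stmt["Effect"] != "Allow":
            continue
        actions = stmt["Action"]
        if isinstance(actions, str):  # IAM allows a bare string here
            actions = [actions]
        for action in actions:
            if not action.startswith(allowed_prefixes):
                return False
    return True
```

A check like `grants_only(policy, ("logs:",))` in CI makes an accidental grant of, say, `s3:GetObject` a build failure rather than a latent pivot path.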
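The warmed-up VM pool mentioned under the performance-cost trade-off can be modelled in a few lines. This is a toy sketch of the idea only — the boot delay is a simulated constant, not a real Firecracker call, and the class and constant names are invented for illustration.

```python
import collections
import time


class WarmVmPool:
    """Toy model of a warmed micro-VM pool: boot cost is paid ahead of
    time, so the request path only pays invoke cost while the pool has
    idle VMs."""

    BOOT_SECONDS = 0.05  # stand-in for micro-VM boot time

    def __init__(self, size: int):
        self.idle = collections.deque()
        for _ in range(size):
            self.idle.append(self._boot())  # pre-warm the pool

    def _boot(self):
        time.sleep(self.BOOT_SECONDS)  # simulate VM startup
        return object()                # opaque VM handle

    def invoke(self, request: str) -> str:
        # Warm path: take an idle VM; cold path: boot one on demand.
        vm = self.idle.popleft() if self.idle else self._boot()
        try:
            return f"handled {request}"  # stand-in for running the workload
        finally:
            self.idle.append(vm)         # reuse: return VM to the pool
```

The `finally` clause is the trade-off the article names: reusing a VM avoids repeat boot cost but means consecutive requests share an environment, which AWS bounds by reusing VMs only within a single tenant.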
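The localhost runtime-API gotcha is typically closed by refusing any fetch whose target resolves to a non-public address, so an SSRF bug in fetch logic cannot reach the in-VM listener or cloud metadata endpoints. A minimal sketch of that guard — not Figma's implementation; the function name and blocking policy are illustrative:

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_blocked_url(url: str) -> bool:
    """Return True if the URL's host resolves to any non-public address
    (loopback, private, link-local, ...) and the fetch should be refused."""
    host = urlsplit(url).hostname
    if host is None:
        return True  # no host at all: refuse
    try:
        # Resolve every A/AAAA record; an attacker can pin one public
        # and one internal address to the same name.
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return True  # unresolvable: refuse
    for info in infos:
        addr = ipaddress.ip_address(info[4][0])
        if not addr.is_global:
            return True  # loopback, RFC 1918, link-local, etc.
    return False
```

Note the check runs on resolved addresses, not the hostname string, which also blocks DNS names that point at `127.0.0.1` or the `169.254.169.254` metadata service; a production version would additionally pin the resolved address for the actual connection to avoid a resolve-then-fetch race.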

Figma-specific numbers disclosed

  • None (no p50/p99 latency distribution, no QPS, no invocation cost, no numeric concurrency limit).
  • Only quantitative datapoint: "up to 10 seconds" for first un-warmed call before tuning.

Caveats

  • Article is Part 2 of 3 — the container + seccomp companion (Part 3) is ingested separately; the full three-primitive comparison lives across the series, not in this post alone.
  • No absolute numbers disclosed (cost, QPS, latency percentiles, concurrency-limit value) beyond the "up to 10 s" cold-start anecdote.
  • Figma's internal platform-engineering / SRE view of running the Lambda fleet (concurrency-quota management, routing policy, observability) is abstracted into "additional development and maintenance costs" without operational detail.
  • "Minimal control over routing" framing conflates AWS's Firecracker VM-reuse heuristic with routing; Figma has no API-level say in which VM a given request lands on, only in the broader decision to use Lambda at all.
