FIGMA 2026-04-21 Tier 3-equivalent

Figma — Server-side sandboxing — Containers and seccomp

Summary

Part 3 of Figma's three-part security-engineering series on server-side sandboxing (also called workload isolation): the practice of accepting that vulnerabilities will exist in code that processes user-supplied input (images, documents, SVGs) and minimising the blast radius rather than trying to prevent every vulnerability outright. Where Part 2 covered the VM row of the primitive table, Part 3 covers the remaining two: containers (kernel-namespace + cgroup isolation) and seccomp-only (syscall-allowlist isolation). Figma's production exemplar is RenderServer, a C++ headless editor that converts Figma files to images and SVGs. It runs inside nsjail (which composes namespaces, cgroups, and seccomp-bpf) for the full-featured GPU-accelerated path, and inside a seccomp-only sandbox for certain non-GPU paths, after a source-code refactor that moved all open() calls ahead of image processing.

Key takeaways

  1. Containers are not automatically secure sandboxes. The level of isolation depends on three factors: runtime implementation (bugs in runC / containerd), OS primitives exposed to the runtime (kernel namespaces, cgroups, seccomp, SELinux, AppArmor — and kernel vulnerabilities in them), and runtime configuration (user-chosen). "Unlike commodity VM solutions, containers place a much greater responsibility on the user to correctly configure the desired level of isolation." More control = more room for mistakes. (Source: this article)

  2. Container escape has three attack-surface components. A kernel vulnerability, a runtime-implementation bug, or a runtime misconfiguration can each allow a workload to break out and modify files or execute code on the host. The post cites Dirty COW and Dirty Pipe, along with other CVE references, as recent kernel-level examples. See concepts/container-escape.

  3. Seccomp-only sandboxes are the lightest primitive but require workload discipline. The premise: "many programs do pure computation, and thus do not need dynamic access to the filesystem or to make network calls at all." For those, a seccomp allowlist restricting syscalls to (ideally) write-to-open-fd, exit, memory allocation, and time-fetching gives extremely strong isolation with negligible overhead. Seccomp is used in Android, Chrome, Firefox and composes with containerisation in systems/nsjail and systems/firejail. See concepts/seccomp / concepts/syscall-allowlist.

  4. The seccomp allowlist is brittle by construction. "Every incremental increase in allowed system calls results in extra kernel attack surface to consider." Figma restricts programs to writing output to already-open file descriptors, exiting, allocating memory, and fetching the current time, which avoids the filesystem, network/socket, and keychain surfaces entirely. The original seccomp (strict mode, merged into Linux in 2005) allowed only read, write, exit, and sigreturn; real programs need more, and each addition is a conscious attack-surface expansion.

  5. Seccomp's pointer-dereference limitation forces program refactors. Seccomp-bpf "can only filter syscall arguments at the top level and can't dereference pointer arguments." It cannot filter openat by path — the path is a pointer. So a program that needs to open files dynamically through user-input-driven codepaths cannot be seccomp-sandboxed without rewriting to open all files before the dangerous processing step. Figma refactored RenderServer's file I/O to do exactly this. See patterns/refactor-for-seccomp-filter.

  6. VM ↔ container ↔ seccomp is not a linear scale. Direct comparison is "more complicated and nuanced than with VMs." VMs have a small hypervisor attack surface but less fine-grained control + higher performance cost; containers have large kernel attack surface but fine-grained cgroup / namespace / seccomp / MAC controls; gVisor interposes its own hardened kernel between the host kernel and the container process as a middle option; seccomp-only can achieve "extremely strong isolation if only minimal syscalls are allowed" with the lowest overhead. Orchestration + correct configuration are the recurring tax.

  7. RenderServer at Figma uses nsjail as a drop-in solution. Figma explicitly rejected Docker because "we would need to create a new service that sandboxes the RenderServer binary inside a secure Docker configuration, create an orchestration system to manage the service, and re-architect various services to make a network call to the RenderServer service instead of invoking the binary directly." For each user request, nsjail starts RenderServer in new user / pid / mount / network namespaces with no network access, only specific mount points (input file, libraries, output folder), and a seccomp-bpf filter. See systems/nsjail.

  8. Rollout surfaced real configuration foot-guns.

     • Default rlimit_fsize = 1 MB silently truncated output files for large-image inputs → job errors correlated with exactly-1-MB outputs. Fix: one-line config change after reading docs carefully.
     • Seccomp allowlist needed several iterations: rare codepaths in RenderServer's complex C++ codebase triggered syscalls not hit during testing or internal use. "Kernel logs will indicate when a process is killed by seccomp and which syscall caused the problem, without providing much more context."

  9. Seccomp-only RenderServer traded operational simplicity for engineering invariants. For non-GPU paths Figma refactored RenderServer so all file opens occur before any image processing happens on potentially dangerous user input, then applied a restrictive seccomp filter via libseccomp. Result:

     • ✅ Easier to test and debug than nsjail.
     • ✅ Significantly faster at runtime.
     • ❌ Locks RenderServer into a single-threaded model.
     • ❌ Cannot dynamically load fonts or images later in runtime.

  10. Startup cost of container sandboxes is real but bounded. "The startup time of nsjail is typically on the order of small fractions of a second, tens to low hundreds of milliseconds. There is, however, still a long tail of startup times, and initializing a language runtime within the container can take substantially longer." Lower overhead than VMs, higher than seccomp-only.
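
Takeaways 4 and 5 can be illustrated with a small pure-Python model. This is not a real seccomp filter and installs nothing in the kernel; it only mimics the decision a seccomp-bpf program makes, which sees the syscall and its raw integer argument values. The `Syscall` type, the `decide` function, and the exact names in `ALLOWED` are assumptions chosen for illustration, since the post does not disclose Figma's actual allowlist.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Syscall:
    name: str
    # Raw register values. A pointer argument is just an address here,
    # so the filter cannot see the string or struct it points to.
    args: Tuple[int, ...]


# Hypothetical allowlist covering the four families named in the post:
# write to open fds, exit, memory allocation, and fetching the time.
ALLOWED = {"write", "exit", "exit_group", "brk", "mmap", "munmap", "clock_gettime"}


def decide(call: Syscall, open_fds: frozenset) -> str:
    """Default-deny: anything not explicitly allowed kills the process."""
    if call.name not in ALLOWED:
        return "KILL"
    # Top-level integer arguments can be compared, e.g. the fd given to write().
    if call.name == "write" and call.args[0] not in open_fds:
        return "KILL"
    return "ALLOW"


fds = frozenset({1, 3})  # descriptors opened before the filter was applied
verdicts = [
    decide(Syscall("write", (3, 0x7F00DEAD0000, 4096)), fds),
    decide(Syscall("write", (9, 0x7F00DEAD0000, 4096)), fds),
    # The second argument is the address of the path string; the filter
    # has no way to tell "/tmp/ok" from "/etc/shadow", so openat is denied.
    decide(Syscall("openat", (-100, 0x7F00BEEF0000, 0)), fds),
    decide(Syscall("clock_gettime", (0, 0x7F00CAFE0000)), fds),
]
print(verdicts)  # ['ALLOW', 'KILL', 'KILL', 'ALLOW']
```

Adding any name to `ALLOWED` makes a whole kernel codepath reachable from untrusted input, which is why the post treats each addition as a deliberate decision.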

Operational numbers disclosed

  • nsjail startup latency: tens to low hundreds of milliseconds; long tail; language-runtime init can extend significantly.
  • Seccomp allowlist adjustments during rollout: "several times" — driven by rare codepaths hit in production that weren't exercised in testing.
  • rlimit_fsize default: 1 MB (nsjail default, tripped by large-image outputs).
  • Seccomp allowlist target at Figma: write to already-open fds, exit, memory allocation, and fetching the current time (for example via clock_gettime). Every additional syscall is a conscious attack-surface expansion.
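
The rlimit_fsize symptom (failed jobs whose outputs are exactly 1 MB) suggests a cheap check that could be run over job outputs. The sketch below is an illustration, not Figma's tooling: the helper name `truncated_outputs` and the file names are hypothetical, and it assumes the limit is counted in units of 1024 × 1024 bytes.

```python
import os
import tempfile
from typing import Iterable, List

# Assumption: one "MB" of rlimit_fsize corresponds to 1024 * 1024 bytes.
MIB = 1024 * 1024


def truncated_outputs(paths: Iterable[str], limit_mb: int = 1) -> List[str]:
    """Return the paths whose size equals the file-size limit exactly.

    A file that lands precisely on the limit is a strong hint that the
    writer was cut off rather than finished on its own.
    """
    limit_bytes = limit_mb * MIB
    return [p for p in paths if os.path.getsize(p) == limit_bytes]


# Demo with two hypothetical output files: one small, one at the cap.
with tempfile.TemporaryDirectory() as workdir:
    sizes = {"small.png": 200_000, "capped.png": MIB}
    paths = []
    for name, size in sizes.items():
        path = os.path.join(workdir, name)
        with open(path, "wb") as fh:
            fh.write(b"\0" * size)
        paths.append(path)
    flagged = [os.path.basename(p) for p in truncated_outputs(paths)]

print(flagged)  # ['capped.png']
```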

Systems / concepts / patterns extracted

Systems:

  • systems/nsjail: Google's command-line tool that stacks Linux namespaces, capabilities, filesystem restrictions, cgroups, resource limits, and seccomp. Figma's production sandbox for RenderServer.
  • systems/firejail: SUID-based sandbox, referenced as nsjail's sibling for composing seccomp with containerisation.
  • systems/docker: rejected for RenderServer due to orchestration overhead; named as the container platform whose runC runtime exposes namespaces / cgroups / seccomp / SELinux / AppArmor.
  • systems/runc: Docker's default runtime, whose bugs and misconfigurations contribute to the container-escape attack surface.
  • systems/gvisor: interposes a hardened kernel between the host kernel and the container process; reduces the container attack surface at the cost of syscall-interception overhead and compatibility gaps.
  • systems/figma-renderserver: the subject of the sandboxing decision; a C++ headless Figma editor used for thumbnailing and file-format conversion.
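
To make the nsjail entry concrete, the sketch below assembles a per-request command line of the kind described above: read-only mounts for the input and libraries, a writable output mount, an explicit file-size limit, and a seccomp policy file. It is an assumption-laden illustration, not Figma's configuration. The flag names follow nsjail's command-line help and should be checked against `nsjail --help` for the installed version; every path, the 64 MB limit, and the RenderServer arguments after `--` are hypothetical.

```python
from typing import List


def nsjail_argv(input_file: str, output_dir: str, policy_file: str,
                fsize_mb: int = 64) -> List[str]:
    """Build an argv list for one sandboxed render (hypothetical layout)."""
    return [
        "nsjail",
        "--mode", "o",  # run the command once, then exit
        # Read-only bind mounts: the binary, its libraries, the input file.
        "--bindmount_ro", "/opt/renderserver:/opt/renderserver",
        "--bindmount_ro", "/usr/lib:/usr/lib",
        "--bindmount_ro", f"{input_file}:/in/input.fig",
        # The only writable location inside the jail.
        "--bindmount", f"{output_dir}:/out",
        # Set explicitly so large outputs are not cut off at the 1 MB default.
        "--rlimit_fsize", str(fsize_mb),
        "--seccomp_policy", policy_file,
        "--",
        "/opt/renderserver/bin/renderserver", "/in/input.fig", "/out",
    ]


argv = nsjail_argv("/tmp/job/in.fig", "/tmp/job/out", "/etc/renderserver.policy")
print(" ".join(argv))
```

Building the list in the caller, rather than behind a network service, mirrors the drop-in property the post valued: the existing code that invoked the binary directly only has to prepend the sandbox arguments.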

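Concepts:

  • concepts/container-escape: a workload breaking out of its container to modify files or execute code on the host, via a kernel vulnerability, a runtime bug, or a misconfiguration.
  • concepts/seccomp: Linux syscall filtering, used on its own or composed with namespaces and cgroups.
  • concepts/syscall-allowlist: a default-deny list of permitted syscalls; each addition widens the kernel attack surface.
  • concepts/defense-in-depth: stacking independent isolation mechanisms so that a failure in one does not expose the host.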

Patterns:

  • patterns/refactor-for-seccomp-filter — reorder all dangerous syscalls (file opens, socket creation, etc.) to happen before user-input processing, then apply a restrictive seccomp filter that denies those syscalls for the rest of the process lifetime. Canonical example: RenderServer's SVG-export path.
  • patterns/seccomp-bpf-container-composition — combine namespaces + cgroups + seccomp-bpf in one sandbox (nsjail / firejail) so that multiple independent isolation mechanisms compose as concepts/defense-in-depth.
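
The ordering behind patterns/refactor-for-seccomp-filter can be modelled in a few lines. The sketch below is a toy stand-in: the `Sandbox` class only flips a flag where a real implementation would load a seccomp filter (the post says Figma used libseccomp), and `render`, the file names, and the upper-casing "processing" step are all hypothetical.

```python
import os
import tempfile


class Sandbox:
    """Toy stand-in for a seccomp filter: after seal(), opens are refused."""

    def __init__(self) -> None:
        self.sealed = False

    def open_file(self, path: str, flags: int) -> int:
        if self.sealed:
            raise PermissionError(f"open after seal: {path}")
        return os.open(path, flags, 0o600)

    def seal(self) -> None:
        # A real implementation would install the syscall filter here.
        self.sealed = True


def render(sandbox: Sandbox, in_path: str, out_path: str) -> None:
    # Phase 1: acquire every descriptor and read the input bytes up front.
    in_fd = sandbox.open_file(in_path, os.O_RDONLY)
    out_fd = sandbox.open_file(out_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
    chunks = []
    while True:
        chunk = os.read(in_fd, 65536)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(in_fd)
    sandbox.seal()
    # Phase 2: process untrusted bytes; only write to the already-open fd.
    try:
        os.write(out_fd, b"".join(chunks).upper())  # placeholder processing
    finally:
        os.close(out_fd)


with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "in.txt")
    dst = os.path.join(workdir, "out.txt")
    with open(src, "wb") as fh:
        fh.write(b"figma")
    sandbox = Sandbox()
    render(sandbox, src, dst)
    with open(dst, "rb") as fh:
        result = fh.read()
    try:
        sandbox.open_file(src, os.O_RDONLY)
        late_open = "allowed"
    except PermissionError:
        late_open = "refused"

print(result, late_open)  # b'FIGMA' refused
```

The refused late open corresponds to the trade-off noted in the takeaways: once the filter is in place, fonts or images cannot be loaded on demand.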

Caveats

  • No QPS / throughput / fleet-size numbers for RenderServer in production — only qualitative "small fractions of a second" nsjail startup and "significantly faster" for the seccomp-only variant.
  • No disclosure of the exact seccomp allowlist (only the four named families: write-to-open-fd, exit, memory allocation, time).
  • No cost comparison (nsjail vs seccomp-only vs VM) in $/workload or CPU-seconds.
  • No incident retrospective — the post frames rollout as smooth modulo the rlimit_fsize foot-gun and the expected seccomp-allowlist iteration.
  • No multi-region / multi-AZ deployment detail: RenderServer's actual scheduling and failure-recovery architecture is not disclosed (nsjail runs as a per-request process; the pool of workers that invokes it is not described).
  • Compared-to-VMs section is qualitative ("the attack surface of a hypervisor is usually smaller than for an OS kernel") with CVE references but no quantitative surface measurement.

Source
