PATTERN Cited by 1 source

Seccomp-bpf container composition

Shape

Compose the independent Linux isolation primitives — namespaces + cgroups + capability dropping + seccomp-bpf — into a single sandbox. No one primitive is sufficient; each defends a different axis, and a compromise of one is not automatically a compromise of the others (defence-in-depth at the kernel-primitive layer).

┌──────────────────────────────────────────────────┐
│ nsjail / firejail / Docker+seccomp / custom      │
│                                                  │
│  ┌───────────────────────────────────────────┐   │
│  │ namespaces: user + pid + mount + net      │   │   What can the process SEE?
│  ├───────────────────────────────────────────┤   │
│  │ cgroups: CPU / mem / IO / PIDs limit      │   │   How much can it USE?
│  ├───────────────────────────────────────────┤   │
│  │ capability set: drop CAP_SYS_ADMIN, ...   │   │   What POWER does it have?
│  ├───────────────────────────────────────────┤   │
│  │ seccomp-bpf syscall allowlist             │   │   Which KERNEL CODE can it reach?
│  ├───────────────────────────────────────────┤   │
│  │ (optional) LSM: SELinux / AppArmor        │   │   What FILE LABELS?
│  └───────────────────────────────────────────┘   │
└──────────────────────────────────────────────────┘

(Source: sources/2026-04-21-figma-server-side-sandboxing-containers-and-seccomp)

Why compose

Each primitive has a specific limit that the others cover:

  • Seccomp alone cannot filter openat by path: a BPF filter sees only the syscall number and the raw argument values, and it cannot dereference the pathname pointer. So add a mount namespace that makes only safe files visible.
  • Mount namespace alone cannot prevent a process from calling socket() and reaching network services — so add a network namespace with no interfaces.
  • Network + mount namespaces alone cannot stop a CPU bomb or memory hog, so add cgroups with resource limits.
  • Cgroups alone cannot prevent a compromised process from reaching kernel bugs through obscure syscalls — so add a seccomp allowlist excluding the obscure ones.
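
The first limit above follows from what a seccomp-bpf filter is handed for each syscall. A minimal sketch, mirroring the kernel's struct seccomp_data with ctypes (the SeccompData class name and the /etc/passwd example are illustrative; the constants assume x86-64):

```python
import ctypes


class SeccompData(ctypes.Structure):
    """Mirror of `struct seccomp_data` from linux/seccomp.h."""
    _fields_ = [
        ("nr", ctypes.c_int),                      # syscall number
        ("arch", ctypes.c_uint32),                 # AUDIT_ARCH_* value
        ("instruction_pointer", ctypes.c_uint64),  # caller's instruction pointer
        ("args", ctypes.c_uint64 * 6),             # raw argument register values
    ]


# openat(dirfd, pathname, flags, mode): args[1] holds a userspace pointer.
path = ctypes.create_string_buffer(b"/etc/passwd")
data = SeccompData(nr=257, arch=0xC000003E)  # x86-64 openat, AUDIT_ARCH_X86_64
data.args[1] = ctypes.addressof(path)

# Everything a filter can compare against is in these 64 bytes.
# The pathname itself is not among them, only its address.
print(ctypes.sizeof(SeccompData), hex(data.args[1]))
```

The filter can match on nr and on the integer values in args, but the string behind args[1] lives in process memory the BPF program cannot read, which is why path restrictions have to come from the mount namespace.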

Figma's framing:

"Seccomp can be combined with containerization to provide robust, multilayered sandbox-focused systems, such as nsjail and firejail."

No single layer is assumed perfect; compromise requires defeating multiple independent kernel mechanisms.

Canonical implementations

nsjail

Google's command-line tool stacks: namespaces (user + pid + mount + net) + filesystem restrictions + cgroups + resource limits + seccomp. Per-invocation. No image, no daemon. Figma's production sandbox for RenderServer's full-featured GPU path.
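
A minimal sketch of what a per-invocation composition could look like when built from Python. The flag names are recalled from nsjail's README and help output and should be checked against `nsjail --help` for the installed version; the nsjail_argv helper, the limits, and the three-syscall policy are hypothetical and are not Figma's configuration.

```python
def nsjail_argv(command, *, chroot, mem_bytes, max_pids, time_limit_s, seccomp_policy):
    """Build an nsjail command line that stacks several isolation layers.

    Assumption: flag names follow nsjail's documented CLI; verify locally.
    """
    return [
        "nsjail",
        "--mode", "o",                          # run a single process, then exit
        "--chroot", chroot,                     # filesystem view (mount namespace)
        "--user", "99999", "--group", "99999",  # unprivileged uid/gid inside the jail
        "--time_limit", str(time_limit_s),      # wall-clock limit in seconds
        "--cgroup_mem_max", str(mem_bytes),     # cgroup memory ceiling
        "--cgroup_pids_max", str(max_pids),     # cgroup process-count ceiling
        "--seccomp_string", seccomp_policy,     # syscall policy in Kafel syntax
        "--", *command,
    ]


# Hypothetical policy: far too small for a real program, shown for shape only.
POLICY = "ALLOW { read, write, exit_group } DEFAULT KILL"

argv = nsjail_argv(
    ["/bin/echo", "hello"],
    chroot="/",
    mem_bytes=256 * 1024 * 1024,
    max_pids=16,
    time_limit_s=10,
    seccomp_policy=POLICY,
)
print(" ".join(argv))
```

Each flag maps to one layer of the diagram, so dropping a flag removes one axis of defence while leaving the others in place.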

firejail

SUID-based, profile-library-driven sibling of nsjail. Same composition idea; oriented toward desktop-app sandboxing.

Docker / runC with seccomp profile

Container runtimes ship a default seccomp profile that blocks a few dozen rarely used syscalls (keyctl, add_key, mount, etc.); Docker's documentation puts it at around 44 of 300+. Composed with the container runtime's default namespace + cgroup setup. Operator can extend or replace it.
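
A custom profile uses the same JSON shape as the default one. The sketch below generates a deny-by-default profile; the build_profile helper and the syscall list are hypothetical, and a list this short would not be enough to start a real container.

```python
import json


def build_profile(allowed):
    """Return a Docker-format seccomp profile that allows only `allowed`."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",       # everything not listed fails with an error
        "architectures": ["SCMP_ARCH_X86_64"],   # assumption: x86-64 host
        "syscalls": [
            {"names": sorted(set(allowed)), "action": "SCMP_ACT_ALLOW"},
        ],
    }


# Hypothetical allowlist, for illustration only.
EXAMPLE_ALLOWLIST = ["read", "write", "close", "exit_group", "rt_sigreturn"]

profile = build_profile(EXAMPLE_ALLOWLIST)
print(json.dumps(profile, indent=2))
```

Saved to a file, a profile like this is passed with `docker run --security-opt seccomp=/path/to/profile.json`; in practice it is usually easier to start from the runtime's default profile and remove entries than to build an allowlist from scratch.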

Configuration discipline

A composition only holds if every layer is configured correctly; one misconfigured layer can undo the others. Figma's framing:

"Unlike commodity VM solutions, containers place a much greater responsibility on the user to correctly configure the desired level of isolation. More control over security configuration also means more room to make mistakes."

Operator foot-guns that break composition:

  • --privileged removes most isolation.
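  • --cap-add SYS_ADMIN restores CAP_SYS_ADMIN, one of the broadest capabilities (mounting, namespace manipulation, and many other administrative operations).
  • --security-opt seccomp=unconfined disables seccomp entirely.
  • Mounting the host Docker socket lets any compromised process in the container drive the Docker daemon, which is enough to escape the container.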
  • Host network (--net=host) collapses the network namespace.
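
A check for the foot-guns listed above can run in CI before a `docker run` line ships. This is a naive sketch: the find_footguns helper is hypothetical, it scans every token (including the container's own command), and it covers only the five items in the list.

```python
# Flags whose value may arrive as the next token ("--flag value").
VALUE_FLAGS = {"--cap-add", "--security-opt", "-v", "--volume", "--mount", "--net", "--network"}


def find_footguns(argv):
    """Return human-readable findings for isolation-breaking docker run flags."""
    # Normalize "--flag=value" and "--flag value" into (flag, value) pairs.
    pairs = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg.startswith("-") and "=" in arg:
            flag, value = arg.split("=", 1)
        elif arg in VALUE_FLAGS and i + 1 < len(argv):
            flag, value = arg, argv[i + 1]
            i += 1
        else:
            flag, value = arg, ""
        pairs.append((flag, value))
        i += 1

    findings = []
    for flag, value in pairs:
        if flag == "--privileged" and value != "false":
            findings.append("--privileged removes most isolation")
        elif flag == "--cap-add" and value.upper().removeprefix("CAP_") == "SYS_ADMIN":
            findings.append("CAP_SYS_ADMIN restored")
        elif flag == "--security-opt" and value in ("seccomp=unconfined", "seccomp:unconfined"):
            findings.append("seccomp disabled")
        elif flag in ("-v", "--volume", "--mount") and "docker.sock" in value:
            findings.append("host Docker socket mounted")
        elif flag in ("--net", "--network") and value == "host":
            findings.append("host network namespace shared")
    return findings


risky = ["docker", "run", "--privileged", "--net=host",
         "-v", "/var/run/docker.sock:/var/run/docker.sock",
         "--security-opt", "seccomp=unconfined",
         "--cap-add", "SYS_ADMIN", "alpine"]
for finding in find_footguns(risky):
    print(finding)
```

A string check like this catches only the command-line form; the same settings can also arrive through Compose files or orchestrator specs, which need their own checks.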

When to use this pattern over seccomp-only

  • Workload is not source-modifiable (can't apply the refactor pattern).
  • Workload needs dynamic file / network access during the untrusted-input phase — seccomp-only can't filter those, but a mount/network namespace can mask what's visible.
  • Workload is commodity (third-party binaries, scripts from untrusted source) — you need coarse containment, not fine-grained syscall curation.
  • Multiple independent workloads on the same host need pairwise isolation — namespaces + cgroups are the primitives that give that.

When refactor-for-seccomp is preferable

  • You own the source.
  • Performance matters: the composition-layer sandbox (nsjail etc.) has measurable per-invocation startup overhead (tens to low hundreds of milliseconds), while a seccomp filter installed once for the lifetime of the process adds no startup cost of that kind.
  • Operational simplicity matters — one filter, one program, one binary.
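
The startup-overhead point is easy to measure for a given workload rather than assume. A minimal sketch: median_startup_ms is a hypothetical helper that times full process launches, so running it once on the bare command and once on the same command behind a sandbox wrapper gives the per-invocation cost as the difference.

```python
import statistics
import subprocess
import sys
import time


def median_startup_ms(argv, runs=5):
    """Median wall-clock time, in milliseconds, to launch argv and wait for exit."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


if __name__ == "__main__":
    bare = [sys.executable, "-c", "pass"]
    print(f"bare: {median_startup_ms(bare):.1f} ms")
    # To compare, prepend the sandbox wrapper's argv to `bare`
    # and call median_startup_ms again on the combined list.
```

The median is used instead of the mean so that a single slow launch (cold cache, scheduler noise) does not dominate a small sample.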

Figma uses both patterns in RenderServer: nsjail for the GPU path where the program couldn't be refactored tightly enough (dynamic font / image loading needed during render); seccomp-only refactor for the non-GPU path where they could.
