PATTERN Cited by 1 source
Seccomp-bpf container composition¶
Shape¶
Compose the independent Linux isolation primitives — namespaces + cgroups + capability dropping + seccomp-bpf — into a single sandbox. No one primitive is sufficient; each defends a different axis, and a compromise of one is not automatically a compromise of the others (defence-in-depth at the kernel-primitive layer).
┌──────────────────────────────────────────────────┐
│ nsjail / firejail / Docker+seccomp / custom      │
│                                                  │
│  ┌───────────────────────────────────────────┐   │
│  │ namespaces: user + pid + mount + net      │   │  What can the process SEE?
│  ├───────────────────────────────────────────┤   │
│  │ cgroups: CPU / mem / IO / PIDs limit      │   │  How much can it USE?
│  ├───────────────────────────────────────────┤   │
│  │ capability set: drop CAP_SYS_ADMIN, ...   │   │  What POWER does it have?
│  ├───────────────────────────────────────────┤   │
│  │ seccomp-bpf syscall allowlist             │   │  Which KERNEL CODE can it reach?
│  ├───────────────────────────────────────────┤   │
│  │ (optional) LSM: SELinux / AppArmor        │   │  What FILE LABELS?
│  └───────────────────────────────────────────┘   │
└──────────────────────────────────────────────────┘
(Source: sources/2026-04-21-figma-server-side-sandboxing-containers-and-seccomp)
Why compose¶
Each primitive has a specific limit that the others cover:
- Seccomp alone cannot filter `openat` by path: a BPF filter sees only the pointer argument and cannot dereference it to read the path string, so add a mount namespace that only makes safe files visible.
- Mount namespace alone cannot prevent a process from calling `socket()` and reaching network services, so add a network namespace with no interfaces.
- Network + mount namespaces alone cannot stop a CPU bomb or memory hog, so add cgroups with resource limits.
- Cgroups alone cannot prevent a compromised process from reaching kernel bugs through obscure syscalls, so add a seccomp allowlist that excludes the obscure ones.
Figma's framing:
"Seccomp can be combined with containerization to provide robust, multilayered sandbox-focused systems, such as nsjail and firejail."
No single layer is assumed perfect; compromise requires defeating multiple independent kernel mechanisms.
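Because each layer is configured separately, it can help to check from inside the sandbox that the layers are actually in force. The following is a minimal sketch, not part of the source: it parses the `Seccomp`, `NoNewPrivs`, and `CapEff` fields of `/proc/self/status` (field names as documented in proc(5)). The function names `parse_status` and `layer_report` are hypothetical. It only covers the seccomp and capability layers; namespaces and cgroups would need separate checks (for example comparing `/proc/self/ns/*` links against the host's).

```python
CAP_SYS_ADMIN_BIT = 21  # bit index of CAP_SYS_ADMIN in <linux/capability.h>
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def parse_status(text: str) -> dict:
    """Turn the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def layer_report(status_text: str) -> dict:
    """Summarise the seccomp and capability layers (hypothetical helper).

    Missing fields are treated as "0", i.e. reported as not enforced.
    """
    fields = parse_status(status_text)
    cap_eff = int(fields.get("CapEff", "0"), 16)  # hex bitmask
    seccomp_mode = int(fields.get("Seccomp", "0"))
    return {
        "seccomp": SECCOMP_MODES.get(seccomp_mode, "unknown"),
        "no_new_privs": fields.get("NoNewPrivs", "0") == "1",
        "has_cap_sys_admin": bool((cap_eff >> CAP_SYS_ADMIN_BIT) & 1),
    }


if __name__ == "__main__":
    with open("/proc/self/status") as fh:
        print(layer_report(fh.read()))
```

Run inside the sandboxed process, a well-composed setup would be expected to report `seccomp` as `filter` and `has_cap_sys_admin` as `False`; anything else suggests a layer was silently left off.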
Canonical implementations¶
nsjail¶
Google's command-line tool stacks: namespaces (user + pid + mount + net) + filesystem restrictions + cgroups + resource limits + seccomp. Per-invocation. No image, no daemon. Figma's production sandbox for RenderServer's full-featured GPU path.
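To make the per-invocation stacking concrete, here is a hedged sketch that assembles an nsjail command line from a small spec. It is not Figma's configuration. The `JailSpec` class, the paths, and the limits are all hypothetical, and the flag names (`--mode`, `--chroot`, `--bindmount_ro`, `--cgroup_mem_max`, `--cgroup_pids_max`, `--time_limit`, `--seccomp_policy`) are taken from nsjail's `--help` as I understand it, so verify them against your installed version. The sketch relies on nsjail creating new namespaces by default rather than passing flags for them.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class JailSpec:
    """Hypothetical description of one sandboxed invocation."""
    chroot_dir: str
    command: list
    readonly_mounts: list = field(default_factory=list)
    memory_max_bytes: int = 512 * 1024 * 1024
    pids_max: int = 32
    time_limit_s: int = 30
    seccomp_policy_file: str = ""  # path to a Kafel policy file, if any


def nsjail_argv(spec: JailSpec) -> list:
    """Build an argv list; nothing is executed here."""
    # Filesystem view: run once ("o" mode) inside a chroot, plus read-only binds.
    argv = ["nsjail", "--mode", "o", "--chroot", spec.chroot_dir]
    for path in spec.readonly_mounts:
        argv += ["--bindmount_ro", path]
    # Resource layer: cgroup limits and a wall-clock limit.
    argv += [
        "--cgroup_mem_max", str(spec.memory_max_bytes),
        "--cgroup_pids_max", str(spec.pids_max),
        "--time_limit", str(spec.time_limit_s),
    ]
    # Syscall layer: optional seccomp policy file.
    if spec.seccomp_policy_file:
        argv += ["--seccomp_policy", spec.seccomp_policy_file]
    return argv + ["--"] + spec.command


if __name__ == "__main__":
    spec = JailSpec(
        chroot_dir="/srv/jail",                  # hypothetical path
        command=["/bin/render", "input.bin"],    # hypothetical workload
        readonly_mounts=["/usr/share/fonts"],
        seccomp_policy_file="render.policy",     # hypothetical file
    )
    print(shlex.join(nsjail_argv(spec)))
```

Building the argv as a list, rather than a shell string, keeps untrusted file names from being reinterpreted by a shell before nsjail ever sees them.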
firejail¶
SUID-based, profile-library-driven sibling of nsjail. Same composition idea; oriented toward desktop-app sandboxing.
Docker / runC with seccomp profile¶
Container runtimes ship a default seccomp profile that filters ~50 rarely used syscalls (`keyctl`, `add_key`, `mount`, etc.). The profile composes with the runtime's default namespace and cgroup setup, and the operator can replace it with a stricter one.
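As an illustration of an operator-supplied profile, the sketch below generates a deny-by-default profile in the JSON layout Docker accepts (`defaultAction`, `architectures`, `syscalls`). The function name `allowlist_profile` and the example syscall list are hypothetical; the list is far too small to start a real container and is only there to show the shape.

```python
import json


def allowlist_profile(allowed, arch="SCMP_ARCH_X86_64"):
    """Return a seccomp profile dict: deny everything, allow the named syscalls."""
    return {
        # Any syscall not listed below fails with an errno instead of running.
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": [arch],
        "syscalls": [
            {"names": sorted(set(allowed)), "action": "SCMP_ACT_ALLOW"},
        ],
    }


if __name__ == "__main__":
    # Hypothetical, deliberately tiny allowlist for illustration only.
    profile = allowlist_profile(["read", "write", "exit_group", "futex", "read"])
    print(json.dumps(profile, indent=2))
```

A file written from this output would be passed as `docker run --security-opt seccomp=/path/to/profile.json ...`. In practice it is usually safer to start from the runtime's default profile and remove entries than to build an allowlist from nothing.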
Configuration discipline¶
A composition is only as strong as its most weakly configured layer. Figma's framing:
"Unlike commodity VM solutions, containers place a much greater responsibility on the user to correctly configure the desired level of isolation. More control over security configuration also means more room to make mistakes."
Operator foot-guns that break composition:
- `--privileged` removes most isolation.
- `--cap-add SYS_ADMIN` restores a very broad capability.
- `--security-opt seccomp=unconfined` disables seccomp entirely.
- Mounting the host Docker socket gives any compromised process in the container a path to escalate out of it.
- Host network (`--net=host`) collapses the network namespace.
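The foot-guns listed above are mechanical enough to lint for. The following is a rough sketch of such a check over a `docker run` argument list, using the flags in their `docker run` spellings. The function names are hypothetical, and the parsing is deliberately naive: it scans every token, including anything after the image name, and does not know about Compose files or daemon-level defaults.

```python
# Flags that take their value as the next token when not written as --flag=value.
VALUE_FLAGS = {"--cap-add", "--security-opt", "-v", "--volume", "--mount",
               "--net", "--network"}


def split_flags(argv):
    """Yield (flag, value) pairs for both '--flag=value' and '--flag value'."""
    i = 0
    while i < len(argv):
        tok = argv[i]
        if tok.startswith("--") and "=" in tok:
            flag, value = tok.split("=", 1)
        elif tok in VALUE_FLAGS and i + 1 < len(argv):
            flag, value = tok, argv[i + 1]
            i += 1  # consume the value token as well
        else:
            flag, value = tok, ""
        yield flag, value
        i += 1


def audit_docker_run(argv):
    """Return a list of human-readable findings (empty if none matched)."""
    findings = []
    for flag, value in split_flags(argv):
        cap = value.upper()
        if cap.startswith("CAP_"):
            cap = cap[4:]
        if flag == "--privileged" and value.lower() != "false":
            findings.append("--privileged removes most isolation")
        elif flag == "--cap-add" and cap in {"SYS_ADMIN", "ALL"}:
            findings.append("--cap-add " + cap + " restores a very broad capability")
        elif flag == "--security-opt" and value == "seccomp=unconfined":
            findings.append("seccomp=unconfined disables seccomp entirely")
        elif flag in {"-v", "--volume", "--mount"} and "docker.sock" in value:
            findings.append("host Docker socket is mounted into the container")
        elif flag in {"--net", "--network"} and value == "host":
            findings.append("host networking collapses the network namespace")
    return findings
```

A check like this fits in CI or a deploy wrapper; it catches the obvious mistakes but is no substitute for reviewing the effective configuration of the running container.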
When to use this pattern over seccomp-only¶
- Workload is not source-modifiable (can't apply the refactor pattern).
- Workload needs dynamic file / network access during the untrusted-input phase — seccomp-only can't filter those, but a mount/network namespace can mask what's visible.
- Workload is commodity (third-party binaries, scripts from untrusted source) — you need coarse containment, not fine-grained syscall curation.
- Multiple independent workloads on the same host need pairwise isolation — namespaces + cgroups are the primitives that give that.
When refactor-for-seccomp is preferable¶
- You own the source.
- Performance matters: the composition-layer sandbox (nsjail etc.) has measurable per-invocation overhead (tens to low hundreds of milliseconds of startup), whereas a seccomp filter installed once for the lifetime of the process adds no such startup cost.
- Operational simplicity matters — one filter, one program, one binary.
Figma uses both patterns in RenderServer: nsjail for the GPU path where the program couldn't be refactored tightly enough (dynamic font / image loading needed during render); seccomp-only refactor for the non-GPU path where they could.
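The per-invocation overhead mentioned above is easy to measure for a given workload rather than assume. A minimal sketch, with hypothetical function names: time the bare command and the wrapped command several times and compare medians. The nsjail wrapper prefix shown in the usage comment is an assumption, not a recommended configuration.

```python
import statistics
import subprocess
import sys
import time


def median_wall_ms(argv, runs=5):
    """Median wall-clock time, in milliseconds, to run argv to completion."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def wrapper_overhead_ms(wrapper_prefix, command, runs=5):
    """Difference between wrapped and bare medians (can be noisy or negative)."""
    return median_wall_ms(wrapper_prefix + command, runs) - median_wall_ms(command, runs)


if __name__ == "__main__":
    command = [sys.executable, "-c", "pass"]
    print("bare median: %.1f ms" % median_wall_ms(command))
    # Hypothetical wrapper; requires nsjail to be installed and permitted:
    # print(wrapper_overhead_ms(["nsjail", "--mode", "o", "--chroot", "/", "--"], command))
```

Medians over a handful of runs smooth out page-cache and scheduler noise; for a decision that matters, measure with the real workload and the real sandbox configuration.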
Related patterns¶
- patterns/refactor-for-seccomp-filter: the narrow-but-sharp alternative when you own the code.
Seen in¶
- sources/2026-04-21-figma-server-side-sandboxing-containers-and-seccomp — canonical framing; Figma explicitly names nsjail + firejail as composition examples and contrasts them with the refactor-then-seccomp-only approach.