PATTERN Cited by 1 source

Seccomp-bpf container composition¶

Shape¶

Compose the independent Linux isolation primitives — namespaces + cgroups + capability dropping + seccomp-bpf — into a single sandbox. No one primitive is sufficient; each defends a different axis, and a compromise of one is not automatically a compromise of the others (defence-in-depth at the kernel-primitive layer).

┌──────────────────────────────────────────────────┐
│ nsjail / firejail / Docker+seccomp / custom      │
│                                                  │
│  ┌───────────────────────────────────────────┐   │
│  │ namespaces: user + pid + mount + net      │   │   What can the process SEE?
│  ├───────────────────────────────────────────┤   │
│  │ cgroups: CPU / mem / IO / PIDs limit      │   │   How much can it USE?
│  ├───────────────────────────────────────────┤   │
│  │ capability set: drop CAP_SYS_ADMIN, ...   │   │   What POWER does it have?
│  ├───────────────────────────────────────────┤   │
│  │ seccomp-bpf syscall allowlist             │   │   Which KERNEL CODE can it reach?
│  ├───────────────────────────────────────────┤   │
│  │ (optional) LSM: SELinux / AppArmor        │   │   What FILE LABELS?
│  └───────────────────────────────────────────┘   │
└──────────────────────────────────────────────────┘

(Source: sources/2026-04-21-figma-server-side-sandboxing-containers-and-seccomp)

Why compose¶

Each primitive has a specific limit that the others cover:

Seccomp alone cannot filter openat by path: a BPF filter sees only the syscall number and raw argument values, and cannot dereference the pointer to the path string. So add a mount namespace that makes only safe files visible.
Mount namespace alone cannot prevent a process from calling socket() and reaching network services — so add a network namespace with no interfaces.
Network + mount namespaces alone cannot stop a CPU bomb or memory hog, so add cgroups with resource limits.
Cgroups alone cannot prevent a compromised process from reaching kernel bugs through obscure syscalls — so add a seccomp allowlist excluding the obscure ones.

Figma's framing:

"Seccomp can be combined with containerization to provide robust, multilayered sandbox-focused systems, such as nsjail and firejail."

No single layer is assumed perfect; compromise requires defeating multiple independent kernel mechanisms.
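Because the layers are independent, it can help to check from inside the sandbox which of them are actually in force. The following is a minimal sketch, not part of the Figma source: it parses the `Seccomp`, `NoNewPrivs`, and `CapEff` fields that Linux exposes in `/proc/<pid>/status`. The function name `summarize_status` and the returned dictionary keys are illustrative assumptions.

```python
from pathlib import Path

# Values of the "Seccomp:" field in /proc/<pid>/status.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}
CAP_SYS_ADMIN_BIT = 21  # bit position of CAP_SYS_ADMIN in the capability mask


def summarize_status(status_text: str) -> dict:
    """Summarise sandbox-relevant fields from /proc/<pid>/status text."""
    fields = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    caps = int(fields.get("CapEff", "0"), 16)  # hex bitmask of effective caps
    return {
        "seccomp": SECCOMP_MODES.get(int(fields.get("Seccomp", "0")), "unknown"),
        "no_new_privs": fields.get("NoNewPrivs") == "1",
        "has_cap_sys_admin": bool((caps >> CAP_SYS_ADMIN_BIT) & 1),
    }


if __name__ == "__main__":
    status = Path("/proc/self/status")
    if status.exists():  # Linux only
        print(summarize_status(status.read_text()))
```

Running this as the sandboxed workload's first step gives a cheap sanity check that the seccomp filter is installed and that CAP_SYS_ADMIN was dropped. It says nothing about namespaces or cgroups, which would need separate checks.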

Canonical implementations¶

nsjail¶

Google's command-line tool that stacks namespaces (user + pid + mount + net), filesystem restrictions, cgroups, resource limits, and seccomp. It is configured per invocation, with no image and no daemon. Figma's production sandbox for RenderServer's full-featured GPU path.
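To make the per-invocation model concrete, here is a hedged sketch of a wrapper that assembles an nsjail command line. It is not Figma's configuration. The flags used (`--mode`, `--chroot`, `--time_limit`, `--cgroup_mem_max`, `--cgroup_pids_max`, `--seccomp_policy`) are taken from nsjail's command-line help and should be verified against the installed version; the helper name `nsjail_argv`, the paths, and the limit values are hypothetical.

```python
import shlex


def nsjail_argv(command, chroot_dir, policy_file, *,
                time_limit_s=10,
                mem_max_bytes=256 * 1024 * 1024,
                pids_max=32):
    """Build an argv list that runs `command` once inside nsjail.

    nsjail creates new namespaces by default, so no flag is needed to
    enable them; the flags below add the remaining layers.
    """
    return [
        "nsjail",
        "--mode", "o",                            # run the command once, then exit
        "--chroot", chroot_dir,                   # filesystem view (mount namespace root)
        "--time_limit", str(time_limit_s),        # wall-clock limit in seconds
        "--cgroup_mem_max", str(mem_max_bytes),   # cgroup memory ceiling
        "--cgroup_pids_max", str(pids_max),       # cgroup process-count ceiling
        "--seccomp_policy", policy_file,          # path to a Kafel seccomp policy
        "--",
        *command,
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    argv = nsjail_argv(["/bin/render", "input.bin"],
                       "/srv/jail", "/etc/sandbox/render.kafel")
    print(shlex.join(argv))
```

Building the argument list in one place keeps every layer visible in a single reviewable function, which matters because nothing persists between invocations.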

firejail¶

SUID-based, profile-library-driven sibling of nsjail. Same composition idea; oriented toward desktop-app sandboxing.

Docker / runC with seccomp profile¶

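Container runtimes ship a default seccomp profile that blocks ~50 rarely-used or privileged syscalls (keyctl, add_key, mount, etc.). The profile is composed with the runtime's default namespace + cgroup setup, and the operator can replace it with a custom profile (in Docker, via --security-opt seccomp=<profile.json>).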
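As an illustration of what a custom profile looks like, the sketch below generates a strict allowlist in Docker's seccomp profile JSON format (`defaultAction`, `architectures`, `syscalls`). The helper name `seccomp_allowlist_profile` is an assumption, and the syscall list in the usage example is deliberately tiny and far too small for a real workload.

```python
import json


def seccomp_allowlist_profile(allowed_syscalls, arch="SCMP_ARCH_X86_64"):
    """Return a Docker-style seccomp profile that denies everything
    except the named syscalls."""
    return {
        # Any syscall not listed below fails with an errno.
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": [arch],
        "syscalls": [
            {
                "names": sorted(set(allowed_syscalls)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


if __name__ == "__main__":
    # Illustrative list only; a real program needs many more syscalls.
    profile = seccomp_allowlist_profile(
        ["read", "write", "exit_group", "brk", "mmap"]
    )
    print(json.dumps(profile, indent=2))
```

The resulting JSON would be written to a file and passed with `docker run --security-opt seccomp=profile.json`. Starting from the runtime's default profile and removing syscalls is usually a more practical route than building an allowlist from scratch.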

Configuration discipline¶

Composition only holds if each layer is configured correctly; a single misconfigured layer can undo the protection of the others. Figma's framing:

"Unlike commodity VM solutions, containers place a much greater responsibility on the user to correctly configure the desired level of isolation. More control over security configuration also means more room to make mistakes."

Operator foot-guns that break composition:

--privileged removes most isolation.
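--cap-add SYS_ADMIN restores a very broad capability (mounting, namespace manipulation, and many other administrative operations).
--security-opt seccomp=unconfined disables seccomp entirely.
Mounting the host Docker socket gives any compromised process in the container a way to escalate to the host.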
Host network (--net=host) collapses the network namespace.
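The foot-guns above are mechanical enough to lint for. The following is a minimal sketch of such a check over a `docker run` argument list; the function name `audit_docker_args` and the finding messages are illustrative assumptions, and the matching is intentionally simple rather than a full parser of Docker's CLI.

```python
def audit_docker_args(argv):
    """Return a list of findings for isolation-breaking docker run flags."""
    # Normalise "--flag=value" into two tokens: "--flag", "value".
    tokens = []
    for arg in argv:
        if arg.startswith("--") and "=" in arg:
            tokens.extend(arg.split("=", 1))
        else:
            tokens.append(arg)

    findings = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok == "--privileged" and nxt.lower() != "false":
            findings.append("--privileged removes most isolation")
        elif (tok == "--cap-add"
              and nxt.upper().removeprefix("CAP_") in {"SYS_ADMIN", "ALL"}):
            findings.append(f"--cap-add {nxt} restores a broad capability")
        elif (tok == "--security-opt"
              and nxt.replace(":", "=", 1) == "seccomp=unconfined"):
            findings.append("seccomp is disabled")
        elif tok in ("--net", "--network") and nxt == "host":
            findings.append("host network collapses the network namespace")
        elif tok in ("-v", "--volume", "--mount") and "docker.sock" in nxt:
            findings.append("host Docker socket is mounted")
    return findings


if __name__ == "__main__":
    risky = ["run", "--privileged", "--net=host", "alpine"]
    for finding in audit_docker_args(risky):
        print("WARNING:", finding)
```

A check like this fits naturally into CI for deployment manifests. It only catches the flags listed here, so it complements review rather than replacing it.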

When to use this pattern over seccomp-only¶

Workload is not source-modifiable (can't apply the refactor pattern).
Workload needs dynamic file / network access during the untrusted-input phase — seccomp-only can't filter those, but a mount/network namespace can mask what's visible.
Workload is commodity (third-party binaries, scripts from untrusted source) — you need coarse containment, not fine-grained syscall curation.
Multiple independent workloads on the same host need pairwise isolation — namespaces + cgroups are the primitives that give that.

When refactor-for-seccomp is preferable¶

You own the source.
Performance matters: the composition-layer sandbox (nsjail etc.) has measurable per-invocation overhead (tens to low hundreds of milliseconds of startup), whereas a seccomp filter installed once for the lifetime of the process adds no per-invocation startup cost.
Operational simplicity matters — one filter, one program, one binary.
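The startup-overhead claim is easy to check on your own hardware. Below is a rough sketch of a timing harness that measures median wall-clock time for launching a command; the function name `median_startup_ms` is an assumption, and the baseline command is just the Python interpreter doing nothing.

```python
import statistics
import subprocess
import sys
import time


def median_startup_ms(argv, runs=5):
    """Median wall-clock time in milliseconds to run `argv` to completion."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True,
                       stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


if __name__ == "__main__":
    baseline = [sys.executable, "-c", "pass"]
    print(f"baseline: {median_startup_ms(baseline):.1f} ms")
    # To estimate sandbox overhead, time the same command wrapped in the
    # sandbox launcher and subtract the baseline.
```

Comparing the wrapped and unwrapped medians gives a per-invocation cost to weigh against the refactoring effort of a seccomp-only design.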

Figma uses both patterns in RenderServer: nsjail for the GPU path where the program couldn't be refactored tightly enough (dynamic font / image loading needed during render); seccomp-only refactor for the non-GPU path where they could.

patterns/refactor-for-seccomp-filter: the narrow-but-sharp alternative when you own the code.

Seen in¶

sources/2026-04-21-figma-server-side-sandboxing-containers-and-seccomp — canonical framing; Figma explicitly names nsjail + firejail as composition examples and contrasts them with the refactor-then-seccomp-only approach.