CONCEPT Cited by 10 sources
Control plane / data plane separation¶
Architectural split between the "decide" path (control plane: validation, authorization, policy, rollout decisions, scheduling) and the "deliver" path (data plane: storage, distribution, serving the actual bytes at scale). Established pattern in networking (SDN, Envoy xDS), Kubernetes (API server vs kubelets), and service mesh, now common in config platforms.
Why separate them¶
- Independent evolution. Rollout strategy changes (e.g., adding a new canary policy) don't touch the storage/distribution subsystem; storage changes don't force control-plane changes.
- Different scaling profiles. Data-plane traffic is usually orders of magnitude higher than control-plane traffic — bundling them forces the control plane to scale with request load it doesn't actually need.
- Different failure semantics. Control-plane outages should leave the data plane still serving last-known-good state. Config/feature-flag platforms explicitly rely on this: sidecars and local caches keep services running when the control plane is unavailable.
- Different blast radius. A bad control-plane decision affects new rollouts; a bad data-plane change affects every running client.
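The failure-semantics point above can be sketched as a data-plane cache that keeps answering from the last snapshot it fetched successfully when the control plane cannot be reached. This is a minimal Python sketch under stated assumptions, not code from any platform cited on this page: `Snapshot`, `DataPlaneCache`, `ControlPlaneUnavailable`, and the `fetch` callable are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional


@dataclass(frozen=True)
class Snapshot:
    """A versioned bundle of config published by the (hypothetical) control plane."""
    version: int
    flags: Dict[str, Any]


class ControlPlaneUnavailable(Exception):
    """Raised by the fetch callable when the control plane cannot be reached."""


class DataPlaneCache:
    """Data-plane side: reads are always served from the last good snapshot."""

    def __init__(self, fetch: Callable[[], Snapshot]) -> None:
        self._fetch = fetch
        self._last_good: Optional[Snapshot] = None

    def refresh(self) -> bool:
        """Try to pull a newer snapshot; return False if the control plane is down."""
        try:
            snapshot = self._fetch()
        except ControlPlaneUnavailable:
            # Keep serving whatever we already have.
            return False
        # Ignore stale snapshots so a lagging control-plane replica cannot roll us back.
        if self._last_good is None or snapshot.version > self._last_good.version:
            self._last_good = snapshot
        return True

    def get(self, key: str, default: Any = None) -> Any:
        """Serve a read without ever calling the control plane."""
        if self._last_good is None:
            return default
        return self._last_good.flags.get(key, default)
```

In this sketch `get` never touches the network, so a control-plane outage only delays new changes. A real sidecar would typically also persist the snapshot to disk so that a restart during an outage does not lose it; that part is omitted here.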
Seen in¶
- sources/2026-02-18-airbnb-sitar-dynamic-configuration — Airbnb Sitar explicitly frames its architecture as a control-plane ("decide") vs data-plane ("deliver") split so rollout strategies and storage/delivery can evolve independently.
- sources/2025-05-27-allthingsdistributed-aurora-dsql-rust-journey — systems/aurora-dsql shipped an initial split (Kotlin control plane, Rust data plane) and later retracted it in favor of unified Rust. See contradiction section below.
- sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing — Databricks' proxyless service-mesh architecture is an explicit control/data-plane split: a custom xDS Endpoint Discovery Service (systems/databricks-endpoint-discovery-service) = control plane (watches Kubernetes, streams topology); Armeria-embedded client libraries + Envoy ingress = data planes. Same control-plane feeds two distinct data-plane consumers (internal RPC + edge proxy) via the same concepts/xds-protocol.
- sources/2026-01-13-databricks-open-sourcing-dicer-auto-sharder — systems/dicer is an explicit control-plane / data-plane split at the sharding tier: Assigner = control plane (consumes health + load signals, publishes Assignments); Slicelet (server-side library) + Clerk (client-side library) = data planes. Like Sitar and EDS, the data plane caches its last-known assignment locally so it keeps serving during control-plane hiccups. The Assigner's multi-tenancy (one Assigner per region serves many Targets) is a direct control-plane scaling lever.
- sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development — remote-access variant: AWS Systems Manager's SSM Session Manager is an IAM-authenticated control plane that orchestrates SSH-over-SSM tunnels; actual session bytes flow over a separate data plane (relay → agent-on-target). SageMaker AI's StartSession API is layered on this separation. See systems/aws-systems-manager + patterns/secure-tunnel-to-managed-compute.
- sources/2024-10-28-dropbox-robinhood-in-house-load-balancing — Robinhood is a load-balancing-specific control/data split: LBS = control plane (per-node PID controllers compute endpoint weights); ZK/etcd routing DB = eventually-consistent data-plane handoff; Envoy / gRPC clients = data planes doing weighted-RR per request. Adds two design notes the other examples don't emphasize: (a) a fanout-reducing proxy tier sits between the data plane and the control plane so the LBS doesn't hold O(nodes × services) TLS connections, and (b) the control plane is sharded by service via an in-house shard manager — each service has exactly one primary LBS worker to avoid concurrent-write contention. This is stronger than the "one control plane feeds many data planes" framing of Databricks EDS — it's "one control plane per shard of services, fed by a proxy tier, feeds heterogeneous data planes (Envoy, gRPC, edge)."
- sources/2026-04-21-figma-figcache-next-generation-data-caching-platform — systems/figcache applies the split at a caching proxy tier: the Starlark-authored engine-tree configuration (rendered to typed Protobuf at server init) is the control-plane artifact (patterns/starlark-configuration-dsl), and the stateless FigCache fleet executing that tree against upstream Redis clusters is the data plane. Operators express complex runtime behaviors (command-type splitting, key-prefix routing, multi-cluster dispatch, QoS backpressure, inline transformations) exclusively in configuration, without binary redeploys — the control-plane change is a Starlark program rev, not a server release. Companion to Airbnb Sitar as a "config-platform"-shape application of the split; distinguished by Starlark as the authoring surface (vs YAML + dynamic config fetches) and by the engine-tree / Protobuf-rendered config shape (vs flat K/V config).
- sources/2026-02-26-aws-santander-catalyst-platform-engineering — Santander Catalyst applies the split at infrastructure-provisioning tier — the first wiki instance of the pattern at this layer. A single EKS cluster is the explicit control plane ("the brain of the operation, orchestrating all components and workflows") hosting three sub-components: data-plane claims (ArgoCD / concepts/gitops), policies catalog (OPA Gatekeeper), and stacks catalog (Crossplane XRDs + Compositions via patterns/crossplane-composition). The data plane is the actual provisioned AWS (and, by Crossplane's design, multi-cloud) resources running tenant workloads — RDS instances, Lambda functions, Step Functions workflows, Databricks integrations. Canonical distinguishing property vs all prior wiki instances: both the control plane and the data plane here are infrastructure, not request-handling traffic — Catalyst's control plane decides which resources to create, its data plane is the running resources. Sibling shape to Kubernetes' own control-plane-vs-kubelet split, now recursively applied on top of EKS to manage cloud resources beyond K8s itself.
- sources/2026-02-24-pinterest-piqama-pinterest-quota-management-ecosystem — Piqama applies the split to quota management: the platform itself is the control plane (REST + Thrift portal + CRUD + authorization + validation + dispatch + auto-rightsizing + chargeback integration); the data plane is application-specific via pluggable enforcement hooks — Yunikorn queue configs for Moka capacity quotas, or the Service-Protection Framework (SPF) in-process library driving local rate-limit decisions for TiDB + KV Stores. Canonical distinguishing property: data-plane enforcement mechanism is itself pluggable per integration — one control plane serves two structurally-different quota kinds (capacity vs rate-limit) with different schedulers / libraries. The control plane also writes back to itself via a telemetry loop (Iceberg on S3 → separate rightsizing service → Piqama API), completing a closed-loop control system. Sibling of Sitar / Figcache / EDS in the "config-platform" shape but generalised across enforcement mechanisms rather than fixed to one. Named instance of generic quota management platform + async-centralized quota with local enforcement.
- — PlanetScale canonicalises the split at database-as-a-service tier with an emphasis on the asymmetry of dependency: Max Englander frames the data plane as "The most critical plane, with fewer dependencies than the control plane. Does not depend on the control plane" while the control plane "is less critical than the data plane, and so has more dependencies" — including a PlanetScale database as its own metadata store (a deliberate circular dependency, safe only because the data plane survives control-plane failure). Canonical distinguishing property: more-critical = fewer-dependencies, inverting the naive "more-critical = more-redundant" heuristic. Failure-mode taxonomy: "a hypothetical failure in one of our cloud providers' Docker registry services might impact our ability to create new database instances, but will not impact existing instances' ability to serve queries or store data". Both planes are multi-AZ redundant; the split composes with isolation + static stability + weekly failover drill as the complete reliability framework.
- — Canonical production-test of PlanetScale's control-plane / data-plane separation in a real upstream outage. Phase 1 of the 2025-10-20 AWS us-east-1 incident took PlanetScale's control plane offline for ~2 h 17 min via a four-hop transitive dependency chain (new-branch/resize service → internal secret-distribution → S3 → STS → DynamoDB, DNS misconfiguration upstream). Verbatim outcome: "Throughout this period, no database branches lost capacity or connectivity." The design claim was validated because (a) the data plane was not on the request-path back to the control plane, (b) credentials and routing state were already cached in running data-plane processes, (c) the affected operations were all "create / change" verbs (new branch / resize / config) which could queue until the control plane recovered. First wiki-canonical production-survival proof of the principle — canonicalised as the outcome-shape concepts/control-plane-impact-without-data-plane-impact. Phase 2 shows the opposite shape: an EC2 launch failure plus partial network partitions did partially affect the data plane, because some data-plane workflows (backup, branch creation, resize) implicitly launch EC2.
- sources/2026-05-27-databricks-how-the-lakebase-architecture-stays-resilient-to-cloud-failures — Inversion-corner-case canonicalisation: under agentic / on-demand workloads the start verb on the control plane has data-plane availability requirements. First wiki disclosure of the workload-driven reframe — see concepts/control-plane-as-the-new-data-plane for the dedicated concept and patterns/separate-data-plane-controller-for-hot-path for the architectural pattern. Verbatim: "In monolithic cloud database service architecture, the data plane is the critical part of the service. It's designed for 99.99+% availability and static stability. The control plane matters 'only' for management operations. With agentic and on-demand workloads, the part of the control plane that starts databases is effectively the data plane." The Lakebase architectural reply preserves the structural separation (hot-path subset gets its own service with strict dependencies + DiD + GD) rather than collapsing the split. The inversion-corner-case is distinct from PlanetScale's "data plane = more critical = fewer dependencies" framing — under agentic workloads, the control plane's hot-path subset has the same critical-path-on-customer-request property as the traditional data plane, even though the traditional data plane (page reads / writes from Postgres to storage) remains a separate critical surface. The Lakebase reply is two distinct critical-path surfaces rather than one. Empirical signal: 90% of compute sessions for auto-suspending databases in Neon are <10 minutes; every auto-resume hits the start verb. Validates Englander's "more-critical = fewer-dependencies" heuristic with a new wrinkle: which verb counts as "more-critical" is workload-shape-dependent, not architecturally-fixed.
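The "data plane does not depend on the control plane" property from the PlanetScale entries, and the way a multi-hop chain can hide such a dependency, can be checked mechanically on a dependency graph. The following is a minimal Python sketch, assuming dependencies are available as a plain adjacency mapping; the function names and the component names in the usage note are hypothetical and do not describe any real system's topology.

```python
from typing import Dict, Iterable, Set


def transitive_deps(graph: Dict[str, Iterable[str]], start: str) -> Set[str]:
    """Return every component reachable from `start` by following dependency edges."""
    seen: Set[str] = set()
    stack = list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already expanded; also makes cycles safe
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return seen


def data_plane_violations(
    graph: Dict[str, Iterable[str]],
    data_plane: Iterable[str],
    control_plane: Iterable[str],
) -> Dict[str, Set[str]]:
    """Map each data-plane component to the control-plane components it can reach."""
    control = set(control_plane)
    violations: Dict[str, Set[str]] = {}
    for component in data_plane:
        reached = transitive_deps(graph, component) & control
        if reached:
            violations[component] = reached
    return violations
```

As a usage illustration with made-up names: given `{"query-router": ["credential-cache"], "backup-worker": ["scheduler"], "scheduler": ["provisioning-api"]}`, data plane `["query-router", "backup-worker"]`, and control plane `["provisioning-api"]`, the check reports `backup-worker`, because it reaches the control plane through `scheduler` even though it has no direct edge to it.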
Contradiction: "different languages OK for each plane"¶
A common corollary of this split is language freedom per plane: use a productive managed language (e.g. Java/Kotlin) for the control plane, and a systems language (e.g. Rust) for the latency-sensitive data plane. Airbnb Sitar-style platforms do this easily.
Aurora DSQL shipped this pattern and then reversed it (2024). Reasons, from sources/2025-05-27-allthingsdistributed-aurora-dsql-rust-journey:
- DSQL's control plane does more than CRUD — it drives hot-cluster detection, topology changes, scaling decisions — which means it shares non-trivial logic with the data plane.
- Two languages → no shared library for that logic → Kotlin and Rust versions drift over time, each drift triggers a debug-fix-deploy loop.
- Two languages → no shared simulation tooling → the team can't co-test control + data plane behavior.
- Resolution: rewrite the Kotlin control plane in Rust. End-state p99 tracks p50 closely across the unified system.
Takeaway: control/data split is still correct as an architectural separation, but its language-choice corollary is contingent on how much logic actually needs to be shared. When the control plane is thin and pure-orchestration (Sitar), polyglot works. When the control plane carries domain logic that the data plane also uses (DSQL), language unification beats the productivity win of polyglot.
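The shared-logic argument can be illustrated with a single rule that both planes import. This is a hedged sketch in Python purely for illustration (the source describes a Rust rewrite and does not show code); `is_hot`, `ControlPlane`, and `DataPlaneNode` are hypothetical names, and the hot-cluster rule itself is an assumed stand-in for whatever logic is actually shared.

```python
from typing import Dict, List, Sequence


def is_hot(load_samples: Sequence[float], threshold: float, min_fraction: float = 0.8) -> bool:
    """Shared rule (assumed for illustration): hot if enough samples exceed the threshold."""
    if not load_samples:
        return False
    over = sum(1 for sample in load_samples if sample > threshold)
    return over / len(load_samples) >= min_fraction


class ControlPlane:
    """Decides which clusters to scale, using the shared rule."""

    def plan_scaling(self, cluster_loads: Dict[str, Sequence[float]], threshold: float) -> List[str]:
        return sorted(name for name, loads in cluster_loads.items() if is_hot(loads, threshold))


class DataPlaneNode:
    """Decides locally whether to shed load, using the same shared rule."""

    def should_shed_load(self, recent_loads: Sequence[float], threshold: float) -> bool:
        return is_hot(recent_loads, threshold)


def planes_agree(cluster_loads: Dict[str, Sequence[float]], threshold: float) -> bool:
    """Tiny co-simulation: both planes must classify every cluster identically."""
    planned = set(ControlPlane().plan_scaling(cluster_loads, threshold))
    node = DataPlaneNode()
    return all(
        (name in planned) == node.should_shed_load(loads, threshold)
        for name, loads in cluster_loads.items()
    )
```

Because both classes call the same `is_hot`, `planes_agree` holds by construction. With two languages the rule would exist twice, and a check like this would need a cross-language harness to detect the two copies drifting apart.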
Related¶
- systems/sitar — canonical dynamic-config example
- systems/aurora-dsql — case where the language-per-plane corollary was retracted
- patterns/staged-rollout — lives in the control plane
- patterns/sidecar-agent — typical data-plane edge