CONCEPT Cited by 8 sources

Control plane / data plane separation

Architectural split between the "decide" path (control plane: validation, authorization, policy, rollout decisions, scheduling) and the "deliver" path (data plane: storage, distribution, serving the actual bytes at scale). Established pattern in networking (SDN, Envoy xDS), Kubernetes (API server vs kubelets), and service mesh, now common in config platforms.

Why separate them

  • Independent evolution. Rollout strategy changes (e.g., adding a new canary policy) don't touch the storage/distribution subsystem; storage changes don't force control-plane changes.
  • Different scaling profiles. Data-plane traffic is usually orders of magnitude higher than control-plane traffic — bundling them forces the control plane to scale with request load it doesn't actually need.
  • Different failure semantics. Control-plane outages should leave the data plane still serving last-known-good state. Config/feature-flag platforms explicitly rely on this: sidecars and local caches keep services running when the control plane is unavailable.
  • Different blast radius. A bad control-plane decision affects new rollouts; a bad data-plane change affects every running client.
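The failure-semantics bullet is the one most platforms encode directly in client code: the data plane never depends on a live control plane after startup. A minimal sketch of that shape, with hypothetical names (`ConfigClient`, `ControlPlaneUnavailable` are illustrative, not any platform's real API):

```python
class ControlPlaneUnavailable(Exception):
    """Raised when a control-plane fetch fails (timeout, outage, etc.)."""


class ConfigClient:
    """Data-plane sketch: poll the control plane for config, but keep
    serving the last-known-good snapshot when the fetch fails."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable returning a config snapshot (dict)
        self._snapshot = None    # last-known-good config

    def refresh(self):
        try:
            self._snapshot = self._fetch()
        except ControlPlaneUnavailable:
            # Control-plane outage: retain the stale snapshot and keep serving.
            pass

    def get(self, key, default=None):
        if self._snapshot is None:
            # Cold start is the one case that genuinely needs the control plane.
            raise RuntimeError("no snapshot yet")
        return self._snapshot.get(key, default)
```

The design choice the sketch encodes: a control-plane outage degrades freshness (stale config), not availability, which matches how sidecars and local caches in feature-flag platforms behave.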

Seen in

  • sources/2026-02-18-airbnb-sitar-dynamic-configuration — Airbnb Sitar explicitly frames its architecture as a control-plane ("decide") vs data-plane ("deliver") split so rollout strategies and storage/delivery can evolve independently.
  • sources/2025-05-27-allthingsdistributed-aurora-dsql-rust-journey — systems/aurora-dsql shipped an initial two-language split (Kotlin control plane, Rust data plane) and later rewrote the control plane in Rust for a unified codebase. See the contradiction section below.
  • sources/2025-10-01-databricks-intelligent-kubernetes-load-balancing — Databricks' proxyless service-mesh architecture is an explicit control/data-plane split: a custom xDS Endpoint Discovery Service (systems/databricks-endpoint-discovery-service) = control plane (watches Kubernetes, streams topology); Armeria-embedded client libraries + Envoy ingress = data planes. The same control plane feeds two distinct data-plane consumers (internal RPC + edge proxy) via the same concepts/xds-protocol.
  • sources/2026-01-13-databricks-open-sourcing-dicer-auto-sharder — systems/dicer is an explicit control-plane / data-plane split at the sharding tier: Assigner = control plane (consumes health + load signals, publishes Assignments); Slicelet (server-side library) + Clerk (client-side library) = data planes. Like Sitar and EDS, the data plane caches the last-known assignment locally so it keeps serving during control-plane hiccups. The Assigner's multi-tenancy (one Assigner per region serves many Targets) is a direct control-plane scaling lever.
  • sources/2025-08-06-allthingsdistributed-removing-friction-sagemaker-ai-development — remote-access variant: AWS Systems Manager's SSM Session Manager is an IAM-authenticated control plane that orchestrates SSH-over-SSM tunnels; actual session bytes flow over a separate data plane (relay → agent-on-target). SageMaker AI's StartSession API is layered on this separation. See systems/aws-systems-manager + patterns/secure-tunnel-to-managed-compute.
  • sources/2024-10-28-dropbox-robinhood-in-house-load-balancing — Robinhood is a load-balancing-specific control/data split: LBS = control plane (per-node PID controllers compute endpoint weights); ZK/etcd routing DB = eventually-consistent data-plane handoff; Envoy / gRPC clients = data planes doing weighted-RR per request. Adds two design notes the other examples don't emphasize: (a) a fanout-reducing proxy tier sits between the data plane and the control plane so the LBS doesn't hold O(nodes × services) TLS connections, and (b) the control plane is sharded by service via an in-house shard manager — each service has exactly one primary LBS worker to avoid concurrent-write contention. This is stronger than the "one control plane feeds many data planes" framing of Databricks EDS — it's "one control plane per shard of services, fed by a proxy tier, feeds heterogeneous data planes (Envoy, gRPC, edge)."
  • sources/2026-04-21-figma-figcache-next-generation-data-caching-platform — systems/figcache applies the split at a caching proxy tier: the Starlark-authored engine-tree configuration (rendered to typed Protobuf at server init) is the control-plane artifact (patterns/starlark-configuration-dsl), and the stateless FigCache fleet executing that tree against upstream Redis clusters is the data plane. Operators express complex runtime behaviors (command-type splitting, key-prefix routing, multi-cluster dispatch, QoS backpressure, inline transformations) exclusively in configuration, without binary redeploys — the control-plane change is a Starlark program rev, not a server release. Companion to Airbnb Sitar as a "config-platform"-shape application of the split; distinguished by Starlark as the authoring surface (vs YAML + dynamic config fetches) and by the engine-tree / Protobuf-rendered config shape (vs flat K/V config).
  • sources/2026-02-26-aws-santander-catalyst-platform-engineering — Santander Catalyst applies the split at the infrastructure-provisioning tier — the first wiki instance of the pattern at this layer. A single EKS cluster is the explicit control plane ("the brain of the operation, orchestrating all components and workflows") hosting three sub-components: data-plane claims (ArgoCD / concepts/gitops), policies catalog (OPA Gatekeeper), and stacks catalog (Crossplane XRDs + Compositions via patterns/crossplane-composition). The data plane is the actual provisioned AWS (and, by Crossplane's design, multi-cloud) resources running tenant workloads — RDS instances, Lambda functions, Step Functions workflows, Databricks integrations. Canonical distinguishing property vs all prior wiki instances: both the control plane and the data plane here are infrastructure, not request-handling traffic — Catalyst's control plane decides which resources to create, its data plane is the running resources. Sibling shape to Kubernetes' own control-plane-vs-kubelet split, now recursively applied on top of EKS to manage cloud resources beyond K8s itself.
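Several of the entries above (Databricks EDS, Dicer, Robinhood) share one mechanical shape: a single control plane computing a versioned artifact and pushing it to multiple heterogeneous data-plane consumers, each of which caches the last version it saw. A minimal sketch of that shape — class and method names here are illustrative, not any system's real API:

```python
class Assigner:
    """Control-plane sketch: computes versioned assignments and pushes
    them to every subscribed data-plane consumer (cf. one Assigner
    feeding both server-side and client-side libraries)."""

    def __init__(self):
        self._subscribers = []
        self._version = 0

    def subscribe(self, consumer):
        self._subscribers.append(consumer)

    def publish(self, assignment):
        self._version += 1
        for consumer in self._subscribers:
            consumer.on_assignment(self._version, dict(assignment))


class DataPlaneConsumer:
    """Data-plane sketch: caches the last assignment seen, so routing
    keeps working if the control plane stops publishing."""

    def __init__(self):
        self.version = 0
        self.assignment = {}

    def on_assignment(self, version, assignment):
        if version > self.version:   # ignore stale or duplicate pushes
            self.version = version
            self.assignment = assignment

    def route(self, key):
        return self.assignment.get(key)
```

The version check is what makes the push path safe under retries and reordering; the local `assignment` cache is the last-known-good property the "Seen in" entries keep pointing at.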

⚠️ Contradiction: "different languages OK for each plane"

A common corollary of this split is language freedom per plane: use a productive managed language (e.g. Java/Kotlin) for the control plane, and a systems language (e.g. Rust) for the latency-sensitive data plane. Airbnb Sitar-style platforms do this easily.

Aurora DSQL shipped this pattern and then reversed it (2024). Reasons, from sources/2025-05-27-allthingsdistributed-aurora-dsql-rust-journey:

  • DSQL's control plane does more than CRUD — it drives hot-cluster detection, topology changes, scaling decisions — which means it shares non-trivial logic with the data plane.
  • Two languages → no shared library for that logic → Kotlin and Rust versions drift over time, each drift triggers a debug-fix-deploy loop.
  • Two languages → no shared simulation tooling → the team can't co-test control + data plane behavior.
  • Resolution: rewrite the Kotlin control plane in Rust. End-state p99 tracks p50 closely across the unified system.
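The drift problem above is easiest to see with a concrete piece of shared logic. A hypothetical hot-cluster predicate (name, signature, and thresholds are all invented for illustration): in DSQL's original split, logic of this kind had to exist twice — a Kotlin copy in the control plane and a Rust copy in the data plane — and any tuning applied to one copy but not the other is exactly the drift that triggered the debug-fix-deploy loop. With one language, both planes can call the same function.

```python
def is_hot(cluster_load, p99_latency_ms, *,
           load_threshold=0.8, latency_threshold_ms=50):
    """Hypothetical hot-cluster predicate shared by both planes.

    Control plane: drives topology/scaling decisions.
    Data plane: decides when to shed or redirect load.
    One implementation means a threshold change lands in both planes at once.
    """
    return cluster_load > load_threshold or p99_latency_ms > latency_threshold_ms
```

A single shared function also makes the co-simulation point concrete: one test harness can exercise control-plane and data-plane behavior against the same predicate.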

Takeaway: control/data split is still correct as an architectural separation, but its language-choice corollary is contingent on how much logic actually needs to be shared. When the control plane is thin and pure-orchestration (Sitar), polyglot works. When the control plane carries domain logic that the data plane also uses (DSQL), language unification beats the productivity win of polyglot.
