CONCEPT Cited by 1 source
Cross-cluster networking
Cross-cluster networking is the set of problems that arise when a workload's components end up running in different clusters (each with its own network namespace, security-group boundary, and ingress rules) while the workload's communication pattern still assumes they are co-located. The hard cases are bidirectional patterns, where each side must accept connections initiated by the other.
The Lyft / LyftLearn 2.0 instance
Lyft's Spark interactive workflow ran driver and executors in the same Kubernetes cluster. When LyftLearn 2.0 moved notebooks to SageMaker Studio while keeping executors on EKS, Spark's client-mode bidirectional communication broke:
- Driver (in SageMaker Studio) → EKS API Server → request executor pods. Outbound from the driver side.
- Executor pods (on EKS) → driver's SageMaker Studio Elastic Network Interface. Inbound to the Studio side.
SageMaker Studio's default networking blocked the inbound leg. This was "a fundamental blocker that could jeopardize the entire migration."
Resolution: AWS partnered with Lyft to introduce networking changes to the Studio Domains in Lyft's account that permitted the required inbound EKS→Studio traffic. After the change, Spark performance and interactive UX were unchanged from the single-cluster setup (Source: sources/2025-11-18-lyft-lyftlearn-evolution-rethinking-ml-platform-architecture).
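The inbound leg can be made concrete by looking at which driver endpoints executors dial back to in Spark client mode. The sketch below is illustrative only: the property names are standard Spark configuration keys, but the host address and port numbers are hypothetical placeholders, not values from Lyft's setup. Pinning the ports matters because Spark otherwise picks random ones, which makes a narrow inbound rule impossible to write.

```python
# Sketch only: driver_host and the port numbers are hypothetical placeholders.

def client_mode_conf(driver_host, driver_port=7078, block_manager_port=7079):
    """Spark properties that pin the endpoints executors connect back to."""
    return {
        "spark.submit.deployMode": "client",
        # Address executors use to reach the driver (the notebook side).
        "spark.driver.host": driver_host,
        # Interface the driver listens on inside its own environment.
        "spark.driver.bindAddress": "0.0.0.0",
        # Fixed RPC port; Spark chooses a random one if this is unset.
        "spark.driver.port": str(driver_port),
        # Fixed port for the driver's block manager.
        "spark.driver.blockManager.port": str(block_manager_port),
    }


def inbound_ports(conf):
    """Ports that must accept connections from executor pods."""
    keys = ("spark.driver.port", "spark.driver.blockManager.port")
    return sorted(int(conf[k]) for k in keys)
```

Calling `inbound_ports(client_mode_conf("10.0.0.5"))` gives the exact set of TCP ports the notebook side has to admit from the executor cluster, which is the traffic that Studio's default networking rejected.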
Why managed services make this harder
Self-hosted clusters expose low-level network primitives (VPC routes, security groups, CNI plugins) that operators can configure to admit specific cross-cluster flows. Managed services — SageMaker Studio, App Runner, Dataflow, managed notebooks — hide these primitives behind a narrower configuration surface. If the managed service's defaults don't admit the needed flow, the customer can hit a wall that requires provider-side intervention rather than customer-configurable fixes. Lyft's outcome is a concrete instance: AWS had to introduce networking changes Lyft couldn't configure themselves.
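For contrast, this is roughly what "operators can configure" looks like on a self-hosted cluster in AWS: a security-group rule that admits TCP from the executor cluster's security group on a fixed port range. It is a sketch under assumptions: the security-group IDs and ports are hypothetical, and the function only builds the rule payload. In Lyft's case the equivalent change on the Studio side was not customer-configurable.

```python
# Sketch only: group IDs and port numbers below are hypothetical.

def ingress_rule(source_sg, first_port, last_port):
    """Build one IpPermissions entry admitting TCP from another security group."""
    if not (0 < first_port <= last_port <= 65535):
        raise ValueError("invalid port range")
    return {
        "IpProtocol": "tcp",
        "FromPort": first_port,
        "ToPort": last_port,
        # Admit traffic by source security group rather than by CIDR,
        # so pod IP churn on the executor side does not matter.
        "UserIdGroupPairs": [
            {"GroupId": source_sg, "Description": "executor pods to driver"}
        ],
    }


rule = ingress_rule("sg-executors-example", 7078, 7079)

# With boto3 the payload would be applied roughly like this (not executed here):
#   ec2 = boto3.client("ec2")
#   ec2.authorize_security_group_ingress(
#       GroupId="sg-driver-example", IpPermissions=[rule])
```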
Generalisable shape
Cross-cluster networking issues tend to share:
- A bidirectional communication pattern whose client-mode design implicitly assumed co-location.
- A managed service on one side whose network defaults restrict inbound connections.
- A migration or architecture change that separates the components into different network trust boundaries.
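Addressing them requires either (a) making the connection one-sided, so that only the notebook side dials out (for example a Spark Connect server or Spark cluster deploy mode, where the driver runs next to the executors, or a proxy in the middle), or (b) explicit network-path configuration (often in partnership with the managed service's provider).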
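Whichever option is chosen, it helps to verify the path from the side that initiates the connection. A minimal sketch of such a check, assuming it is run from an executor-side pod against the driver's address and pinned port (both hypothetical inputs here); it only tests that a TCP connection opens, not that Spark's protocol succeeds over it:

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unroutable addresses.
        return False
```

A `False` result from the executor side while the driver is listening points at the network boundary (security groups, service defaults) rather than at Spark configuration.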
Seen in
- sources/2025-11-18-lyft-lyftlearn-evolution-rethinking-ml-platform-architecture — canonical wiki instance: interactive Spark in SageMaker Studio notebook ↔ executors on EKS.