PATTERN
Language rewrite for concurrency
Definition
Language rewrite for concurrency is the architectural pattern of rewriting a performance-critical layer in a language whose concurrency model matches the workload, once the current language's concurrency primitives have become the bottleneck, while keeping orchestration, definitions, and control logic in the original language.
Not a whole-system rewrite. Layer-targeted. The cost is scoped to the slice that's actually capped, while the broader codebase retains the ergonomic / ecosystem advantages of the starting language.
When the pattern fires
- Python → Go for mixed CPU + I/O request-serving. Trigger: GIL contention on CPU-bound work under high concurrency, with multi-processing adding coordination overhead rather than fixing it.
- JVM → Rust for latency-sensitive data-plane work. Trigger: JVM GC pauses tail into the seconds under fanout, and the tail-at-scale math (P(some host mid-pause | fanout=N) → 1) makes percentile SLOs unachievable.
- Node.js → Go / Rust for CPU-bound request handling where the event loop becomes the bottleneck (not directly in this wiki but the shape is identical).
What the rewrite delivers is not raw speed — it's a concurrency model (goroutines, async/await + zero-cost abstractions, actors) that fits the workload without paying the original language's concurrency-primitive cost.
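A minimal sketch of the concurrency-model fit for the Python → Go case: one goroutine per request, with the CPU-bound JSON parse running in parallel across cores instead of serialising on a GIL. The payload shape and handler name are illustrative, not Dropbox's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// handle simulates one request: a CPU-bound JSON parse followed, in a
// real service, by an I/O call to the online store. Under the GIL this
// parse step would serialise across workers; goroutines let the runtime
// run it in parallel across cores.
func handle(payload []byte) (map[string]any, error) {
	var features map[string]any
	if err := json.Unmarshal(payload, &features); err != nil {
		return nil, err
	}
	return features, nil
}

func main() {
	payload := []byte(`{"user_id": 42, "score": 0.97}`)
	var wg sync.WaitGroup
	results := make([]map[string]any, 100)
	for i := range results {
		wg.Add(1)
		go func(i int) { // one cheap goroutine per request
			defer wg.Done()
			if r, err := handle(payload); err == nil {
				results[i] = r
			}
		}(i)
	}
	wg.Wait()
	fmt.Println("served", len(results), "requests")
}
```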
Instance 1: Feast-Python → Go at Dropbox Dash (2025)
Feast's Python SDK was the initial feature-serving layer for Dash's feature store. Profiling under high concurrency showed CPU-bound JSON parsing + the GIL as the bottleneck. Multi-process workers improved latency "temporarily" but added coordination overhead.
Go rewrite delivers:
- Thousands of req/s per service.
- ~5–10ms overhead on top of the ~20ms Dynovault client latency.
- p95 ~25–35ms end-to-end.
Layer scope: just the serving HTTP/RPC endpoint. Feast (Python) still owns definitions + orchestration; Spark (PySpark) still owns feature computation.
(Source: sources/2025-12-18-dropbox-feature-store-powering-real-time-ai-dash)
Instance 2: JVM → Rust at Aurora DSQL (2025)
Aurora DSQL started with a Kotlin (JVM) codebase and rewrote the data plane (and eventually the control plane too) in Rust. The forcing function was tail-at-scale: a 40-host simulation with 1s GC stalls delivered ~6K TPS and 10s tail against targets of ~1M TPS and 1s tail.
Pilot-component migration: start with the Adjudicator (the simplest data-plane component, and one that already had Rust client libraries). JVM developers rewrote it; the first cut hit 30K TPS versus 2–3K TPS for the years-tuned Kotlin version (~10×). The rest of the data plane followed, and, after the initial "Kotlin control plane / Rust data plane" split began to drift, so did the control plane.
(Source: sources/2025-05-27-allthingsdistributed-aurora-dsql-rust-journey)
Common sub-pattern: profile → language → pilot
Both instances share a three-step shape:
- Profile to localise the cap. Don't reach for a rewrite before the bottleneck is pinpointed. GIL contention on JSON parsing (Dropbox); GC-pause fanout amplification (DSQL).
- Choose the replacement language by concurrency model fit. Not by raw speed. Go's goroutines for mixed CPU + I/O; Rust's ownership + no-GC for latency-predictable data-plane work.
- Pilot on a component with ecosystem tailwind (existing libraries, clean contracts). Dropbox picks the serving layer because Feast's Python/Go boundary is an HTTP API. DSQL picks the Adjudicator because Rust journal-client libraries already existed.
What doesn't survive the rewrite
Both posts are explicit that the rewrite isn't a universal improvement:
- You lose the original language's ecosystem / developer ergonomics / hiring funnel in the rewritten layer.
- You introduce a cross-language boundary that has to be maintained (serialisation format, deployment topology).
- If the rewrite is partial, you now have two codebases with shared domain knowledge that can drift (DSQL explicitly abandoned the Kotlin-control-plane / Rust-data-plane split in favour of full-system Rust for this reason).
When NOT to apply
- The current bottleneck isn't actually the language's concurrency model. (If CPU is idle and latency is high, it's an I/O or downstream-dependency problem, not a GIL / GC problem.)
- The traffic shape won't grow into the current cap. (If you're at 1/10 of the cap and demand is flat, rewrite cost dominates.)
- The team can't carry two languages operationally (on-call coverage, tooling, build systems).
Related
- concepts/gil-contention — the Python-specific forcing function.
- concepts/memory-safety — adjacent language-choice axis (why Rust specifically, when the alternative is also memory-safe via GC).
- patterns/pilot-component-language-migration — how to stage such a rewrite.
- systems/dash-feature-store, systems/aurora-dsql — the two canonical instances.
Seen in
- sources/2025-12-18-dropbox-feature-store-powering-real-time-ai-dash — Python → Go serving rewrite at systems/dash-feature-store.
- sources/2025-05-27-allthingsdistributed-aurora-dsql-rust-journey — JVM → Rust rewrite at systems/aurora-dsql.