
Deep Reinforcement Learning congestion control

Definition

Deep Reinforcement Learning (DRL) congestion control models the sender's pacing decision as a policy learned from experience, rather than as a hand-tuned heuristic like CUBIC or BBR. A reinforcement-learning agent observes network state (throughput, delay, loss) and adjusts the congestion window or sending rate to maximise a reward signal — typically throughput minus penalties for delay and packet loss.
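The observe–reward–act loop above can be sketched as toy code. Everything here is illustrative — the reward weights, the action set, and the placeholder policy are invented for exposition and are not taken from Aurora, Orca, or any real system:

```python
import random

def reward(throughput_mbps: float, delay_ms: float, loss_rate: float,
           a: float = 1.0, b: float = 0.05, c: float = 10.0) -> float:
    """Throughput minus penalties for delay and loss (weights are illustrative)."""
    return a * throughput_mbps - b * delay_ms - c * loss_rate

class TinyAgent:
    """Placeholder policy: picks a multiplicative change to the sending rate."""
    ACTIONS = [0.8, 1.0, 1.25]  # decrease, hold, increase

    def act(self, state):
        # A real DRL CC queries a trained neural network here.
        return random.choice(self.ACTIONS)

agent = TinyAgent()
rate_mbps = 10.0
state = (10.0, 30.0, 0.0)        # observed throughput, delay, loss
rate_mbps *= agent.act(state)     # adjust sending rate
r = reward(*state)                # 10 - 0.05*30 - 10*0 = 8.5
```

The point of the sketch is the shape of the loop, not the numbers: the agent never sees the algorithm designer's assumptions about the network, only the observation and the scalar reward.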

DRL-for-CC is an active research axis on top of user-space QUIC. Zalando's 2024-06 post flags it as "the main concept exploited in the research dedicated for protocol improvements in 5G networks" (Source: sources/2024-06-17-zalando-next-level-customer-experience-with-http3-traffic-engineering).

Canonical framing on the wiki

Zalando names four DRL CC algorithms from the research literature:

  • Aurora — one of the earliest DRL-for-CC systems.
  • Eagle
  • Orca
  • PQB

The wiki does not model each algorithm individually — the canonical framing is that DRL-for-CC is a research axis, not a specific production-ready algorithm. Zalando's post cites lab results showing "higher throughput and round-trip performance under various network settings to compare with competing solutions (e.g. BRR or Remy)." The implementation-to-production transition is still open at the time of the 2024-06 writeup.

Why it's economic now

DRL CC requires the flexibility to evaluate new algorithms in production, which kernel-TCP-era CC did not permit. QUIC's user-space CC mutability is the precondition: an A/B-test-per-algorithm deployment pattern works in user space, not in the kernel.
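A minimal sketch of that A/B pattern, assuming nothing about any real QUIC stack's API: because the congestion controller is just user-space code, an endpoint can deterministically assign each connection to an experiment arm. The bucket names and hashing scheme are hypothetical:

```python
import hashlib

# Hypothetical experiment arms -- in a real deployment these would map to
# actual congestion-controller implementations in the QUIC library.
CC_BUCKETS = {0: "cubic", 1: "bbr", 2: "drl-experimental"}

def pick_cc(conn_id: bytes) -> str:
    """Deterministically assign a connection to a CC experiment arm."""
    digest = hashlib.sha256(conn_id).digest()
    return CC_BUCKETS[digest[0] % len(CC_BUCKETS)]

arm = pick_cc(b"example-connection-id")  # same connection always gets the same arm
```

Determinism matters: a given connection keeps one controller for its lifetime, and per-arm throughput/RTT metrics can be compared offline. None of this is possible when the CC lives in the kernel and is chosen host-wide.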

The 5G driving case

Zalando frames DRL CC as particularly promising for the RAN-bottlenecked 5G environment:

  • Heuristic CCs (NewReno, CUBIC) misinterpret RF loss as congestion.
  • BBR is the best heuristic option for 5G but still has limits.
  • DRL CC can learn the 5G-specific loss / delay distribution directly from experience.

The research claim: a DRL CC trained on 5G-RAN-shaped environments outperforms BBR on that workload because the learned policy is tuned to the specific noise / blockage / handover profile.
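The loss-misattribution problem above can be made concrete with a toy link model (all parameters invented for illustration): RF blockage drops packets at random even when the link is underutilised, so loss alone is not a congestion signal. A loss-based heuristic cuts its rate anyway; a DRL policy trained on many such episodes can learn that loss spikes uncorrelated with queueing are not worth a rate cut.

```python
import random

class ToyRanLink:
    """Toy 5G-RAN-shaped link: congestion loss plus random RF blockage loss."""

    def __init__(self, capacity_mbps=100.0, blockage_p=0.05, blockage_loss=0.2):
        self.capacity = capacity_mbps
        self.blockage_p = blockage_p        # chance a step hits an RF blockage
        self.blockage_loss = blockage_loss  # loss rate during a blockage

    def step(self, send_rate_mbps):
        # Loss from exceeding capacity -- the true congestion signal.
        overflow = max(0.0, send_rate_mbps - self.capacity)
        congestion_loss = overflow / send_rate_mbps
        # Loss from an RF blockage episode -- unrelated to congestion.
        rf_loss = self.blockage_loss if random.random() < self.blockage_p else 0.0
        loss = min(1.0, congestion_loss + rf_loss)
        delivered = send_rate_mbps * (1.0 - loss)
        return delivered, loss, congestion_loss > 0.0
```

With `blockage_p > 0`, a sender pacing well below capacity still sees occasional loss bursts — exactly the observation that makes NewReno and CUBIC back off unnecessarily.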

Open problems

  • Generalisation across networks. A DRL policy trained on one network type (e.g. 5G mid-band) may underperform on another (e.g. wired datacentre). Multi-task / online-learning approaches are active research.
  • Safety / fairness with existing CC. A DRL CC must coexist with TCP-CUBIC flows in the internet today. Unfairness (DRL-agent hogging bandwidth) is an adoption blocker — similar to BBR v1's known CUBIC-coexistence issues.
  • Explainability / debuggability. A heuristic CC's behaviour is auditable by reading the algorithm; a DRL CC's is not. For a CDN running on thousands of POPs, incident-triage capability is load-bearing.
  • Reward shaping. The reward function must balance throughput, delay, and loss — different services want different trade-offs (e.g. video prefers low delay; bulk transfer prefers high throughput).
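The reward-shaping trade-off in the last bullet can be sketched with two invented weight profiles: the same network behaviour scores well for one service and badly for another, so no single reward function trains the "right" policy for all workloads.

```python
# Illustrative weight profiles -- the numbers are invented for exposition.
PROFILES = {
    "video": {"w_tput": 0.5, "w_delay": 0.2,  "w_loss": 5.0},   # delay-sensitive
    "bulk":  {"w_tput": 1.0, "w_delay": 0.01, "w_loss": 5.0},   # throughput-first
}

def shaped_reward(profile, tput_mbps, delay_ms, loss_rate):
    w = PROFILES[profile]
    return (w["w_tput"] * tput_mbps
            - w["w_delay"] * delay_ms
            - w["w_loss"] * loss_rate)

# Pushing harder (more throughput, more queueing delay) looks good to "bulk"
# but bad to "video" -- the two services would learn different policies.
gentle = (20.0, 20.0, 0.0)       # tput, delay, loss
aggressive = (40.0, 120.0, 0.0)
assert shaped_reward("bulk", *aggressive) > shaped_reward("bulk", *gentle)
assert shaped_reward("video", *aggressive) < shaped_reward("video", *gentle)
```

This is why reward shaping is listed as an open problem rather than a tuning detail: the reward function encodes a product decision, not just a networking one.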

Wiki framing

DRL-for-CC is the research axis that QUIC's user-space architecture makes economic, the 5G-RAN bottleneck makes valuable, and heuristic CC's limits make necessary. Zalando presents it as a forward-looking research direction rather than a production system; this wiki entry follows that framing.
