
Deep Reinforcement Learning congestion control

Definition

Deep Reinforcement Learning (DRL) congestion control models the sender's pacing decision as a policy learned from experience, rather than as a hand-tuned heuristic like CUBIC or BBR. A reinforcement-learning agent observes network state (throughput, delay, loss) and adjusts the congestion window or sending rate to maximise a reward signal — typically throughput minus penalties for delay and packet loss.
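The observe–reward–act loop above can be sketched as toy code. Everything here is illustrative — the reward weights, the action set, and the placeholder policy are invented for exposition and are not taken from Aurora, Orca, or any real system:

```python
import random

def reward(throughput_mbps: float, delay_ms: float, loss_rate: float,
           a: float = 1.0, b: float = 0.05, c: float = 10.0) -> float:
    """Throughput minus penalties for delay and loss (weights are illustrative)."""
    return a * throughput_mbps - b * delay_ms - c * loss_rate

class TinyAgent:
    """Placeholder policy: picks a multiplicative change to the sending rate."""
    ACTIONS = [0.8, 1.0, 1.25]  # decrease, hold, increase

    def act(self, state):
        # A real DRL CC queries a trained neural network here.
        return random.choice(self.ACTIONS)

agent = TinyAgent()
rate_mbps = 10.0
state = (10.0, 30.0, 0.0)        # observed throughput, delay, loss
rate_mbps *= agent.act(state)     # adjust sending rate
r = reward(*state)                # 10 - 0.05*30 - 10*0 = 8.5
```

The point of the sketch is the shape of the loop, not the numbers: the agent never sees the algorithm designer's assumptions about the network, only the observation and the scalar reward.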

DRL-for-CC is an active research axis on top of user-space QUIC. Zalando's 2024-06 post flags it as "the main concept exploited in the research dedicated for protocol improvements in 5G networks" (Source: sources/2024-06-17-zalando-next-level-customer-experience-with-http3-traffic-engineering).

Canonical framing on the wiki

Zalando names four DRL CC algorithms from the research literature:

  • Aurora — one of the earliest DRL-for-CC systems.
  • Eagle
  • Orca
  • PQB

The wiki does not model each algorithm individually — the canonical framing is that DRL-for-CC is a research axis, not a specific production-ready algorithm. Zalando's post cites lab results showing "higher throughput and round-trip performance under various network settings to compare with competing solutions (e.g. BRR or Remy)." The implementation-to-production transition is still open at the time of the 2024-06 writeup.

Why it's economic now

DRL CC requires the flexibility to evaluate new algorithms in production, which kernel-TCP-era CC did not permit. QUIC's user-space CC mutability is the precondition: an A/B-test-per-algorithm deployment pattern works in user space, not in the kernel.
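A minimal sketch of that A/B pattern, assuming nothing about any real QUIC stack's API: because the congestion controller is just user-space code, an endpoint can deterministically assign each connection to an experiment arm. The bucket names and hashing scheme are hypothetical:

```python
import hashlib

# Hypothetical experiment arms -- in a real deployment these would map to
# actual congestion-controller implementations in the QUIC library.
CC_BUCKETS = {0: "cubic", 1: "bbr", 2: "drl-experimental"}

def pick_cc(conn_id: bytes) -> str:
    """Deterministically assign a connection to a CC experiment arm."""
    digest = hashlib.sha256(conn_id).digest()
    return CC_BUCKETS[digest[0] % len(CC_BUCKETS)]

arm = pick_cc(b"example-connection-id")  # same connection always gets the same arm
```

Determinism matters: a given connection keeps one controller for its lifetime, and per-arm throughput/RTT metrics can be compared offline. None of this is possible when the CC lives in the kernel and is chosen host-wide.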

The 5G driving case

Zalando frames DRL CC as particularly promising for the RAN-bottlenecked 5G environment:

  • Heuristic CCs (NewReno, CUBIC) misinterpret RF loss as congestion.
  • BBR is the best heuristic option for 5G but still has limits.
  • DRL CC can learn the 5G-specific loss / delay distribution directly from experience.

The research claim: a DRL CC trained on 5G-RAN-shaped environments outperforms BBR on that workload because the learned policy is tuned to the specific noise / blockage / handover profile.
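The loss-misattribution problem above can be made concrete with a toy link model (all parameters invented for illustration): RF blockage drops packets at random even when the link is underutilised, so loss alone is not a congestion signal. A loss-based heuristic cuts its rate anyway; a DRL policy trained on many such episodes can learn that loss spikes uncorrelated with queueing are not worth a rate cut.

```python
import random

class ToyRanLink:
    """Toy 5G-RAN-shaped link: congestion loss plus random RF blockage loss."""

    def __init__(self, capacity_mbps=100.0, blockage_p=0.05, blockage_loss=0.2):
        self.capacity = capacity_mbps
        self.blockage_p = blockage_p        # chance a step hits an RF blockage
        self.blockage_loss = blockage_loss  # loss rate during a blockage

    def step(self, send_rate_mbps):
        # Loss from exceeding capacity -- the true congestion signal.
        overflow = max(0.0, send_rate_mbps - self.capacity)
        congestion_loss = overflow / send_rate_mbps
        # Loss from an RF blockage episode -- unrelated to congestion.
        rf_loss = self.blockage_loss if random.random() < self.blockage_p else 0.0
        loss = min(1.0, congestion_loss + rf_loss)
        delivered = send_rate_mbps * (1.0 - loss)
        return delivered, loss, congestion_loss > 0.0
```

With `blockage_p > 0`, a sender pacing well below capacity still sees occasional loss bursts — exactly the observation that makes NewReno and CUBIC back off unnecessarily.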

Open problems

  • Generalisation across networks. A DRL policy trained on one network type (e.g. 5G mid-band) may underperform on another (e.g. wired datacentre). Multi-task / online-learning approaches are active research.
  • Safety / fairness with existing CC. A DRL CC must coexist with TCP-CUBIC flows in the internet today. Unfairness (DRL-agent hogging bandwidth) is an adoption blocker — similar to BBR v1's known CUBIC-coexistence issues.
  • Explainability / debuggability. A heuristic CC's behaviour is auditable by reading the algorithm; a DRL CC's is not. For a CDN running on thousands of POPs, incident-triage capability is load-bearing.
  • Reward shaping. The reward function must balance throughput, delay, and loss — different services want different trade-offs (e.g. video prefers low delay; bulk transfer prefers high throughput).
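The reward-shaping trade-off in the last bullet can be sketched with two invented weight profiles: the same network behaviour scores well for one service and badly for another, so no single reward function trains the "right" policy for all workloads.

```python
# Illustrative weight profiles -- the numbers are invented for exposition.
PROFILES = {
    "video": {"w_tput": 0.5, "w_delay": 0.2,  "w_loss": 5.0},   # delay-sensitive
    "bulk":  {"w_tput": 1.0, "w_delay": 0.01, "w_loss": 5.0},   # throughput-first
}

def shaped_reward(profile, tput_mbps, delay_ms, loss_rate):
    w = PROFILES[profile]
    return (w["w_tput"] * tput_mbps
            - w["w_delay"] * delay_ms
            - w["w_loss"] * loss_rate)

# Pushing harder (more throughput, more queueing delay) looks good to "bulk"
# but bad to "video" -- the two services would learn different policies.
gentle = (20.0, 20.0, 0.0)       # tput, delay, loss
aggressive = (40.0, 120.0, 0.0)
assert shaped_reward("bulk", *aggressive) > shaped_reward("bulk", *gentle)
assert shaped_reward("video", *aggressive) < shaped_reward("video", *gentle)
```

This is why reward shaping is listed as an open problem rather than a tuning detail: the reward function encodes a product decision, not just a networking one.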

Wiki framing

DRL-for-CC is the research axis that QUIC's user-space architecture makes economic, the 5G-RAN bottleneck makes valuable, and heuristic CC's limits make necessary. Zalando presents it as a forward-looking research direction rather than a production system; this wiki entry follows that framing.
