PATTERN Cited by 1 source

Pull-on-demand replacing push

Intent

Replace a push-from-control-plane-to-data-plane provisioning flow with a pull-from-data-plane-to-control-plane flow triggered by demand signals. Eliminate an entire class of delivery-guarantee problems by moving configuration to the data plane only when the data plane actually needs it.

Motivation

Push-based provisioning has a recurring failure mode: the push RPC is lost, so the data plane doesn't have what the control plane thinks it has, and the failure surfaces as "first attempt doesn't work, eventually recovers". Retries, reconciliation loops, and periodic full syncs are the usual workarounds, all of which add complexity and stale-data windows.

Pull-on-demand eliminates the problem:

  • The data plane fetches when it's already committed to needing the data (e.g. a packet just arrived for a peer it doesn't know).
  • No RPC can be "lost" because there's no async RPC in the critical path; the pull is synchronous with the need.
  • Stale state on the data plane is self-correcting: any stale peer config gets re-pulled when the client reconnects.
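The mechanism in these bullets can be sketched as a data-plane cache whose miss path is a synchronous fetch from the control plane. This is a minimal illustration, not Fly.io's implementation; the class and function names are hypothetical.

```python
from typing import Callable, Dict

class PeerConfigCache:
    """Data-plane view of peer configs, populated by pull-on-demand."""

    def __init__(self, fetch: Callable[[str], dict]):
        self._fetch = fetch               # synchronous pull from the control plane
        self._configs: Dict[str, dict] = {}

    def get(self, pubkey: str) -> dict:
        # A miss IS the demand signal (e.g. a packet just arrived for an
        # unknown peer). The pull happens inline with the need, so there is
        # no async push RPC that can be silently dropped.
        if pubkey not in self._configs:
            self._configs[pubkey] = self._fetch(pubkey)
        return self._configs[pubkey]

# Hypothetical stand-in for `GET /peer/:pubkey` against the control plane.
def fetch_from_control_plane(pubkey: str) -> dict:
    return {"pubkey": pubkey, "endpoint": "203.0.113.7:51820"}

cache = PeerConfigCache(fetch_from_control_plane)
cfg = cache.get("abc123")   # first packet for this peer triggers the pull
```

If the entry is later evicted or goes stale, the next miss simply pulls again, which is the self-correcting property the third bullet describes.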

Shape

Before (push):
  Control plane ---- RPC push ----> Data plane
                        |
                    (may drop)
                        |
                        v
                   divergent state

After (pull):
  Packet arrives ----> Data plane
                            |
                            v
              HTTP GET /peer/:pubkey
                            |
                            v
                     Control plane

Canonical instance — Fly.io JIT WireGuard

Fly.io's WireGuard gateways switched from push (over NATS) to pull (internal HTTP) for peer configs:

"Our NATS cluster was losing too many messages to host a reliable API on it. Scaling back our use of NATS made WireGuard gateways better, but still not great. ... Wouldn't it be nice if we just didn't have this problem? What if, instead of pushing configs to gateways, we had the gateways pull them from our API on demand?" (Source: sources/2024-03-12-flyio-jit-wireguard-peers)

And:

"For instance, our internal flyd API used to be driven by NATS; today, it's HTTP." (Source: sources/2024-03-12-flyio-jit-wireguard-peers)

Post-switchover, dropped-message pathologies on the peer provisioning path disappear: there is no push to drop. The pull is a request/response HTTP exchange between the gateway and the Fly GraphQL API; retries, idempotency, and caching are standard HTTP tooling.
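The retry half of that "standard HTTP tooling" point is worth making concrete: because the pull is a read with no side effects on the control plane, retrying it is always safe, unlike re-sending a push whose first delivery status is unknown. A hedged sketch (the function and its parameters are illustrative, not Fly.io's code):

```python
import time

def pull_with_retry(fetch, pubkey, attempts=3, backoff=0.05):
    """Retry an idempotent GET-style pull with exponential backoff.

    Safe to retry blindly: a pull changes nothing on the control plane,
    so a duplicate request is harmless, whereas a duplicated or lost
    push leaves the two sides uncertain about each other's state.
    """
    last_exc = None
    for i in range(attempts):
        try:
            return fetch(pubkey)
        except ConnectionError as exc:
            last_exc = exc
            time.sleep(backoff * (2 ** i))
    raise last_exc
```

Idempotency is what makes this loop trivial; the equivalent retry logic for a push has to answer "did the previous attempt actually apply?" first.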

Second-order benefits

Once the provisioning flow is pull-based, eviction becomes free: any evicted peer that's still wanted will pull again the next time it's needed. This makes a state-eviction cron a safe, trivially cheap way to keep the hot set small, which was the key to sidestepping the kernel's capacity limit on WireGuard peer state.
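The eviction-is-free argument can be shown directly: an idle-TTL sweep can delete entries without coordination, because the pull path transparently repopulates anything still in use. The class, TTL value, and method names below are illustrative assumptions, not Fly.io's implementation.

```python
import time

class EvictingPeerCache:
    """Pull-on-demand cache with an idle-TTL eviction sweep."""

    def __init__(self, fetch, idle_ttl=300.0):
        self._fetch = fetch
        self._idle_ttl = idle_ttl
        self._entries = {}   # pubkey -> (config, last_used timestamp)

    def get(self, pubkey, now=None):
        now = time.monotonic() if now is None else now
        if pubkey not in self._entries:
            # Evicted-but-still-wanted peers land here and just re-pull.
            self._entries[pubkey] = (self._fetch(pubkey), now)
        else:
            cfg, _ = self._entries[pubkey]
            self._entries[pubkey] = (cfg, now)
        return self._entries[pubkey][0]

    def evict_idle(self, now=None):
        # The "state-eviction cron": drop anything idle past the TTL.
        # No correctness analysis needed; a wrong guess costs one re-pull.
        now = time.monotonic() if now is None else now
        stale = [k for k, (_, used) in self._entries.items()
                 if now - used > self._idle_ttl]
        for k in stale:
            del self._entries[k]
        return len(stale)
```

The sweep never has to decide whether a peer is "really" unused; over-eager eviction degrades to one extra pull, which is exactly why keeping the hot set small becomes safe.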

In other words, pull-on-demand doesn't just fix the delivery-guarantee problem; it unlocks the architectural response to the scale problem.

When not to

  • When the data plane can't observe the demand signal. Push is unavoidable when the data-plane side has no way to notice that a given piece of config is needed. (E.g. routing entries installed in anticipation of traffic that arrives with no observable precursor.)
  • When pull latency is unacceptable on the critical path. If the data plane must be ready before the first packet, not after it, push is the only option. JIT doesn't work for every protocol.
  • When the control plane is more overloaded than the data plane. If the control plane is the bottleneck, pushing once to many is cheaper than having many pull once each. Broadcast-style systems don't flip easily to pull.

Seen in
