
PATTERN Cited by 1 source

Event-driven config refresh

Shape

A reactive cache-invalidation pipeline for configuration data that eliminates the TTL-vs-staleness dilemma by pushing updates from the config source to live service instances the moment a change commits — no polling, no restart, no downtime.

Canonical AWS realization (from the multi-tenant-config post):

1. Config write to Parameter Store (or any EventBridge-emitting source)
2. EventBridge rule matches a content filter (e.g. path prefix)
3. Rule fires a Lambda, passing the full change event
4. Lambda extracts the affected tenant/scope from the event payload
5. Lambda queries AWS Cloud Map for healthy service instances for that scope
6. Lambda makes a gRPC refresh call to each instance in parallel
7. Each instance updates its in-memory cache synchronously on the refresh RPC
8. Updated configuration is live across the fleet (within seconds)

(Source: sources/2026-04-08-aws-build-a-multi-tenant-configuration-system-with-tagged-storage-patterns §D)
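Steps 3–6 can be condensed into a single handler sketch. All concrete names here — the `/tenants/<tenant>/...` parameter layout and the injected `discover_healthy` / `send_refresh` helpers — are illustrative assumptions, not details from the source:

```python
# Hypothetical refresh Lambda: extract scope from the change event,
# discover the healthy fleet, fan out refresh calls in parallel.
from concurrent.futures import ThreadPoolExecutor

def tenant_from_event(event: dict) -> str:
    # Parameter Store change events carry the parameter name in detail.name,
    # e.g. "/tenants/acme/feature-flags" -> tenant "acme" (assumed layout).
    return event["detail"]["name"].split("/")[2]

def handler(event: dict, discover_healthy, send_refresh) -> dict:
    """discover_healthy(tenant) -> list of instance addresses;
    send_refresh(addr, tenant) -> bool ack. Injected for testability."""
    tenant = tenant_from_event(event)
    instances = discover_healthy(tenant)
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(
            lambda addr: (addr, send_refresh(addr, tenant)), instances))
    return {
        "tenant": tenant,
        "refreshed": [a for a, ok in results if ok],
        "failed": [a for a, ok in results if not ok],
    }
```

Dependency injection here is just for clarity; a real Lambda would build the Cloud Map and gRPC clients once per container.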

The problem being solved

Traditional configuration-refresh strategies force an unacceptable either/or as tenant counts grow:

  • Polling — services re-read the config source on a schedule, burning API calls and money even when nothing changes, and introducing a staleness window equal to the poll interval (seconds to minutes).
  • TTL-based caches — same staleness window as polling, plus tighter coupling between read latency and refresh freshness.
  • Service restart on change — no staleness, but drops active connections and disrupts user sessions; unacceptable for 24/7 SaaS.

The event-driven path resolves this: updates are reactive, bounded in latency by EventBridge delivery + Lambda cold start + fleet-wide gRPC fan-out (typically single-digit seconds), and the cache is invalidated in place on a live service without dropping in-flight requests.

Key mechanisms

Content-based routing via EventBridge rule. The rule filters on event path / source, so a single bus can carry changes for many config scopes and the Lambda fires only for the relevant subset. Content-based routing is what makes this pattern shine over raw SNS fan-out — consumers subscribe by event shape, not by topic.

Service discovery at refresh time. The Lambda doesn't have a static list of service instances — it queries AWS Cloud Map at refresh time so auto-scaled / replaced / draining instances are handled automatically. The refresh fan-out is always to the current healthy set.
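Discovery-at-refresh-time might look like the following; the namespace, service name, and `tenant` attribute filter are assumptions about the deployment, while the call shape follows boto3's `servicediscovery.discover_instances`:

```python
# Query Cloud Map for the current healthy set at refresh time, so
# auto-scaled / replaced instances are picked up automatically.
def healthy_instances(sd_client, tenant: str) -> list[str]:
    resp = sd_client.discover_instances(
        NamespaceName="services.internal",     # assumed namespace
        ServiceName="config-consumer",         # assumed service
        QueryParameters={"tenant": tenant},    # assumed instance attribute
        HealthStatus="HEALTHY",
    )
    return [i["Attributes"]["AWS_INSTANCE_IPV4"] for i in resp["Instances"]]
```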

Direct gRPC call, not broadcast. The refresh is a point-to-point RPC per instance, not a pub/sub broadcast. This gives synchronous acknowledgment per instance (did the refresh succeed?) and explicit failure handling per instance (retry, circuit-break, log).
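The per-instance acknowledgment and failure handling can be sketched transport-agnostically; `call_refresh` stands in for the real gRPC stub (with deadlines and circuit breaking in production):

```python
# Point-to-point fan-out with explicit per-instance outcomes and a
# bounded retry — illustrative, not a production gRPC client.
def refresh_fleet(instances, call_refresh, retries: int = 1) -> dict:
    status = {}
    for addr in instances:
        ok = False
        for _ in range(retries + 1):
            try:
                ok = call_refresh(addr)   # synchronous ack per instance
            except Exception:
                ok = False                # treat transport errors as failure
            if ok:
                break
        status[addr] = ok                 # explicit success/failure per instance
    return status
```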

In-memory cache update, not restart. The refresh RPC's handler mutates a per-process in-memory map (or equivalent) under a lock, so in-flight requests continue serving the old value until the swap commits, then new reads see the new value. No connection drops; no restart.
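A minimal sketch of that in-place swap, assuming a per-process dict guarded by a writer lock — readers never block, and each read sees either the old or the new value, never a partial update:

```python
import threading

class ConfigCache:
    """Per-process config cache refreshed in place by the refresh RPC."""

    def __init__(self):
        self._lock = threading.Lock()
        self._values: dict[str, object] = {}

    def get(self, key, default=None):
        # Plain dict reads are safe under CPython's GIL; no reader lock.
        return self._values.get(key, default)

    def refresh(self, key, new_value):
        with self._lock:              # serialize writers only
            self._values[key] = new_value   # single-reference swap
```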

Comparison to polling-based refresh

| Axis | Polling | Event-driven |
|---|---|---|
| Staleness window | = poll interval | Bounded by delivery + fan-out latency (seconds) |
| API calls when idle | N services × poll rate | 0 |
| API calls on change | N services × poll rate (amortized) | 1 event + N gRPC refreshes per change |
| Active-connection impact | None | None (in-place cache update) |
| Operational complexity | Low (just a schedule) | Higher (EventBridge rule + Lambda + Cloud Map + gRPC endpoint) |
| Observability | Poll success rate | Event-delivery metrics + Lambda invocations + refresh-RPC status |

Event-driven wins whenever writes are rare compared to reads (most config data) — every poll that finds nothing changed is wasted cost.
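The idle-cost row is worth making concrete. With an illustrative fleet of 200 instances polling every 30 s (both numbers assumed, not from the source):

```python
# Back-of-envelope for "API calls when idle": polling pays per instance
# per interval whether or not anything changed.
instances = 200              # illustrative fleet size
poll_interval_s = 30         # illustrative poll interval
seconds_per_day = 86_400

polling_calls_per_day = instances * (seconds_per_day // poll_interval_s)
event_driven_calls_when_idle = 0   # no change events -> no work at all
```

That is 576,000 reads per day against the config source for zero information, versus none on the event-driven path.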

Related

  • patterns/stateless-invalidator (Figma LiveGraph) — same structural idea at a very different substrate. LiveGraph tails the Postgres WAL per physical shard and emits invalidations over Kafka into cache replicas; the Config Service variant uses EventBridge events from Parameter Store and point-to-point gRPC invalidation. In both, a separate component observes the source of truth, pushes invalidations, and the cache layer stays stateless. The difference is granularity — LiveGraph does per-row / per-query invalidation, the Config Service does per-config-key refresh.
  • concepts/push-based-invalidation — the concept-level framing this pattern implements on AWS managed services.
  • concepts/invalidation-based-cache — the target cache model.

Failure-mode surface

  • EventBridge delivery failures — rare but non-zero; undeliverable events can land in a dead-letter queue. A caller still holding a stale cache sees old data until the next natural refresh (an application-level TTL on the cache entry is the safety net).
  • Lambda cold start on rare-event path — refresh latency spikes when the trigger fires after a quiescent period. Provisioned concurrency mitigates at cost.
  • Cloud Map stale discovery — freshly terminated or draining instances may still appear in the healthy set; the refresh RPC against them fails and is swallowed. Acceptable (the instance is going away anyway).
  • Per-instance refresh failures — partial-fleet refresh is a real state (some instances see new config, others still see old). The pattern doesn't make this atomic; application-level tolerance for brief per-instance divergence is required.
  • Event ordering — EventBridge delivery is at-least-once, not FIFO. Rapid successive writes can fire refreshes out of order; the Lambda should fetch the current value on refresh (not rely on the event payload) to collapse bursts.

Implementation checklist

  1. Config source emits typed change events. Parameter Store ships this natively; other sources (DynamoDB Streams → EventBridge Pipes, custom applications calling PutEvents) need a glue step.
  2. EventBridge rule with content-based filter scoped to the config-service's paths / keys. Avoid catch-all rules — per-event-shape rules localize Lambda firing.
  3. Lambda is stateless and idempotent. Fetches current value from the config source, fans out to Cloud Map instances, refreshes each. Idempotency handles retried events.
  4. Service exposes a refresh RPC endpoint (gRPC / HTTP). Handler authenticates the caller (Lambda's IAM role), looks up the new value, mutates the in-memory cache under a lock.
  5. Cloud Map service registration by every service instance at boot; deregistration on shutdown.
  6. Monitor end-to-end latency: change commit → refresh applied on last instance. This is the SLO for "how stale can my cache be".
  7. Keep an application-level TTL as belt-and-suspenders. If the push path fails silently, the cache eventually refreshes on its own — staleness bounded by TTL, not infinite.
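Checklist item 7 can be sketched as a cache entry that records its fetch time and falls back to re-fetching past the TTL — so a silently failed push bounds staleness at `ttl_s`, not infinity (names and the injectable `clock` are illustrative):

```python
import time

class TtlBackedEntry:
    """Push-refreshed config entry with a TTL fallback as belt-and-suspenders."""

    def __init__(self, fetch, ttl_s=300.0, clock=time.monotonic):
        self._fetch, self._ttl, self._clock = fetch, ttl_s, clock
        self._value, self._at = fetch(), clock()

    def push_refresh(self, new_value):
        # Event-driven path: the refresh RPC hands us the new value.
        self._value, self._at = new_value, self._clock()

    def get(self):
        # Safety-net path: if no push arrived within the TTL, re-fetch.
        if self._clock() - self._at > self._ttl:
            self._value, self._at = self._fetch(), self._clock()
        return self._value
```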

Seen in
