
PATTERN Cited by 1 source

Per-service config aggregator

Per-service config aggregator: when a central infrastructure service is configured by many tenant teams, shard the configuration into one file per tenant service, owned by that tenant's team, and run a lightweight aggregator service that assembles the tenant files into the central service's operational mega-config in real time.

Problem

The obvious first model is one big config file for the infrastructure service, edited directly by the infra team (or by whichever tenant team needs a change). This grows unusable at scale:

  • Blast radius: one tenant's breaking change forces the infra team to roll back, but the current config now contains many other tenants' changes since the last known-good point. Rollback becomes a coordination problem between all tenants.
  • Team load: the infra team becomes an unavoidable intermediary for every tenant change. Every change is rate-limited by infra-team bandwidth, even when it is purely tenant-scoped.
  • Push cadence: pushing the mega-config takes hours across DCs (safety windows). Bundling many changes per push multiplies the risk per push.
  • Authorship clarity: a single file edited by everyone makes the "who owns this knob for which service" question unanswerable without git archaeology.

Pattern

  1. Per-service files. Each tenant service keeps its own config (for the infra service) in a file next to its source code. Team X owns service X's infra-config file; team Y owns Y's. Normal code-review and deployment gates apply.
  2. Aggregator service. A lightweight service watches all per-tenant config files (commit-triggered or filesystem-watching) and computes the mega-config the infrastructure service actually consumes. The aggregator has no domain logic — its only job is merging N inputs into 1 output, with deterministic precedence rules for any shared keys.
  3. Fast incremental rebuild. Change in service X's file triggers a fresh mega-config computation within seconds; minimal blast radius.
  4. Tombstoning on delete. Deleting a per-service file doesn't immediately drop the entry from the mega-config. The aggregator marks the entry tombstoned for a configurable window (days) before reclaiming, so accidental deletes are reversible and race conditions between config sources are avoided when downstream systems still reference the service.
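The merge-plus-tombstone behavior in steps 2–4 can be sketched as follows. This is a minimal illustration, not the implementation described in the source; the function names, dict-shaped configs, and the three-day window are assumptions.

```python
import time

# Hypothetical sketch of the aggregator's merge step: per-tenant configs are
# dicts keyed by service name, and the mega-config is one merged dict.
TOMBSTONE_WINDOW_S = 3 * 24 * 3600  # reclaim after ~3 days (tunable)

def aggregate(tenant_files, previous, tombstones, now=None):
    """Merge N per-service configs into one mega-config.

    tenant_files: {service_name: config} currently present in the repo
    previous:     the last mega-config produced
    tombstones:   {service_name: deletion_timestamp} for removed files
    """
    now = time.time() if now is None else now
    mega = {}
    for service, cfg in sorted(tenant_files.items()):  # sorted => deterministic
        mega[service] = cfg
        tombstones.pop(service, None)  # file is back: cancel any tombstone
    # A deleted file keeps its last-known entry until the window elapses,
    # so an accidental delete is reversible and downstream references survive.
    for service, entry in previous.items():
        if service in mega:
            continue
        deleted_at = tombstones.setdefault(service, now)
        if now - deleted_at < TOMBSTONE_WINDOW_S:
            mega[service] = entry  # tombstoned: still served, pending reclaim
    return mega
```

Rerunning this on every commit gives the fast incremental rebuild of step 3: only the changed tenant's entry differs between consecutive outputs, and reverting a deleted file within the window restores its entry with no downstream gap.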

What this gives you

  • Per-tenant rollback. Revert service X's config file — only service X's behavior changes.
  • Tenant self-service. Service X's team ships infra-config changes on their own release cadence, without infra-team review of each change.
  • Decoupled push cadence. Incremental mega-config updates propagate in seconds; no bundled hours-long pushes.
  • Clear ownership. git blame service-X.yaml identifies owners without cross-service inference.
  • Safer deletes. Tombstoning turns "I accidentally deleted this service's entry" from an outage into a no-op (until the reclaim window expires).

Design considerations

  • Shared-key policy. If two tenant files write to the same logical key (shouldn't happen in well-factored schemas, but does), the aggregator needs a deterministic precedence rule or schema that prevents it.
  • Consistency of the mega-config. Aggregator writes must be atomic from the infra service's perspective — no torn reads seeing half an update.
  • Backups / versioning. If the underlying config store isn't versioned, periodic backups are the minimal compensator for rollback beyond the tombstone window.
  • Validator placement. Schema validation should run at tenant-file commit time (fail before merge) rather than in the aggregator (so bad configs never reach the mega-config).
  • Deterministic rebuild semantics. The aggregator should produce the same mega-config from the same set of tenant files, so that recovery after an infra-service restart converges to a known state.
  • Tombstone window tuning. Too short → still races with downstream configs. Too long → stale entries pollute the config and confuse operators. Days is typical for cross-system references.
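The atomicity consideration above is commonly handled with a write-to-temp-then-rename publish. A minimal sketch, assuming the infra service reads the mega-config from a filesystem path (the function and file names are illustrative):

```python
import json
import os
import tempfile

def publish(mega_config, path):
    """Atomically replace the mega-config file at `path`.

    os.replace is atomic on POSIX, so a concurrent reader sees either the
    old file or the new one in full -- never a torn, half-written mix.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(mega_config, f, sort_keys=True)  # deterministic bytes
            f.flush()
            os.fsync(f.fileno())  # durable before it becomes visible
        os.replace(tmp, path)  # atomic swap into place
    except BaseException:
        os.unlink(tmp)  # never leave a partial temp file behind
        raise
```

Sorting keys also serves the determinism consideration: identical inputs yield byte-identical mega-configs, which makes diffs and convergence checks trivial.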

Variations

  • Git-backed vs. API-backed. Per-service files can live in a monorepo (commit is the write; review is the ship gate) or in a dedicated config-management API. patterns/git-based-config-workflow is the former.
  • Validate vs. compile. Some aggregators just concatenate with validation; others compile (e.g. translate declarative tenant config into a lower-level imperative mega-config). Compile mode gives the infra team more control at the cost of aggregator complexity.
  • Per-environment. Tenant files may have dev/staging/prod sections; aggregator assembles per-environment mega-configs.
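The per-environment variation can be sketched as a small extension of the merge: each tenant file carries optional environment sections plus shared defaults, and the aggregator emits one mega-config per environment. The section names ("defaults", dev/staging/prod) are assumptions for illustration.

```python
# Illustrative per-environment assembly: environment-specific keys override
# a shared "defaults" section within each tenant file.
ENVIRONMENTS = ("dev", "staging", "prod")

def assemble_per_env(tenant_files):
    """Return {env: mega_config}, one assembled mega-config per environment."""
    megas = {env: {} for env in ENVIRONMENTS}
    for service, cfg in sorted(tenant_files.items()):  # deterministic order
        base = cfg.get("defaults", {})
        for env in ENVIRONMENTS:
            merged = {**base, **cfg.get(env, {})}  # env section wins
            if merged:
                megas[env][service] = merged
    return megas
```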

Relationship to other patterns

  • patterns/git-based-config-workflow: orthogonal — the per-service files are usually git-managed, but this pattern is about the sharding dimension, not about the write medium.
  • concepts/ownership: this pattern operationalizes ownership — each tenant service's team owns their config file end-to-end.
  • concepts/control-plane-data-plane-separation: the aggregator + infra service is a control/data split — aggregator decides, infra serves. A bad aggregator change affects new tenant edits; the infra data plane keeps running on last-known-good.
  • patterns/fast-rollback: enables per-tenant fast rollback (single-file revert) where mega-config rollback would be coarse.

Seen in

  • sources/2024-10-28-dropbox-robinhood-in-house-load-balancing — Robinhood's config-aggregator service. Problem stated verbatim: "it's risky to press the rollback button because we don't know how many other services have also made changes since the last push"; "the Robinhood team would have to get involved in every breaking config push — which is a waste of engineering time, since most incidents can be resolved by the service owner"; "each push takes hours to deploy to multiple data centers in order to minimize potential risks." Fix: per-service Robinhood configs, aggregator assembles the mega-config, per-service rollbacks unblock the Robinhood team. Tombstone feature added to handle the race between Robinhood config removal and downstream Envoy configs still referencing the service.