Per-service config aggregator¶
Per-service config aggregator: when a central infrastructure service is configured by many tenant teams, shard the configuration into one file per tenant service, owned by that tenant's team, and run a lightweight aggregator service that assembles the tenant files into the central service's operational mega-config in near-real time.
Problem¶
The obvious first model is one big config file for the infrastructure service, edited directly by the infra team (or by whichever tenant team needs a change). This grows unusable at scale:
- Blast radius: one tenant's breaking change forces the infra team to roll back, but the current config now contains many other tenants' changes since the last known-good point. Rollback becomes a coordination problem between all tenants.
- Team load: the infra team becomes an unavoidable intermediary for every tenant change. Throughput is rate-limited by infra-team bandwidth, even when the change is purely tenant-scoped.
- Push cadence: pushing the mega-config takes hours across DCs (safety windows). Bundling many changes per push multiplies the risk-per-push.
- Authorship clarity: a single file edited by everyone makes the "who owns this knob for which service" question unanswerable without git archaeology.
Pattern¶
- Per-service files. Each tenant service keeps its own config (for the infra service) in a file next to its source code. Team X owns service X's infra-config file; team Y owns Y's. Normal code-review and deployment gates apply.
- Aggregator service. A lightweight service watches all per-tenant config files (commit-triggered or filesystem-watching) and computes the mega-config the infrastructure service actually consumes. The aggregator has no domain logic — its only job is merging N inputs into 1 output, with deterministic precedence rules for any shared keys.
- Fast incremental rebuild. A change in service X's file triggers a fresh mega-config computation within seconds, keeping the blast radius minimal.
- Tombstoning on delete. Deleting a per-service file doesn't immediately drop the entry from the mega-config. The aggregator marks the entry tombstoned for a configurable window (days) before reclaiming, so accidental deletes are reversible and race conditions between config sources are avoided when downstream systems still reference the service.
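The merge-plus-tombstone behavior above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the source: the function names, data shapes, and the three-day window are all assumptions.

```python
import time

# Illustrative reclaim window; the source only says "days is typical".
TOMBSTONE_WINDOW_S = 3 * 24 * 3600

def aggregate(tenant_configs, previous_mega, now=None):
    """Merge per-tenant configs into one mega-config, tombstoning deletions.

    tenant_configs: {service_name: config_dict} parsed from per-service files.
    previous_mega:  the mega-config this aggregator produced last time.
    """
    now = now if now is not None else time.time()
    mega = {service: {"config": cfg, "tombstoned_at": None}
            for service, cfg in tenant_configs.items()}
    # A service whose file disappeared is kept as a tombstone for the window,
    # so an accidental delete is reversible and downstream systems that still
    # reference the service don't race against the removal.
    for service, entry in previous_mega.items():
        if service in mega:
            continue  # file is back (or never left): entry is live again
        deleted_at = entry["tombstoned_at"] or now
        if now - deleted_at < TOMBSTONE_WINDOW_S:
            mega[service] = {"config": entry["config"],
                             "tombstoned_at": deleted_at}
        # else: window expired, entry is reclaimed (dropped)
    return mega
```

Restoring the deleted file before the window expires simply re-merges the entry as live, which is what makes the accidental-delete case a recoverable non-event.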
What this gives you¶
- Per-tenant rollback. Revert service X's config file — only service X's behavior changes.
- Tenant self-service. Service X's team ships infra-config changes on their own release cadence, without infra-team review of each change.
- Decoupled push cadence. Incremental mega-config updates propagate in seconds; no bundled hours-long pushes.
- Clear ownership. `git blame service-X.yaml` identifies owners without cross-service inference.
- Safer deletes. Tombstoning turns "I accidentally deleted this service's entry" from an outage into a no-op (at least until the reclaim window expires).
Design considerations¶
- Shared-key policy. If two tenant files write to the same logical key (shouldn't happen in well-factored schemas, but does), the aggregator needs a deterministic precedence rule or schema that prevents it.
- Consistency of the mega-config. Aggregator writes must be atomic from the infra service's perspective — no torn reads seeing half an update.
- Backups / versioning. If the underlying config store isn't versioned, periodic backups are the minimal compensator for rollback beyond the tombstone window.
- Validator placement. Schema validation should run at tenant-file commit time (fail before merge) rather than in the aggregator (so bad configs never reach the mega-config).
- Deterministic rebuilds. When the infra service restarts, the aggregator must reproduce the same mega-config from the same set of tenant files, so recovery converges on a known state.
- Tombstone window tuning. Too short → still races with downstream configs. Too long → stale entries pollute the config and confuse operators. Days is typical for cross-system references.
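The atomicity requirement above is commonly met with the write-to-temp-then-rename idiom. A sketch, assuming the mega-config is published as a single JSON file on a POSIX filesystem (the source does not specify the store):

```python
import json
import os
import tempfile

def publish_mega_config(mega, path):
    """Atomically replace the mega-config file so the infra service
    never observes a torn (half-written) update."""
    dir_ = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_, prefix=".mega-config.")
    try:
        with os.fdopen(fd, "w") as f:
            # sort_keys makes output deterministic for identical inputs
            json.dump(mega, f, sort_keys=True)
            f.flush()
            os.fsync(f.fileno())  # durable before it becomes visible
        # Atomic on POSIX: readers see the old file or the new one, never a mix
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

The temp file must live in the same directory as the target, because `os.replace` is only atomic within a single filesystem.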
Variations¶
- Git-backed vs. API-backed. Per-service files can live in a monorepo (commit is the write; review is the ship gate) or in a dedicated config-management API. patterns/git-based-config-workflow is the former.
- Validate vs. compile. Some aggregators just concatenate with validation; others compile (e.g. translate declarative tenant config into a lower-level imperative mega-config). Compile mode gives the infra team more control at the cost of aggregator complexity.
- Per-environment. Tenant files may have dev/staging/prod sections; aggregator assembles per-environment mega-configs.
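For the per-environment variation, a tenant file might look like the following. The schema is hypothetical; all key names and values are illustrative, not from the source:

```yaml
# service-X's infra-config file, owned by team X, next to service X's code
service: service-X
owners: [team-x]
environments:
  dev:
    max_connections: 10
  staging:
    max_connections: 100
  prod:
    max_connections: 5000
```

The aggregator would then emit one mega-config per environment, each assembled from the matching section of every tenant file.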
Relationship to other patterns¶
- patterns/git-based-config-workflow: orthogonal — the per-service files are usually git-managed, but this pattern is about the sharding dimension, not about the write medium.
- concepts/ownership: this pattern operationalizes ownership — each tenant service's team owns their config file end-to-end.
- concepts/control-plane-data-plane-separation: the aggregator + infra service is a control/data split — aggregator decides, infra serves. A bad aggregator change affects new tenant edits; the infra data plane keeps running on last-known-good.
- patterns/fast-rollback: enables per-tenant fast rollback (single-file revert) where mega-config rollback would be coarse.
Seen in¶
- sources/2024-10-28-dropbox-robinhood-in-house-load-balancing — Robinhood's config-aggregator service. Problem stated verbatim: "it's risky to press the rollback button because we don't know how many other services have also made changes since the last push"; "the Robinhood team would have to get involved in every breaking config push — which is a waste of engineering time, since most incidents can be resolved by the service owner"; "each push takes hours to deploy to multiple data centers in order to minimize potential risks." Fix: per-service Robinhood configs, aggregator assembles the mega-config, per-service rollbacks unblock the Robinhood team. Tombstone feature added to handle the race between Robinhood config removal and downstream Envoy configs still referencing the service.