Central telemetry aggregation

Pattern

In a multi-account platform (especially account-per-tenant), forward logs, metrics, and traces from every source account into a single central aggregation tier. Define multi-alerts once against the aggregated data and present a single pane of glass to engineers, while the underlying telemetry still originates from isolated accounts. The pattern is designed to recover cross-fleet visibility without re-coupling the accounts the isolation architecture was built to separate.

Canonical form

"Observability tooling should be centralized, but without reintroducing the very risks that accounts are meant to isolate." (Source: sources/2026-02-25-aws-6000-accounts-three-people-one-platform)

ProGlove's shape: forward logs and metrics to a central third-party observability application; multi-alerts are defined once centrally but trigger per tenant account. Engineers see the aggregated view; raw telemetry still lives per-account.

Load-bearing sub-prescriptions

  1. Don't replicate per-account alarms blindly — if you naively fan the same alarm out to every tenant account, you drown in alerts proportional to N_accounts, and alert fatigue makes the fleet less observable, not more. Use streaming + aggregation at the central tier; define threshold-breach logic once against the aggregated streams.
  2. Tag for context. Every metric and log must carry the source AWS account ID (and tenant-id / service / Region / environment tags) so that aggregated views can drill into single-tenant problems without losing the cross-fleet summary.
  3. Enforce tagging consistency. Consider AWS Organizations tag policies to enforce a consistent scheme — the aggregation layer is only as useful as the discipline of its inputs.
  4. Stay current with AWS primitives. AWS's Observability Access Manager, CloudWatch metric streams, and EventBridge integrations are all evolving and may reduce custom-pipeline surface area.

(All Source: sources/2026-02-25-aws-6000-accounts-three-people-one-platform)
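Sub-prescriptions 1 and 2 can be sketched together: one threshold rule evaluated against the fleet-wide aggregate, with per-account tags retained so a breach can drill into individual tenants. This is a minimal illustration, not ProGlove's implementation; the record shape and field names (`metric`, `account_id`, `value`) are assumptions.

```python
from collections import defaultdict

def fleet_breaches(datapoints, metric, threshold):
    """Apply ONE threshold rule to the fleet-wide aggregate (not
    N_accounts copies of the alarm), then surface per-account
    contributions for drill-down only when the rule breaches."""
    by_account = defaultdict(float)
    for dp in datapoints:
        if dp["metric"] == metric:
            # Tag-for-context: every datapoint carries its source account ID.
            by_account[dp["account_id"]] += dp["value"]
    total = sum(by_account.values())
    if total <= threshold:
        return None  # one evaluation, zero per-account alerts
    # Breach: keep the cross-fleet summary, expose the worst offenders.
    return {
        "total": total,
        "top_accounts": sorted(by_account, key=by_account.get, reverse=True)[:3],
    }

points = [
    {"metric": "errors_5xx", "account_id": "111111111111", "value": 40.0},
    {"metric": "errors_5xx", "account_id": "222222222222", "value": 5.0},
    {"metric": "latency_ms", "account_id": "111111111111", "value": 120.0},
]
print(fleet_breaches(points, "errors_5xx", threshold=30.0))
```

The key property: alert volume scales with the number of rules, not with N_accounts, while the account-ID tag preserves the single-tenant drill-down path.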

What "without reintroducing the risks" means in practice

  • Read-only aggregation access. The central observability role assumes cross-account read-only roles into source accounts; it cannot mutate the source accounts. Anything richer (e.g. admin-level CloudWatch access) re-opens the blast-radius boundary.
  • Central-tier compromise is a fleet-wide incident. The single pane of glass is also a single point of compromise for read visibility; harden the aggregation account as you would a production-critical service, not as an internal tool.
  • No back-channels. A legitimate "we need to act on this alert" path must go through the same per-account access controls (ChatOps / break-glass / scoped roles), not the aggregation layer's own credentials.

Scale signal

Per-account cost of telemetry is the enemy at high account count: "the volume of collected data can make per-account costs economically unsustainable. Instead, focus on understanding which metrics you need to monitor and select an observability approach that allows you to implement that." (Source: sources/2026-02-25-aws-6000-accounts-three-people-one-platform)

Canonical downstream heuristics:

  • Sample high-volume low-signal metrics per-account before forwarding.
  • Aggregate at the edge (per-account) so only rolled-up streams leave the account.
  • Tier storage at the central side (hot → cold) with short retention for raw per-account streams and longer retention only for aggregated derived metrics.
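The "aggregate at the edge" heuristic can be sketched as a per-account rollup: raw datapoints collapse into per-minute count/sum/min/max summaries, and only the rolled-up stream leaves the account. A minimal illustration under assumed data shapes (timestamp-value pairs); not from the source.

```python
from collections import defaultdict

def rollup(points, window_s=60):
    """Collapse raw (timestamp, value) datapoints into per-window
    summaries inside the source account, so the forwarded stream is
    a handful of records instead of the full raw volume."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window_s].append(value)  # floor to window start
    return {
        start: {"count": len(vs), "sum": sum(vs), "min": min(vs), "max": max(vs)}
        for start, vs in buckets.items()
    }

raw = [(0, 1.0), (10, 3.0), (59, 2.0), (61, 10.0)]
print(rollup(raw))  # four raw points leave the account as two summaries
```

count/sum/min/max (rather than a precomputed average) is the conventional rollup shape because the central tier can still derive correct fleet-wide averages by summing sums and counts.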

Where AWS-native OAM fits

At build time, ProGlove used a third-party tool. OAM has since shipped and offers the same architectural shape (cross-account read-only visibility, no telemetry copying) as an AWS-native alternative. For new platforms, OAM is the starting point.
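OAM's shape is two API calls: a sink in the central monitoring account, a link in each source account. Shown here as parameter dicts rather than live calls; the field names and resource-type strings are the real CloudWatch OAM API shapes (CreateSink / CreateLink), while the sink name, label, and ARN are placeholders.

```python
# Run in the central monitoring account: create the sink that source
# accounts will link to.
create_sink_params = {
    "Name": "central-observability-sink",  # placeholder name
}

# Run in each source account: link its telemetry to the sink. OAM
# shares visibility without copying telemetry out of the account.
create_link_params = {
    "LabelTemplate": "$AccountName",  # how this account appears centrally
    "ResourceTypes": [
        "AWS::CloudWatch::Metric",
        "AWS::Logs::LogGroup",
        "AWS::XRay::Trace",
    ],
    "SinkIdentifier": "arn:aws:oam:eu-west-1:111111111111:sink/EXAMPLE",
}

# With boto3 this would be:
#   boto3.client("oam").create_sink(**create_sink_params)
#   boto3.client("oam").create_link(**create_link_params)
# plus a sink policy in the monitoring account authorizing the links.
```

Note the symmetry with the pattern's risk posture: the link grants the monitoring account read visibility into the listed resource types, nothing more.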
