AWS 2026-02-04 Tier 1

AWS: Mastering millisecond latency and millions of events — the event-driven architecture behind the Amazon Key Suite

Summary

AWS Architecture Blog post by the Amazon Key team on modernizing their access-management platform (In-Garage Delivery, apartment-building access for property managers) from a tightly coupled monolithic system into a decoupled event-driven architecture on Amazon EventBridge. The core architectural move is the single-bus, multi-account pattern: per-service teams own application stacks while a central DevOps team owns the event bus + rules + target configurations. To close three gaps EventBridge doesn't address natively, the team built a custom event-schema repository (single source of truth, JSON Schema Draft-04, versioned, code bindings at build time), a client library with built-in schema validation + serde, and a CDK-based subscriber constructs library that provisions a per-subscriber event bus + IAM roles + monitoring/alerting from ~5 lines of code. Reported production results: 2,000 events/second, 99.99% success rate, 80ms p90 end-to-end latency across 14M subscriber calls, five-day → one-day integration time for new use cases, 48h → 4h new-event onboarding, 40h → 8h publisher/subscriber integration, 100% single-control-plane governance of event-bus infrastructure.

Key takeaways

  1. "Loose event schemas" is a coordination cost, not a flexibility feature. Pre-migration events were loosely typed, undocumented, and unvalidated; breaking changes were "almost impossible to implement" (can't safely remove unused fields); there was no canonical place for schema modifications across teams; publishers couldn't detect invalid events before emission. The team reframed the schema repository as foundational infrastructure on par with the message broker itself. (Source: sources/2026-02-04-aws-amazon-key-eventbridge-event-driven-architecture)
  2. Evaluated centralized validation service vs client-side; chose client-side. Centralized validation would have added an extra network hop + its own infra-scaling problem. Client-side validation via the shared client library consuming the schema repository gave immediate developer feedback + kept the validation path off the runtime critical path. Canonical patterns/client-side-schema-validation.
  3. EventBridge has schema discovery + a schema registry — but no native validation. "EventBridge provides developers with tools to implement validation using external solutions or custom application code, it currently does not include native schema validation capabilities." So schema validation is an application-layer responsibility that customers with strict-validation requirements must build. The team built a custom schema repository (distinct from the EventBridge schema registry) specifically because schema validation was a critical requirement.
  4. Single-bus multi-account split of ownership. Service teams own their application stacks (business logic, publishing code, consumer code); the central DevOps team owns the single event bus + event-bus rules + target configurations + service integrations.
  5. Named benefits of the single-bus pattern: clear ownership boundaries, centralized governance (consistent routing + security + monitoring), simplified operations (one bus with logical separation via rules), enhanced security (multi-account natural isolation + controlled cross-account event flow), streamlined compliance (centralized data-exchange patterns). Pattern link: AWS single-bus-multi-account reference pattern.
  6. Client library as productivity multiplier. Four responsibilities: (a) generate code bindings at build time from schemas → type-safe event creation; (b) validate before publish; (c) serialize + publish to the bus; (d) deserialize for subscribers into usable formats. Claim: "standardized client library addressed 90% of common integration errors."
  7. Subscriber constructs library (AWS CDK) as the second productivity multiplier. A new Subscription(scope, id, { name, application: { region } }) ~5-line CDK construct provisions the dedicated subscriber event bus + cross-account IAM roles + standardized monitoring + alerting. Subscribers focus on business logic instead of infrastructure boilerplate. Canonical patterns/reusable-subscriber-constructs.
  8. The failure mode the migration eliminated: cross-service cascade. "An issue in Service-A triggered a cascade of failures across many upstream services, with increased timeouts leading to retry attempts and ultimately resulting in service deadlocks." Single-device-vendor issues caused fleet-wide degradation despite scope being limited to one delivery operation. This is the classic tight-coupling blast-radius pattern; event-driven architecture isolates it via async decoupling + per-subscriber queues absorbing publisher failures.
  9. Why ad-hoc SNS/SQS pairs were not the answer. The team explicitly calls out prior attempts at per-integration SNS/SQS pairs between services as "implemented on an ad-hoc basis, lacking standardization and creating additional maintenance overhead" — redundant work, no shared abstraction. The event bus provides the shared abstraction that per-pair pub/sub was failing to deliver at Amazon Key's scale.
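The client-side validation pattern in takeaway 2 can be sketched as below. This is a hypothetical stand-in: the post does not disclose the real client library's language or API, only that it validates against the schema repository before publish. The required fields mirror the example schema (id, type, time as ISO 8601, publisher); everything else is illustrative.

```typescript
// Hypothetical sketch of client-side event validation before publish.
// Field list comes from the post's example schema; the function and type
// names are invented for illustration.

interface KeyEvent {
  id: string;
  type: string;
  time: string;      // ISO 8601 timestamp
  publisher: string;
  detail?: Record<string, unknown>;
}

// Returns a list of validation errors; an empty list means the event
// may be serialized and published to the bus.
function validateEvent(event: Partial<KeyEvent>): string[] {
  const errors: string[] = [];
  for (const field of ["id", "type", "time", "publisher"] as const) {
    if (typeof event[field] !== "string" || event[field] === "") {
      errors.push(`missing or invalid required field: ${field}`);
    }
  }
  // Reject timestamps that don't parse as a date at all.
  if (typeof event.time === "string" && Number.isNaN(Date.parse(event.time))) {
    errors.push("time is not a parseable ISO 8601 timestamp");
  }
  return errors;
}

const ok = validateEvent({
  id: "evt-123",
  type: "delivery.completed",
  time: "2026-02-04T12:00:00Z",
  publisher: "in-garage-delivery",
});
const bad = validateEvent({ id: "evt-456", type: "delivery.completed" });
// ok.length === 0; bad.length === 2 (time and publisher missing)
```

This is the "immediate developer feedback" half of the pattern: the publisher learns about an invalid event at build/publish time rather than after a centralized validation service rejects it over the network.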

Architectural facts & numbers

  • Throughput: 2,000 events/second.
  • Success rate: 99.99%.
  • Latency: 80ms p90 from event ingestion to target invocation.
  • Subscriber calls: 14,000,000 (span unspecified — likely cumulative over the reporting window).
  • Integration time for new use cases: 5 days → 1 day (80% reduction).
  • New event onboarding in the custom schema repository: 48h → 4h.
  • Publisher/subscriber integration: 40h → 8h.
  • Integration-error coverage: "standardized client library addressed 90% of common integration errors".
  • Schema format: JSON Schema Draft-04 ("$schema": "http://json-schema.org/draft-04/schema#").
  • Event type references: $ref into per-type schema files (EventType.json, ../core/Publisher.json) for inheritance / composition across schemas.
  • Required event fields (from the example): id, type, time (ISO 8601), publisher.
  • Control plane: single centralized DevOps-owned stack for 100% of event-bus infrastructure.
  • Compliance-check coverage: 100% of unauthorized data-exchange patterns caught by automated security compliance checks.
  • Ownership model: multi-account per service team + single centralized event-bus stack.
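Taken together, the schema facts above (Draft-04, `$ref` composition into `EventType.json` and `../core/Publisher.json`, required fields id/type/time/publisher) suggest a per-event-type schema file shaped roughly like this. The title and property details beyond the required-field list are illustrative, not disclosed in the post:

```json
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "ExampleEvent",
  "type": "object",
  "properties": {
    "id": { "type": "string" },
    "type": { "$ref": "EventType.json" },
    "time": { "type": "string", "format": "date-time" },
    "publisher": { "$ref": "../core/Publisher.json" }
  },
  "required": ["id", "type", "time", "publisher"]
}
```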

Systems / concepts / patterns extracted

  • Systems (new):
  • systems/amazon-eventbridge — AWS's managed serverless event bus; rule-based routing, schema registry + schema discovery (no native validation), supports cross-account targets via resource policies.
  • systems/amazon-key — Amazon's physical-access-management product family (In-Garage Delivery, apartment-building access); the production instance of the pattern in this post.
  • Systems (existing, extended):
  • systems/aws-sns — explicitly named in the "ad-hoc SNS/SQS pairs" anti-pattern that motivated EventBridge adoption; extend with "event-bus as superseding abstraction" note.
  • systems/aws-sqs — same; typically paired with SNS for producer/consumer decoupling, but without the schema-governance or routing-rule surface EventBridge provides.
  • Concepts (new):
  • concepts/event-driven-architecture — services communicate via asynchronous events on a shared bus instead of synchronous request/response; producers don't know consumers; consumers control their own queue depth / retention.
  • concepts/schema-registry — single source of truth for event definitions; enables validation, versioning, deprecation paths, and cross-team collaboration on event contracts.
  • concepts/service-coupling — degree to which services depend on each others' implementation, availability, and behavior; tight coupling creates cascade-failure blast radius.
  • Patterns (new):
  • patterns/single-bus-multi-account — one shared event bus in a central account + per-service-team accounts owning their application stacks; DevOps owns the bus + rules, service teams own the code; AWS reference pattern.
  • patterns/client-side-schema-validation — validate events in a shared client library at publish/subscribe boundary rather than via centralized validation service; immediate developer feedback + no runtime network hop.
  • patterns/reusable-subscriber-constructs — ship subscriber integration as a versioned IaC construct library (AWS CDK here) that provisions per-subscriber event bus + cross-account IAM + monitoring + alerting from a few lines of code.
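The reusable-subscriber-constructs pattern can be modeled with a dependency-free stand-in (the real construct is AWS CDK and its internals are undisclosed; the class below only records what the post says the construct provisions, and the resource names are invented):

```typescript
// Stand-in model of the ~5-line CDK Subscription construct from the post.
// The real construct provisions a dedicated subscriber event bus,
// cross-account IAM roles, and standardized monitoring + alerting;
// the naming scheme here is illustrative only.

interface SubscriptionProps {
  name: string;
  application: { region: string };
}

class Subscription {
  readonly provisioned: string[];

  constructor(_scope: string, _id: string, props: SubscriptionProps) {
    const prefix = `${props.name}-${props.application.region}`;
    this.provisioned = [
      `${prefix}-event-bus`,          // dedicated subscriber event bus
      `${prefix}-cross-account-role`, // IAM role for cross-account delivery
      `${prefix}-delivery-alarms`,    // standardized monitoring + alerting
    ];
  }
}

// Usage mirrors the ~5-line construct call quoted in the post.
const sub = new Subscription("app-stack", "OrderEvents", {
  name: "order-events",
  application: { region: "us-east-1" },
});
```

The point of the pattern is visible in the shape of the call: a subscriber team writes only the props object, and the construct library standardizes everything it fans out into.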

Caveats

  • Marketing-leaning post — architectural blog format, numbers presented as outcomes with no comparison baseline, no p50/p99 / distribution shape, no cost breakdown.
  • No failure-mode analysis for the new architecture: DLQ design, schema-validation-failure behavior, retry-and-poison-pill handling, cross-account IAM incidents.
  • No schema-repository implementation details — described in terms of responsibilities (single source of truth, versioning, release management, audit trails) but architecture (storage, API, delivery mechanism for build-time code bindings, how generated bindings get to publisher/subscriber CI pipelines) not disclosed.
  • Client-library implementation language(s) not disclosed.
  • Schema-evolution policy named but not detailed: "clear deprecation policies and migration paths" / "detect breaking changes early" — no concrete compatibility matrix (backward / forward / full) or required-field-addition policy stated.
  • No quantitative comparison against ad-hoc SNS/SQS pairs or legacy monolith — relative wins framed qualitatively.
  • 14M subscriber calls context ambiguous (per day? per window?); the 2,000 events/sec throughput × 99.99% × subscriber-fanout would imply short reporting windows but isn't stated.
  • Single-device-vendor cascade described as an incident but no COE linked; the retrospective framing suggests pre-migration but isn't explicit.