
AWS 2026-04-08 Tier 1


Build a multi-tenant configuration system with tagged storage patterns

Summary

AWS Architecture Blog walkthrough of a multi-tenant configuration service built on two heterogeneous storage backends behind a NestJS gRPC service, with three architectural ideas worth extracting: (1) tagged-storage routing — a Strategy-Pattern-backed factory dispatches each configuration request to the storage backend that best fits its access pattern, keyed on the request key's prefix; (2) event-driven cache refresh — EventBridge watches Parameter Store changes and triggers a Lambda that invalidates in-memory caches on live service instances over gRPC, eliminating the TTL-vs-staleness dilemma for shared config without polling or restarts; (3) JWT-only tenant extraction — the service never reads tenantId from request parameters; it is extracted exclusively from the validated Cognito JWT's custom:tenantId immutable claim, making cross-tenant access structurally impossible even if request bodies are manipulated.

The post is a reference architecture with a GitHub CloudFormation sample; no production scale numbers (RPS, tenant count, latency percentiles) are disclosed. It earns Tier-1 treatment because the three component patterns generalize well beyond the specific stack — prefix-based storage routing, push-invalidation of service-local caches, and JWT-sourced tenant context appear across the wiki already, and this source consolidates them into one canonical composite shape for configuration services specifically.

Architecture at a glance

Client → Cognito (JWT with custom:tenantId) → WAF → API Gateway
      → VPC Link → ALB
      → ECS Fargate tasks in private subnets
         ├── Order Service  (REST; delegates to Config via gRPC)
         └── Config Service (gRPC)
            ├── CognitoJwtGuard       (validates JWT against JWKS)
            ├── TenantAccessGuard     (tenant membership check)
            ├── ConfigStrategyFactory (examines key prefix)
            │    ├── tenant_config_*  → DynamoDB strategy
            │    └── param_config_*   → Parameter Store strategy
            └── In-memory cache (per strategy, different TTL profile)

Parameter Store change
      → EventBridge rule (/config-service/* path match)
      → Lambda (extracts tenantId from path)
      → AWS Cloud Map lookup (healthy Config Service instances)
      → gRPC refresh call to each instance
      → In-memory cache updated (zero downtime)

Service discovery: AWS Cloud Map. Observability: CloudWatch. Edge: WAF + API Gateway.

Key takeaways

  1. Tagged-storage routing is a generalization of key-prefix-based backend dispatch. Configuration keys start with tenant_config_ (→ DynamoDB, high-frequency per-tenant) or param_config_ (→ Parameter Store, shared hierarchical, infrequent-write). A Strategy-Pattern factory maps prefix → strategy at request time; adding a third backend (e.g., Secrets Manager, S3 for large blobs) is a new strategy class + one entry in the keyStrategyMap, with no changes to existing strategies or calling code. (Source: sources/2026-04-08-aws-build-a-multi-tenant-configuration-system-with-tagged-storage-patterns §B). See patterns/tagged-storage-routing.
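The sample's exact class names aren't reproduced here; a minimal sketch of the prefix → strategy dispatch, with hypothetical names (`ConfigStrategy`, `keyStrategyMap`, `resolveStrategy`) standing in for the sample's:

```typescript
// Each backend implements one strategy; the factory only inspects the key prefix.
interface ConfigStrategy {
  get(key: string): Promise<string | undefined>;
}

class DynamoDbStrategy implements ConfigStrategy {
  async get(_key: string): Promise<string | undefined> {
    return undefined; // real version: Query on pk = TENANT#{id}, sk = CONFIG#{type}
  }
}

class ParameterStoreStrategy implements ConfigStrategy {
  async get(_key: string): Promise<string | undefined> {
    return undefined; // real version: GetParametersByPath on the tenant's hierarchy
  }
}

// Prefix → strategy. Adding a third backend (Secrets Manager, S3) is one new
// strategy class plus one entry here; existing strategies and callers are untouched.
const keyStrategyMap: Record<string, ConfigStrategy> = {
  tenant_config_: new DynamoDbStrategy(),
  param_config_: new ParameterStoreStrategy(),
};

function resolveStrategy(key: string): ConfigStrategy {
  for (const prefix of Object.keys(keyStrategyMap)) {
    if (key.startsWith(prefix)) return keyStrategyMap[prefix];
  }
  throw new Error(`No storage strategy registered for key: ${key}`);
}
```

The key itself documents which backend owns it, which is the property the wiki generalizes from this post.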

  2. The TTL-vs-staleness dilemma is the forcing function for event-driven refresh. Traditional caches force a choice between stale tenant context (risking incorrect data isolation or feature flags) and aggressive invalidation (which sacrifices performance and amplifies load on the metadata service). The post frames this as an either/or that becomes untenable as tenant counts grow into the hundreds or thousands. The escape is to make invalidation reactive rather than temporal: EventBridge monitors Parameter Store, a Lambda receives the change event and pushes the update to live service instances over gRPC. See concepts/cache-ttl-staleness-dilemma + patterns/event-driven-config-refresh.
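A sketch of the Lambda's first step under the post's `/config-service/{tenantId}/{service}/{parameter}` layout. The event shape assumes the standard EventBridge "Parameter Store Change" detail (`detail.name`, `detail.operation`); function names are illustrative, not from the sample:

```typescript
// EventBridge "Parameter Store Change" events carry the parameter name in detail.name.
interface ParameterStoreChangeEvent {
  detail: { name: string; operation: string };
}

// "/config-service/{tenantId}/{service}/{parameter}" → its three components,
// or undefined for paths outside the config hierarchy (the EventBridge rule
// should already filter those out).
function parseConfigPath(
  name: string
): { tenantId: string; service: string; parameter: string } | undefined {
  const m = name.match(/^\/config-service\/([^/]+)\/([^/]+)\/(.+)$/);
  if (!m) return undefined;
  return { tenantId: m[1], service: m[2], parameter: m[3] };
}

async function handler(event: ParameterStoreChangeEvent): Promise<void> {
  const parsed = parseConfigPath(event.detail.name);
  if (!parsed) return;
  // Next steps (not shown): DiscoverInstances against Cloud Map for healthy
  // Config Service tasks, then a gRPC refresh call per instance with parsed.tenantId.
}
```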

  3. Polling and restart-based updates are explicitly named as broken alternatives. Polling generates unnecessary API calls that cost money even when nothing changes, plus multi-second-to-minute latency between change and service visibility. Service restarts drop active connections and disrupt user sessions — "unacceptable" for 24/7 SaaS. The event-driven path eliminates both failure modes with "no service restarts, connections remain active, updates within seconds". (Source §D)

  4. Tenant isolation is enforced at the identity layer, not the application layer, by never accepting tenantId from request parameters. The article calls this out as a "critical security design: the service never accepts tenantId from request parameters. Instead, it extracts the tenant context from validated JWT tokens." Even if a user manipulates the request body to reach another tenant's data, the query still uses the JWT's tenant claim. The custom Cognito attribute custom:tenantId is declared immutable at user-pool creation, cryptographically binding the tenant to the identity at the identity-provider level. See patterns/jwt-tenant-claim-extraction.
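The enforcement point reduces to a few lines once the guard has verified the token. A sketch (names mine, not the sample's) of tenant extraction that structurally ignores anything the caller sends:

```typescript
// Claims of an already-verified Cognito token — signature checked against the
// user pool's JWKS upstream (the post's CognitoJwtGuard does this in NestJS).
interface VerifiedClaims {
  sub: string;
  "custom:tenantId"?: string;
}

// Tenant context comes only from the token. The request body is a parameter
// here purely to make the point visible: it is never consulted for tenantId,
// so a manipulated body still queries the caller's own tenant.
function tenantContext(claims: VerifiedClaims, _requestBody: unknown): string {
  const tenantId = claims["custom:tenantId"];
  if (!tenantId) throw new Error("token missing custom:tenantId claim");
  return tenantId;
}
```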

  5. Two complementary storage backends reflect two different access-pattern tiers. DynamoDB for tenant-specific, high-frequency per-request reads with composite-key isolation (pk = TENANT#{id}, sk = CONFIG#{type}); Parameter Store for hierarchical shared parameters retrievable in bulk via GetParametersByPath (/config-service/{tenantId}/{service}/{parameter}). The post explicitly frames this as "routing to optimized backends alleviates both DynamoDB cost explosions (for rarely-changing configs) and Parameter Store throttling (for high-frequency reads)" — single-backend solutions lose on one side or the other.
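The two key shapes, as quoted from the post (builder function names are mine):

```typescript
// DynamoDB composite key for tenant-specific, per-request config reads.
function dynamoKey(tenantId: string, configType: string) {
  return { pk: `TENANT#${tenantId}`, sk: `CONFIG#${configType}` };
}

// Parameter Store hierarchy prefix: one GetParametersByPath call on this path
// returns all of a tenant + service's shared parameters in bulk.
function parameterPath(tenantId: string, service: string): string {
  return `/config-service/${tenantId}/${service}`;
}
```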

  6. Cache-security discipline for in-memory multi-tenant caches: key values with tenant-prefixed composite keys (tenantId:serviceName:configKey) and never cache sensitive values. The in-memory map holds configuration metadata only (API endpoints, feature flags, thresholds); credentials and PII stay in Parameter Store's SecureString type and are fetched on demand. The post makes the layered-defense posture explicit: "downstream access controls (JWT validation, DynamoDB composite keys) act as the final enforcement boundary." (Source §B)
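Both rules fit in a few lines; a sketch with hypothetical names (`cacheKey`, `cacheable`), where the SSM parameter type drives the never-cache-sensitive rule:

```typescript
// Tenant-prefixed composite cache key: the tenant segment comes from the
// JWT-derived context, so a cache hit can never cross tenants.
function cacheKey(tenantId: string, serviceName: string, configKey: string): string {
  return `${tenantId}:${serviceName}:${configKey}`;
}

// Only non-sensitive metadata is cacheable; SecureString parameters
// (credentials, PII) are fetched from Parameter Store on demand and
// never enter the in-memory map.
type SsmParameterType = "String" | "StringList" | "SecureString";

function cacheable(type: SsmParameterType): boolean {
  return type !== "SecureString";
}
```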

  7. Multi-dimensional tenant context is the escalation path for service-level isolation beyond tenant-level. Partition key becomes TENANT#{id}|SERVICE#{name}; the Order service sees billing-API configs while the Reporting service doesn't see payment gateway settings. This is a pre-structural-isolation knob — still application-level enforcement, still a shared execution role — but bounds the application-layer blast radius within the shared-account model.
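The escalation is a one-line change to the partition-key builder (function name mine):

```typescript
// Tenant + service isolation: the partition key gains a service dimension, so
// the Order service's queries can never return the Reporting service's items
// even within the same tenant — still application-level enforcement, but with
// a smaller blast radius inside the shared-account model.
function multiDimensionalPk(tenantId: string, serviceName: string): string {
  return `TENANT#${tenantId}|SERVICE#${serviceName}`;
}
```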

  8. Infrastructure-level credential isolation via Token Vending Machine + STS is explicitly pitched as "a next step when compliance auditors require infrastructure-level separation, rather than a baseline requirement." The TVM pattern issues temporary, tenant-scoped IAM credentials so per-tenant CloudTrail audit trails and least-privilege enforcement hold at the AWS-credential layer, at a 50-100ms-per-operation latency cost plus the operational overhead of credential caching, STS API charges, and token refresh. Canonical upgrade from in-account multi-layer toward account-per-tenant.
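The post doesn't show TVM code; a sketch of the shape it implies, building an STS AssumeRole request whose inline session policy scopes the shared role down to one tenant's partition. `dynamodb:LeadingKeys` is the real IAM condition key for partition-key-prefix restriction; the role ARN, account ID, and table name are placeholders:

```typescript
// Tenant-scoped AssumeRole parameters a Token Vending Machine might construct.
// The session policy intersects with the role's permissions, so the returned
// temporary credentials can only touch this tenant's items — and CloudTrail
// records the per-tenant session name, closing the shared-principal audit gap.
function tenantScopedAssumeRoleParams(tenantId: string) {
  return {
    RoleArn: "arn:aws:iam::123456789012:role/config-service-tenant-role", // placeholder
    RoleSessionName: `tenant-${tenantId}`,
    DurationSeconds: 900, // short-lived; cached per tenant to amortize STS calls
    Policy: JSON.stringify({
      Version: "2012-10-17",
      Statement: [
        {
          Effect: "Allow",
          Action: ["dynamodb:GetItem", "dynamodb:Query"],
          Resource: "arn:aws:dynamodb:*:123456789012:table/TenantConfig", // placeholder
          Condition: {
            // Restrict queries to items whose partition key is this tenant's.
            "ForAllValues:StringEquals": {
              "dynamodb:LeadingKeys": [`TENANT#${tenantId}`],
            },
          },
        },
      ],
    }),
  };
}
```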

Operational numbers

The post does not disclose production numbers — no tenant count, RPS, p50/p99 latency, cache hit rate, DynamoDB/Parameter Store read/write budgets, or cost metrics. Only relative qualitative framing:

  • Configuration updates "within seconds" through the EventBridge path.
  • "Thousands of times per minute" cited as the access-frequency description for tenant-specific configs (the DynamoDB path).
  • "Dozens to a single request" for the Parameter Store hierarchical bulk-retrieval win via GetParametersByPath.
  • TVM latency: "50-100ms per operation" added.
  • ElastiCache (Redis OSS) / Valkey alternative cached-read latency: "1-3ms network latency versus sub-millisecond in-memory access".

At 1000+ RPS with sub-millisecond targets, the post names DAX as the canonical acceleration layer (microsecond reads, a 5-10× improvement over DynamoDB's single-digit-ms baseline).

Caveats

  • Reference-architecture post with a GitHub CloudFormation sample, not a production retrospective. No failure-mode autopsy, no cost data, no scale numbers, no incident examples.
  • The "zero-downtime" framing assumes the EventBridge + Lambda + gRPC refresh path is itself reliable; failure modes (EventBridge delivery latency percentiles, Lambda cold starts on rare changes, gRPC refresh failures against unhealthy instances, partial-fleet refresh states) are not discussed.
  • JWT-sourced tenant isolation relies on the Cognito token lifetime (typically minutes to hours). Attribute changes mid-session require a token refresh; the post doesn't discuss revocation latency or how tenant membership changes are propagated before token expiry.
  • The single shared IAM execution role for ECS tasks means CloudTrail attributes all config-service AWS API calls to one principal — tenant-level audit trail exists only at the application log layer, not at the AWS API layer. TVM is the path to close this gap (called out explicitly).
  • No discussion of multi-region / cross-region replication of config data. Single-region assumption throughout.

Relationship to existing wiki content

  • Direct generalization of the prefix-routing shape seen in Figma FigCache (Redis command dispatch on key prefix) and prefix-aware routing. The config-service version is the same primitive applied to storage-backend selection rather than cache-cluster selection.
  • Multi-layer tenant isolation without per-tenant AWS accounts: sits between Convera (in-account per-tenant AVP policy stores + zero-trust re-verification) and ProGlove (account-per-tenant). This source's shape is the lightest of the three — single shared account, application-layer + JWT-claim enforcement, no per-tenant AWS resources.
  • Event-driven cache invalidation is the same shape as Figma LiveGraph's stateless schema-aware invalidator — WAL-tail → broadcast invalidations over dedicated channels to cache replicas. The Config Service variant is coarser (EventBridge events vs Postgres WAL tail) but structurally identical: a separate component observes the source of truth and pushes invalidations to stateless caches. See concepts/invalidation-based-cache.
  • Strategy Pattern as the vehicle for pluggable storage backends is a well-worn OO pattern; the wiki-relevant generalization is "route on content-addressable prefix so the dispatch is O(1) + the key itself documents which backend owns it." See patterns/tagged-storage-routing.

Raw

See raw/aws/2026-04-08-build-a-multi-tenant-configuration-system-with-tagged-storag-044729fd.md. Original post: aws.amazon.com/blogs/architecture/build-a-multi-tenant-configuration-system-with-tagged-storage-patterns
