PATTERN Cited by 1 source

CloudWatch Metric Streams to VPC OpenTelemetry collector

A composite reference architecture for push-based, vendor-neutral metric ingestion that keeps all data inside the customer's VPC. Five AWS primitives are chained: CloudWatch Metric Streams → Amazon Data Firehose → Lambda transform → internal NLB → OpenTelemetry collector on EC2. The OpenTelemetry collector then fans out to multiple observability backends (Grafana Cloud, AWS X-Ray, Honeycomb, Lightstep, etc.) via its exporter stage.

This pattern is the canonical answer to three intersecting requirements: (1) push-based monitoring to escape pull-based API throttling; (2) vendor-neutral observability via OpenTelemetry's open formats; (3) VPC-internal data path for data-privacy / regulatory reasons.

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                      AWS account / region                            │
│                                                                      │
│  ┌──────────────────────────────┐                                    │
│  │ CloudWatch                    │                                    │
│  │  └─ Metric Streams (JSON)     │                                    │
│  └──────────────┬───────────────┘                                    │
│                 │ push                                                │
│                 ▼                                                     │
│  ┌──────────────────────────────┐    ┌─────────────────────────┐     │
│  │ Amazon Data Firehose          │───▶│ S3 (fallback / 0-cost)   │     │
│  │  (delivery stream)            │    └─────────────────────────┘     │
│  └──────────────┬───────────────┘                                    │
│                 │ sync invoke (transform)                            │
│                 ▼                                                     │
│  ┌──────────────────────────────┐                                    │
│  │ Lambda transform (VPC-attached)                                   │
│  └──────────────┬───────────────┘                                    │
│                 │ HTTP push                                           │
│                 ▼                                                     │
│  ┌─────────────  VPC  ───────────────────────────────────────────┐  │
│  │ ┌──────────────────────────┐                                  │  │
│  │ │ Internal NLB              │                                 │  │
│  │ │  (private subnets)        │                                 │  │
│  │ └────────────┬─────────────┘                                  │  │
│  │              ▼                                                 │  │
│  │ ┌──────────────────────────────────────────────────────────┐  │  │
│  │ │ EC2 fleet — OpenTelemetry collector containers            │  │  │
│  │ │  Receivers → Processors → Exporters                       │  │  │
│  │ └────────────┬─────────────────────────────────────────────┘  │  │
│  └──────────────┼─────────────────────────────────────────────────┘  │
│                 │ exporter fan-out                                    │
│                 ▼                                                     │
│  ┌──────────┬──────────┬──────────┬──────────┐                        │
│  │ Grafana  │ X-Ray    │ Honeycomb│ Lightstep│  …                     │
│  │ Cloud    │          │          │          │                        │
│  └──────────┴──────────┴──────────┴──────────┘                        │
└─────────────────────────────────────────────────────────────────────┘
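The collector stage in the diagram (Receivers → Processors → Exporters, then fan-out) can be sketched as a configuration structure. The following is a minimal illustration rather than the source's actual config: it mirrors the layout of a collector YAML file as a Python dict, and assumes an `otlp` receiver on port 4318, the `memory_limiter` and `batch` processors, and two `otlphttp` exporters whose names and endpoints (`backend_a`, `backend_b`, `*.example.com`) are placeholders. The helper `undefined_components` is hypothetical and only checks that every pipeline entry refers to a component that is actually defined.

```python
# Mirrors the top-level layout of a collector YAML config.
# Endpoints, limits, and the backend_a / backend_b names are placeholders.
collector_config = {
    "receivers": {
        "otlp": {"protocols": {"http": {"endpoint": "0.0.0.0:4318"}}},
    },
    "processors": {
        "memory_limiter": {
            "check_interval": "1s",
            "limit_mib": 1024,
            "spike_limit_mib": 256,
        },
        "batch": {},
    },
    "exporters": {
        "otlphttp/backend_a": {"endpoint": "https://backend-a.example.com"},
        "otlphttp/backend_b": {"endpoint": "https://backend-b.example.com"},
    },
    "service": {
        "pipelines": {
            "metrics": {
                "receivers": ["otlp"],
                # memory_limiter is listed first so it can refuse data before
                # later processors buffer it.
                "processors": ["memory_limiter", "batch"],
                # Listing several exporters in one pipeline is the fan-out.
                "exporters": ["otlphttp/backend_a", "otlphttp/backend_b"],
            }
        }
    },
}


def undefined_components(config: dict) -> list:
    """Return one message per pipeline entry that names an undefined component."""
    problems = []
    for name, pipeline in config["service"]["pipelines"].items():
        for kind in ("receivers", "processors", "exporters"):
            for component in pipeline.get(kind, []):
                if component not in config.get(kind, {}):
                    problems.append(
                        f"{name}: {kind[:-1]} '{component}' is not defined"
                    )
    return problems
```

A check of this kind catches only dangling references. It is not a substitute for validating the rendered config with the collector itself before rollout.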

Component roles

| Component | Role | Why this primitive |
| --- | --- | --- |
| CloudWatch Metric Streams | Source-side push of every CloudWatch metric, sub-minute | Inverts the pull failure mode; supports OTel 0.7 / 1.0 / JSON |
| Amazon Data Firehose | Buffer + transform + deliver pipe; managed | The only delivery target Metric Streams supports |
| S3 | Mandatory Firehose fallback destination | Required by Firehose; zero-cost in steady state |
| Lambda transform | Bridge from public Firehose HTTP to private VPC endpoint | Resolves Firehose's "HTTP endpoint must be public" constraint |
| Internal NLB | VPC-internal Layer-4 ingress for the collector fleet | Stable internal endpoint, TCP fan-out across collectors |
| EC2 | Hosts collector containers in private subnets | Customer-managed compute; no managed-OTel collector service |
| OTel collector | Receive → process → fan out to multiple backends | Vendor-neutral central hub |

Why this composition (not alternatives)

The pattern's structural choices fall out from the requirements:

  • Why Metric Streams + Firehose, not direct CloudWatch scrape? Pull-based scrape hits API throttling at scale and amplifies cost. Metric Streams pushes once, no per-metric API call. See concepts/push-vs-pull-monitoring.
  • Why Lambda transform instead of Firehose's native HTTP destination? Firehose's HTTP endpoint destination is public-only — "these endpoints must be public — they cannot be private endpoints inside a VPC." Customer's data-privacy requirement forced VPC-internal delivery. See patterns/firehose-lambda-transform-as-vpc-bridge.
  • Why NLB instead of ALB? OTel collector typically speaks HTTP/gRPC over TCP at fixed ports; NLB's Layer-4 model is the cleaner fit and preserves source IP. ALB would also work but pays a higher per-request cost without adding value here.
  • Why a self-hosted collector instead of Amazon Managed Service for Prometheus? AMP is a metrics backend, not a fan-out hub, and the customer wanted vendor-neutral export to Grafana Cloud / Honeycomb / X-Ray simultaneously. Vendor-neutrality is the load-bearing requirement, not just AWS-native ingestion.
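The Lambda-as-VPC-bridge choice above can be sketched as a Firehose data-transformation handler. This is a minimal sketch under stated assumptions, not the source's implementation. It relies on the Firehose transformation contract: the event carries `records`, each with a `recordId` and base64 `data`, and the response returns each `recordId` with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. The collector URL is a placeholder for the internal NLB's DNS name, and the payload is forwarded unchanged. Any translation into the format the collector's receiver expects is deliberately left out.

```python
import base64
import urllib.request

# Placeholder for the internal NLB endpoint; host, port, and path are assumptions.
COLLECTOR_URL = "http://otel-nlb.internal.example:4318/v1/metrics"


def post_to_collector(payload: bytes, url: str = COLLECTOR_URL,
                      timeout: float = 5.0) -> int:
    """POST one payload to the collector behind the NLB; return the HTTP status."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def handler(event, context=None, send=post_to_collector):
    """Forward each Firehose record into the VPC, then report a per-record result."""
    results = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"])
        try:
            ok = 200 <= send(payload) < 300
        except OSError:
            # urllib's URLError/HTTPError and socket timeouts are OSError subclasses.
            ok = False
        results.append({
            "recordId": record["recordId"],
            # Records marked ProcessingFailed are treated by Firehose as
            # unsuccessfully processed.
            "result": "Ok" if ok else "ProcessingFailed",
            "data": record["data"],
        })
    return {"records": results}
```

The `send` parameter exists only so the forwarding step can be swapped out in tests. In a deployed function the default would be used, with the Lambda attached to subnets that can reach the NLB.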

Trade-offs vs simpler alternatives

| Alternative | What you lose | What you gain |
| --- | --- | --- |
| CloudWatch alarms only | Vendor-neutrality, Grafana / Honeycomb / Lightstep integration | Simpler, AWS-native, no infra to run |
| Prometheus + CloudWatch exporter (pull) | Sub-minute latency; freedom from API throttling at scale and from per-call cost amplification | Familiar Prometheus tooling; in-VPC by default |
| Metric Streams → Firehose → public OTel-as-a-Service | Data-stays-in-VPC property | Skip Lambda + NLB; lower complexity |
| Self-hosted Prometheus + push exporter | Native CloudWatch integration; managed Metric Streams primitive | Single-vendor monitoring stack |

The pattern wins specifically when all three of (push, vendor-neutral, VPC-internal) are required. Drop any one of those and a simpler architecture becomes preferable.

Operational concerns

  • Lambda transform cold-start on critical path — synchronous invocation; a cold start back-pressures Firehose's buffer. Provisioned concurrency mitigates.
  • Firehose buffer + Lambda concurrency tuning — too small a buffer + too few Lambdas = throttle; too large a buffer = delivery latency tail.
  • Collector fleet sizing — collector throughput is bounded by per-instance CPU + RAM; multiple receivers + heavy processors compound. NLB target-group health-checks + ASG-driven autoscaling are the canonical sizing levers.
  • Exporter back-pressure — slow downstream backends back-pressure into the collector's queue, then the collector's processor stage, then the receiver, eventually rejecting Lambda pushes back into Firehose's buffer. The collector's memory_limiter processor is the canonical safety valve.
  • Format choice (OTel 1.0 vs JSON at the Metric Streams layer) — OTel 1.0 lets the collector ingest natively; JSON requires the Lambda transform to translate. JSON is simpler Lambda code; OTel 1.0 is leaner end-to-end if no transform is needed.
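For the JSON side of the format choice, the parsing step in the Lambda transform might look like the sketch below. It assumes the Metric Streams JSON output is newline-delimited, one object per datapoint, with the fields `namespace`, `metric_name`, `dimensions`, `timestamp` (epoch milliseconds), `value` (`count`, `sum`, `min`, `max`), and `unit`. The `DataPoint` type and the function name are hypothetical, and the conversion into the collector's input format is not shown.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPoint:
    """One summary datapoint taken from a Metric Streams JSON record."""
    namespace: str
    metric_name: str
    dimensions: tuple  # sorted (name, value) pairs, so the point is hashable
    timestamp_ms: int
    count: float
    total: float
    minimum: float
    maximum: float
    unit: str


def parse_metric_stream_json(payload: bytes) -> list:
    """Parse newline-delimited JSON records into DataPoint objects."""
    points = []
    for line in payload.decode("utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        doc = json.loads(line)
        value = doc["value"]
        points.append(DataPoint(
            namespace=doc["namespace"],
            metric_name=doc["metric_name"],
            dimensions=tuple(sorted(doc.get("dimensions", {}).items())),
            timestamp_ms=int(doc["timestamp"]),
            count=float(value["count"]),
            total=float(value["sum"]),
            minimum=float(value["min"]),
            maximum=float(value["max"]),
            unit=doc.get("unit", "None"),
        ))
    return points
```

Each record is a statistic set rather than a single value, so a translation step has to decide how count, sum, min, and max map onto the target metric type.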

Hard problems

  • At-least-once duplication — Firehose retries + Lambda retries can produce duplicate metric records under transient failures. Idempotent ingestion in the collector (deduping by metric + timestamp) is non-trivial.
  • Cross-account / cross-region streaming — Metric Streams is per-region; a single OTel collector aggregating across regions requires per-region Firehose + Lambda + cross-region push.
  • Mixed-source aggregation — combining the Metric Streams push with traces + logs from in-cluster OTel SDKs requires the collector to multiplex receivers, which needs careful pipeline config to avoid receiver-blocking-on-slow-exporter.
  • Configuration management for the collector — collector config is YAML; deploying it across an EC2 fleet, validating it before rollout, and rolling back on regression is its own ops surface (typically solved with the same config-as-code primitive the rest of the customer's infra uses).
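The duplication problem in the first bullet can be illustrated with a bounded "recently seen" filter keyed on metric identity plus timestamp. This is a sketch of the general idea only: the class and function names are hypothetical, it is not a collector feature, and it suppresses duplicates only within one process and within the last `capacity` keys. State is not shared across Lambda instances or collector nodes, which is part of why the problem is non-trivial.

```python
from collections import OrderedDict


def dedupe_key(namespace: str, metric_name: str, dimensions: dict,
               timestamp_ms: int) -> tuple:
    """Build a hashable identity for one datapoint (metric + timestamp)."""
    return (namespace, metric_name, tuple(sorted(dimensions.items())), timestamp_ms)


class RecentKeyFilter:
    """Remember the most recent `capacity` keys and report repeats."""

    def __init__(self, capacity: int = 100_000):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._seen = OrderedDict()

    def is_duplicate(self, key) -> bool:
        """Return True if key was seen recently; otherwise record it."""
        if key in self._seen:
            self._seen.move_to_end(key)  # refresh recency on a repeat
            return True
        self._seen[key] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest key
        return False
```

A retried batch whose keys are still inside the window is dropped on the second pass. Anything older than the window passes through again, so the capacity has to be sized against the longest retry interval expected.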

