PATTERN Cited by 1 source

CloudWatch Metric Streams to VPC OpenTelemetry collector¶

A composite reference architecture for push-based, vendor-neutral metric ingestion that keeps all data inside the customer's VPC. It chains five AWS primitives: CloudWatch Metric Streams → Amazon Data Firehose → Lambda transform → internal NLB → OpenTelemetry collector on EC2. The OpenTelemetry collector then fans out to multiple observability backends (Grafana Cloud, AWS X-Ray, Honeycomb, Lightstep, etc.) via its exporter stage.

This pattern is the canonical answer to three intersecting requirements: (1) push-based monitoring to escape pull-based API throttling; (2) vendor-neutral observability via OpenTelemetry's open formats; (3) VPC-internal data path for data-privacy / regulatory reasons.

Architecture¶

┌─────────────────────────────────────────────────────────────────────┐
│                      AWS account / region                            │
│                                                                      │
│  ┌──────────────────────────────┐                                    │
│  │ CloudWatch                    │                                    │
│  │  └─ Metric Streams (JSON)     │                                    │
│  └──────────────┬───────────────┘                                    │
│                 │ push                                                │
│                 ▼                                                     │
│  ┌──────────────────────────────┐    ┌─────────────────────────┐     │
│  │ Amazon Data Firehose          │───▶│ S3 (fallback / 0-cost)   │     │
│  │  (delivery stream)            │    └─────────────────────────┘     │
│  └──────────────┬───────────────┘                                    │
│                 │ sync invoke (transform)                            │
│                 ▼                                                     │
│  ┌──────────────────────────────┐                                    │
│  │ Lambda transform              │                                    │
│  │  (VPC-attached)               │                                    │
│  └──────────────┬───────────────┘                                    │
│                 │ HTTP push                                           │
│                 ▼                                                     │
│  ┌─────────────  VPC  ───────────────────────────────────────────┐  │
│  │ ┌──────────────────────────┐                                  │  │
│  │ │ Internal NLB              │                                 │  │
│  │ │  (private subnets)        │                                 │  │
│  │ └────────────┬─────────────┘                                  │  │
│  │              ▼                                                 │  │
│  │ ┌──────────────────────────────────────────────────────────┐  │  │
│  │ │ EC2 fleet — OpenTelemetry collector containers            │  │  │
│  │ │  Receivers → Processors → Exporters                       │  │  │
│  │ └────────────┬─────────────────────────────────────────────┘  │  │
│  └──────────────┼─────────────────────────────────────────────────┘  │
│                 │ exporter fan-out                                    │
│                 ▼                                                     │
│  ┌──────────┬──────────┬──────────┬──────────┐                        │
│  │ Grafana  │ X-Ray    │ Honeycomb│ Lightstep│  …                     │
│  │ Cloud    │          │          │          │                        │
│  └──────────┴──────────┴──────────┴──────────┘                        │
└─────────────────────────────────────────────────────────────────────┘
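
The collector tier at the bottom of the diagram can be sketched as configuration. The Python dict below mirrors the layout of a collector YAML file (receivers, processors, exporters, and a service pipeline that wires them together) and adds a small wiring check. It is a minimal sketch, not the configuration from the source: the backend endpoints, the exporter instance names, the `limit_mib` value, and the `validate` helper are all illustrative assumptions.

```python
# Collector config held as a dict that mirrors the YAML layout.
# Endpoints and exporter instance names are placeholders (assumptions).
COLLECTOR_CONFIG = {
    "receivers": {
        # OTLP over HTTP; 4318 is the conventional OTLP/HTTP port.
        "otlp": {"protocols": {"http": {"endpoint": "0.0.0.0:4318"}}},
    },
    "processors": {
        "memory_limiter": {"check_interval": "1s", "limit_mib": 1024},
        "batch": {},
    },
    "exporters": {
        # "type/name" declares a named instance of an exporter type.
        "otlphttp/backend_a": {"endpoint": "https://backend-a.example.com"},
        "otlphttp/backend_b": {"endpoint": "https://backend-b.example.com"},
    },
    "service": {
        "pipelines": {
            "metrics": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch"],
                # Listing several exporters is what produces the fan-out.
                "exporters": ["otlphttp/backend_a", "otlphttp/backend_b"],
            }
        }
    },
}


def validate(config: dict) -> list:
    """Return a list of wiring problems; an empty list means none were found."""
    problems = []
    for name, pipeline in config["service"]["pipelines"].items():
        for stage in ("receivers", "processors", "exporters"):
            for component in pipeline.get(stage, []):
                if component not in config.get(stage, {}):
                    problems.append(f"{name}: undefined {stage[:-1]} {component!r}")
        processors = pipeline.get("processors", [])
        if "memory_limiter" in processors and processors[0] != "memory_limiter":
            problems.append(f"{name}: memory_limiter should be the first processor")
    return problems
```

Calling `validate(COLLECTOR_CONFIG)` returns an empty list for the config above. The check only covers pipeline wiring and processor order; it does not replace the collector's own config validation.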

Component roles¶

Component	Role	Why this primitive
CloudWatch Metric Streams	Source-side push of CloudWatch metrics in near real time	Inverts the pull failure mode; supports OTel 0.7 / 1.0 / JSON
Amazon Data Firehose	Buffer + transform + deliver pipe; managed	The only delivery target Metric Streams supports
S3	Mandatory Firehose fallback destination	Required by Firehose; zero-cost in steady state
Lambda transform	Bridge from public Firehose HTTP to private VPC endpoint	Resolves Firehose's "HTTP endpoint must be public" constraint
Internal NLB	VPC-internal Layer-4 ingress for the collector fleet	Stable internal endpoint, TCP fan-out across collectors
EC2	Hosts collector containers in private subnets	Customer-managed compute; no managed-OTel collector service
OTel collector	Receive → process → fan out to multiple backends	Vendor-neutral central hub

Why this composition (not alternatives)¶

The pattern's structural choices fall out from the requirements:

Why Metric Streams + Firehose, not direct CloudWatch scrape? Pull-based scraping hits CloudWatch API throttling at scale and amplifies cost, because every scrape issues API calls per metric. Metric Streams pushes each datapoint once, with no per-metric API call. See concepts/push-vs-pull-monitoring.
Why a Lambda transform instead of Firehose's native HTTP destination? Firehose's HTTP endpoint destination is public-only: "these endpoints must be public — they cannot be private endpoints inside a VPC." The customer's data-privacy requirement forced VPC-internal delivery. See patterns/firehose-lambda-transform-as-vpc-bridge.
Why NLB instead of ALB? OTel collector typically speaks HTTP/gRPC over TCP at fixed ports; NLB's Layer-4 model is the cleaner fit and preserves source IP. ALB would also work but pays a higher per-request cost without adding value here.
Why a self-hosted collector instead of Amazon Managed Service for Prometheus? AMP is a metrics backend, not a fan-out hub, and the customer wanted vendor-neutral export to Grafana Cloud / Honeycomb / X-Ray simultaneously. Vendor-neutrality is the load-bearing requirement, not just AWS-native ingestion.
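
The Lambda bridge described above can be sketched as a Firehose data-transformation handler. Firehose passes a batch of base64-encoded records and expects each `recordId` back with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. The sketch below is a minimal illustration under stated assumptions, not the code from the source: the `COLLECTOR_URL` environment variable, its default value, and the `post_to_collector` helper are hypothetical. The payload is forwarded unchanged; translating it into a format the collector's receiver accepts is omitted here.

```python
import base64
import os
import urllib.request

# Hypothetical setting: URL of the internal NLB in front of the collectors.
COLLECTOR_URL = os.environ.get(
    "COLLECTOR_URL", "http://collector.internal.example:4318/v1/metrics"
)


def post_to_collector(body: bytes, url: str = COLLECTOR_URL) -> None:
    """POST one payload to the collector; urlopen raises on HTTP error statuses."""
    request = urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        response.read()


def handler(event, context, send=post_to_collector):
    """Firehose transformation handler: forward each record, report per-record status."""
    results = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"])
        try:
            send(payload)
            result = "Ok"
        except Exception:
            # Firehose treats the record as failed and handles it per its error settings.
            result = "ProcessingFailed"
        results.append(
            {"recordId": record["recordId"], "result": result, "data": record["data"]}
        )
    return {"records": results}
```

The `send` parameter exists only so the forwarding step can be replaced in tests; Lambda itself calls `handler(event, context)`. Because the function must reach the internal NLB, it has to be attached to the VPC, as the architecture diagram notes.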

Trade-offs vs simpler alternatives¶

Alternative	What you lose	What you gain
CloudWatch alarms only	Vendor-neutrality, Grafana / Honeycomb / Lightstep integration	Simpler, AWS-native, no infra to run
Prometheus + CloudWatch exporter (pull)	Low-latency push delivery; freedom from API throttling and per-call cost amplification at scale	Familiar Prometheus tooling; in-VPC by default
Metric Streams → Firehose → public OTel-as-a-Service	Data-stays-in-VPC property	Skip Lambda + NLB; lower complexity
Self-hosted Prometheus + push exporter	Native CloudWatch integration; managed Metric Streams primitive	Single-vendor monitoring stack

The pattern wins specifically when all three of (push, vendor-neutral, VPC-internal) are required. Drop any one of those and a simpler architecture becomes preferable.

Operational concerns¶

Lambda transform cold-start on critical path — Firehose invokes the transform synchronously, so a cold start delays the batch and back-pressures Firehose's buffer. Provisioned concurrency mitigates this.
Firehose buffer + Lambda concurrency tuning — too small a buffer + too few Lambdas = throttle; too large a buffer = delivery latency tail.
Collector fleet sizing — collector throughput is bounded by per-instance CPU + RAM; multiple receivers + heavy processors compound. NLB target-group health-checks + ASG-driven autoscaling are the canonical sizing levers.
Exporter back-pressure — slow downstream backends back-pressure into the collector's queue, then the collector's processor stage, then the receiver, until the receiver rejects Lambda pushes and the failure propagates back to Firehose's buffer. The collector's memory_limiter processor is the canonical safety valve.
Format choice (OTel 1.0 vs JSON at the Metric Streams layer) — OTel 1.0 lets the collector ingest natively; JSON requires the Lambda transform to translate. JSON is simpler to parse in Lambda code; OTel 1.0 is leaner end-to-end if no translation is needed.
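
The translation step in the format-choice item can be illustrated with a short sketch that maps one Metric Streams JSON record to an OTLP/JSON metrics payload. It is hedged on several assumptions: the input field names follow the Metric Streams JSON output format as publicly documented, the metric naming scheme (`namespace/metric_name`) is invented for this example, the CloudWatch unit string is passed through untranslated, and representing min and max as the 0.0 and 1.0 quantiles of a summary point is a mapping chosen for this sketch rather than one taken from the source.

```python
import json


def to_otlp(line: str) -> dict:
    """Translate one Metric Streams JSON record into an OTLP/JSON metrics payload."""
    m = json.loads(line)
    stats = m["value"]  # CloudWatch statistic set: min, max, sum, count
    point = {
        "attributes": [
            {"key": key, "value": {"stringValue": str(val)}}
            for key, val in sorted(m.get("dimensions", {}).items())
        ],
        # CloudWatch timestamps are epoch milliseconds; OTLP uses nanoseconds.
        # OTLP/JSON encodes 64-bit integers as strings.
        "timeUnixNano": str(m["timestamp"] * 1_000_000),
        "count": str(int(stats["count"])),
        "sum": stats["sum"],
        "quantileValues": [
            {"quantile": 0.0, "value": stats["min"]},
            {"quantile": 1.0, "value": stats["max"]},
        ],
    }
    metric = {
        # Naming scheme is an assumption made for this sketch.
        "name": f'{m["namespace"]}/{m["metric_name"]}',
        "unit": m.get("unit", ""),
        "summary": {"dataPoints": [point]},
    }
    resource = {
        "attributes": [
            {"key": "cloud.account.id", "value": {"stringValue": m["account_id"]}},
            {"key": "cloud.region", "value": {"stringValue": m["region"]}},
        ]
    }
    return {
        "resourceMetrics": [
            {"resource": resource, "scopeMetrics": [{"metrics": [metric]}]}
        ]
    }
```

A Firehose record from a JSON-format stream carries newline-delimited objects, so a transform would typically call `to_otlp` once per non-empty line and either post each result or merge them into one request.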

Hard problems¶

At-least-once duplication — Firehose retries + Lambda retries can produce duplicate metric records under transient failures. Idempotent ingestion in the collector (deduping by metric + timestamp) is non-trivial.
Cross-account / cross-region streaming — Metric Streams is per-region; a single OTel collector aggregating across regions requires per-region Firehose + Lambda + cross-region push.
Mixed-source aggregation — combining the Metric Streams push with traces + logs from in-cluster OTel SDKs requires the collector to multiplex receivers, which needs careful pipeline config to avoid receiver-blocking-on-slow-exporter.
Configuration management for the collector — collector config is YAML; deploying it across an EC2 fleet, validating it before rollout, and rolling back on regression is its own ops surface (typically solved with the same config-as-code primitive the rest of the customer's infra uses).
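
The "metric + timestamp" deduplication mentioned in the first item above can be sketched as a bounded set of recently seen datapoint keys. This is an illustration under assumptions, not something the source provides: the `RecentKeys` class and `datapoint_key` function are hypothetical names, the key fields assume the Metric Streams JSON record layout, and the cache lives in a single process. That last point is part of why the problem is hard: retried records can arrive at different collector instances behind the NLB, where a per-process cache cannot see them.

```python
from collections import OrderedDict


class RecentKeys:
    """Bounded set of recently seen keys; the least recently seen is evicted first."""

    def __init__(self, capacity: int = 100_000):
        self.capacity = capacity
        self._seen = OrderedDict()

    def first_time(self, key) -> bool:
        """Return True if the key has not been seen recently, and remember it."""
        if key in self._seen:
            self._seen.move_to_end(key)  # refresh recency for repeated keys
            return False
        self._seen[key] = None
        if len(self._seen) > self.capacity:
            self._seen.popitem(last=False)  # drop the oldest entry
        return True


def datapoint_key(m: dict) -> tuple:
    """Identity of one datapoint: metric identity plus its timestamp."""
    return (
        m["account_id"],
        m["region"],
        m["namespace"],
        m["metric_name"],
        tuple(sorted(m.get("dimensions", {}).items())),
        m["timestamp"],
    )
```

A caller would forward a record only when `seen.first_time(datapoint_key(record))` is true. The capacity bounds memory, so a duplicate that arrives after its key has been evicted still passes through.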

Seen in¶

sources/2026-05-13-aws-streaming-cloudwatch-metrics-to-vpc-based-opentelemetry-collectors-using-lambda — first canonical wiki home. The customer (unnamed) had a data-privacy requirement that "the metric data and the OpenTelemetry collector" remain VPC-internal, plus a push-based-monitoring requirement driven by Prometheus + AWS CloudWatch exporter throttling under load. The post documents the full five-primitive composition (CloudWatch Metric Streams → Firehose → Lambda transform → internal NLB → EC2-hosted OTel collector → fan-out to Grafana Cloud / X-Ray / Honeycomb / Lightstep) and provides a CloudFormation template + AWS-CLI walkthrough at github.com/aws-samples/sample-cloudwatch-metrics-stream-otel-transformer. No latency / volume / fleet-sizing operational numbers disclosed — this is a reference architecture, not a retrospective.

systems/amazon-cloudwatch-metric-streams — source primitive.
systems/amazon-data-firehose — delivery substrate.
systems/aws-lambda — bridge compute.
systems/aws-nlb — VPC-internal ingress.
systems/opentelemetry — central-hub framework.
systems/aws-distro-for-opentelemetry — AWS-bundled OTel build.
concepts/push-vs-pull-monitoring — driving architectural axis.
concepts/opentelemetry-collector-three-stage-pipeline — the receiver/processor/exporter abstraction the collector tier provides.
patterns/firehose-lambda-transform-as-vpc-bridge — the load-bearing building block.