PATTERN Cited by 1 source
CloudWatch Metric Streams to VPC OpenTelemetry collector¶
A composite reference architecture for push-based, vendor-neutral metric ingestion that keeps all data inside the customer's VPC. Five AWS primitives chained: CloudWatch Metric Streams → Amazon Data Firehose → Lambda transform → internal NLB → OpenTelemetry collector on EC2. The OpenTelemetry collector then fans out to multiple observability backends (Grafana Cloud, AWS X-Ray, Honeycomb, Lightstep, etc.) via its exporter stage.
This pattern is the canonical answer to three intersecting requirements: (1) push-based monitoring to escape pull-based API throttling; (2) vendor-neutral observability via OpenTelemetry's open formats; (3) VPC-internal data path for data-privacy / regulatory reasons.
Architecture¶
┌─────────────────────────────────────────────────────────────────────┐
│ AWS account / region │
│ │
│ ┌──────────────────────────────┐ │
│ │ CloudWatch │ │
│ │ └─ Metric Streams (JSON) │ │
│ └──────────────┬───────────────┘ │
│ │ push │
│ ▼ │
│ ┌──────────────────────────────┐ ┌─────────────────────────┐ │
│ │ Amazon Data Firehose │───▶│ S3 (fallback / 0-cost) │ │
│ │ (delivery stream) │ └─────────────────────────┘ │
│ └──────────────┬───────────────┘ │
│ │ sync invoke (transform) │
│ ▼ │
│ ┌──────────────────────────────┐ │
│ │ Lambda transform (VPC-attached) │ │
│ └──────────────┬───────────────┘ │
│ │ HTTP push │
│ ▼ │
│ ┌───────────── VPC ───────────────────────────────────────────┐ │
│ │ ┌──────────────────────────┐ │ │
│ │ │ Internal NLB │ │ │
│ │ │ (private subnets) │ │ │
│ │ └────────────┬─────────────┘ │ │
│ │ ▼ │ │
│ │ ┌──────────────────────────────────────────────────────────┐ │ │
│ │ │ EC2 fleet — OpenTelemetry collector containers │ │ │
│ │ │ Receivers → Processors → Exporters │ │ │
│ │ └────────────┬─────────────────────────────────────────────┘ │ │
│ └──────────────┼─────────────────────────────────────────────────┘ │
│ │ exporter fan-out │
│ ▼ │
│ ┌──────────┬──────────┬──────────┬──────────┐ │
│ │ Grafana │ X-Ray │ Honeycomb│ Lightstep│ … │
│ │ Cloud │ │ │ │ │
│ └──────────┴──────────┴──────────┴──────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Component roles¶
| Component | Role | Why this primitive |
|---|---|---|
| CloudWatch Metric Streams | Source-side push of every CloudWatch metric, sub-minute | Inverts the pull failure mode; supports OTel 0.7 / 1.0 / JSON |
| Amazon Data Firehose | Buffer + transform + deliver pipe; managed | The only delivery target Metric Streams supports |
| S3 | Mandatory Firehose fallback destination | Required by Firehose; zero-cost in steady state |
| Lambda transform | Bridge from public Firehose HTTP to private VPC endpoint | Resolves Firehose's "HTTP endpoint must be public" constraint |
| Internal NLB | VPC-internal Layer-4 ingress for the collector fleet | Stable internal endpoint, TCP fan-out across collectors |
| EC2 | Hosts collector containers in private subnets | Customer-managed compute; no managed-OTel collector service |
| OTel collector | Receive → process → fan out to multiple backends | Vendor-neutral central hub |
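When the stream uses the JSON output format, something between Firehose and the collector has to reshape each record into a form the collector can ingest. The following is a minimal sketch of that translation, not the sample repository's code. It assumes the documented Metric Streams JSON fields (`namespace`, `metric_name`, `dimensions`, `timestamp` in epoch milliseconds, and a `value` object with `min`/`max`/`sum`/`count`) and targets the OTLP/JSON summary shape; the function name `to_otlp_summary`, the `namespace/metric_name` naming scheme, and the pass-through of the CloudWatch unit string are illustrative assumptions.

```python
def to_otlp_summary(record: dict) -> dict:
    """Map one Metric Streams JSON record to an OTLP/JSON metrics payload.

    Hypothetical helper: the field layout follows the OTLP JSON encoding as
    understood here and should be checked against the collector's receiver.
    """
    value = record["value"]
    # Metric Streams timestamps are epoch milliseconds; OTLP uses nanoseconds,
    # encoded as a decimal string in JSON because it is a 64-bit integer.
    time_ns = str(int(record["timestamp"]) * 1_000_000)

    # CloudWatch dimensions become data point attributes (sorted for stable output).
    attributes = [
        {"key": k, "value": {"stringValue": str(v)}}
        for k, v in sorted(record.get("dimensions", {}).items())
    ]

    point = {
        "timeUnixNano": time_ns,
        "count": str(int(value["count"])),
        "sum": value["sum"],
        # min and max are carried as the 0.0 and 1.0 quantiles of a summary.
        "quantileValues": [
            {"quantile": 0.0, "value": value["min"]},
            {"quantile": 1.0, "value": value["max"]},
        ],
        "attributes": attributes,
    }

    resource_attributes = [
        {"key": "cloud.account.id", "value": {"stringValue": record["account_id"]}},
        {"key": "cloud.region", "value": {"stringValue": record["region"]}},
    ]

    metric = {
        # Assumed naming scheme; pick whatever the backends expect.
        "name": f'{record["namespace"]}/{record["metric_name"]}',
        # Assumption: the CloudWatch unit string is passed through unchanged.
        "unit": record.get("unit", ""),
        "summary": {"dataPoints": [point]},
    }

    return {
        "resourceMetrics": [
            {
                "resource": {"attributes": resource_attributes},
                "scopeMetrics": [{"metrics": [metric]}],
            }
        ]
    }
```

Choosing an OpenTelemetry output format at the stream would move this mapping out of customer code, which is the trade-off the format choice turns on.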
Why this composition (not alternatives)¶
The pattern's structural choices fall out from the requirements:
- Why Metric Streams + Firehose, not direct CloudWatch scrape? Pull-based scrape hits API throttling at scale and amplifies cost. Metric Streams pushes once, no per-metric API call. See concepts/push-vs-pull-monitoring.
- Why Lambda transform instead of Firehose's native HTTP destination? Firehose's HTTP endpoint destination is public-only: "these endpoints must be public — they cannot be private endpoints inside a VPC." The customer's data-privacy requirement therefore forced VPC-internal delivery through a VPC-attached Lambda. See patterns/firehose-lambda-transform-as-vpc-bridge.
- Why NLB instead of ALB? OTel collector typically speaks HTTP/gRPC over TCP at fixed ports; NLB's Layer-4 model is the cleaner fit and preserves source IP. ALB would also work but pays a higher per-request cost without adding value here.
- Why a self-hosted collector instead of Amazon Managed Service for Prometheus? AMP is a metrics backend, not a fan-out hub, and the customer wanted vendor-neutral export to Grafana Cloud / Honeycomb / X-Ray simultaneously. Vendor-neutrality is the load-bearing requirement, not just AWS-native ingestion.
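The Lambda bridge described above can be sketched as a Firehose data-transformation handler. This is an illustration under stated assumptions, not the code from the sample repository: `COLLECTOR_URL` is a hypothetical internal NLB address, the payload is forwarded as-is (any format translation would happen before `send`), and marking successfully pushed records as `Dropped` is one assumed way to keep the S3 destination empty in steady state. The event and response shapes (`records`, `recordId`, base64 `data`, and the `Ok` / `Dropped` / `ProcessingFailed` results) follow the Firehose transformation contract.

```python
import base64
import urllib.request

# Hypothetical internal NLB DNS name and path; replace with the real endpoint.
COLLECTOR_URL = "http://collector-nlb.example.internal:8080/ingest"


def post_bytes(payload: bytes, url: str = COLLECTOR_URL, timeout: float = 5.0) -> None:
    """POST one payload to the collector; raises on network or HTTP errors."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/octet-stream"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response.read()


def bridge(event: dict, send=post_bytes) -> dict:
    """Push each Firehose record into the VPC and report a per-record result."""
    results = []
    for record in event["records"]:
        try:
            send(base64.b64decode(record["data"]))
            # Assumption: delivered records are dropped so nothing lands in S3.
            result = "Dropped"
        except Exception:
            # Failed records are reported to Firehose, which can write them
            # to the S3 bucket for later inspection or replay.
            result = "ProcessingFailed"
        results.append(
            {"recordId": record["recordId"], "result": result, "data": record["data"]}
        )
    return {"records": results}


def lambda_handler(event, context):
    # Entry point Firehose invokes synchronously with a batch of records.
    return bridge(event)
```

Keeping the network call behind the `send` parameter makes the handler testable without a VPC, and it keeps the per-record result mapping separate from transport details.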
Trade-offs vs simpler alternatives¶
| Alternative | What you lose | What you gain |
|---|---|---|
| CloudWatch alarms only | Vendor-neutrality, Grafana / Honeycomb / Lightstep integration | Simpler, AWS-native, no infra to run |
| Prometheus + CloudWatch exporter (pull) | Sub-minute latency, freedom from API throttling at scale, freedom from per-call cost amplification | Familiar Prometheus tooling; in-VPC by default |
| Metric Streams → Firehose → public OTel-as-a-Service | Data-stays-in-VPC property | Skip Lambda + NLB; lower complexity |
| Self-hosted Prometheus + push exporter | Native CloudWatch integration; managed Metric Streams primitive | Single-vendor monitoring stack |
The pattern wins specifically when all three of (push, vendor-neutral, VPC-internal) are required. Drop any one of those and a simpler architecture becomes preferable.
Operational concerns¶
- Lambda transform cold start on the critical path — Firehose invokes the transform synchronously, so a cold start back-pressures Firehose's buffer. Provisioned concurrency mitigates this.
- Firehose buffer + Lambda concurrency tuning — too small a buffer + too few Lambdas = throttle; too large a buffer = delivery latency tail.
- Collector fleet sizing — collector throughput is bounded by per-instance CPU + RAM; multiple receivers + heavy processors compound. NLB target-group health-checks + ASG-driven autoscaling are the canonical sizing levers.
- Exporter back-pressure — slow downstream backends back-pressure into the collector's queue, then the collector processor stage, then the receiver — eventually rejecting Lambda pushes back into Firehose's buffer. The collector's memory_limiter processor is the canonical safety valve.
- Format choice (OTel 1.0 vs JSON at the Metric Streams layer) — OTel 1.0 lets the collector ingest natively; JSON requires the Lambda transform to translate. JSON is simpler Lambda code; OTel 1.0 is leaner end-to-end if no transform is needed.
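The buffer and concurrency interaction above can be approximated with Little's law: steady-state concurrency is roughly the invocation rate times the average invocation duration. The sketch below is a back-of-envelope estimate only. It assumes a single Firehose stream that flushes a batch when either the size hint or the interval hint is reached, and it ignores retries, cold starts, and any internal Firehose parallelism; the function and parameter names are illustrative.

```python
import math


def estimated_lambda_concurrency(
    throughput_bytes_per_s: float,
    buffer_bytes: float,
    buffer_interval_s: float,
    avg_duration_s: float,
) -> int:
    """Rough steady-state concurrency for the transform Lambda."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    # A batch is flushed when the buffer fills or the interval elapses,
    # whichever happens first.
    seconds_to_fill = buffer_bytes / throughput_bytes_per_s
    flush_period_s = min(seconds_to_fill, buffer_interval_s)
    invocations_per_s = 1.0 / flush_period_s
    # Little's law: in-flight invocations = arrival rate * time in system.
    return max(1, math.ceil(invocations_per_s * avg_duration_s))
```

For example, 2 MB/s of metric data with a 1 MB buffer flushes about twice per second; at 0.8 s per invocation that is roughly 1.6 invocations in flight, so two concurrent executions. Shrinking the buffer raises the invocation rate and the required concurrency, while growing it adds up to one flush period of delivery latency.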
Hard problems¶
- At-least-once duplication — Firehose retries + Lambda retries can produce duplicate metric records under transient failures. Idempotent ingestion in the collector (deduping by metric + timestamp) is non-trivial.
- Cross-account / cross-region streaming — Metric Streams is per-region; a single OTel collector aggregating across regions requires per-region Firehose + Lambda + cross-region push.
- Mixed-source aggregation — combining the Metric Streams push with traces + logs from in-cluster OTel SDKs requires the collector to multiplex receivers, which needs careful pipeline config to avoid receiver-blocking-on-slow-exporter.
- Configuration management for the collector — collector config is YAML; deploying it across an EC2 fleet, validating it before rollout, and rolling back on regression is its own ops surface (typically solved with the same config-as-code primitive the rest of the customer's infra uses).
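To make the duplication problem concrete, here is a minimal sketch of deduplication keyed on metric identity plus timestamp, with a bounded memory of recently seen keys. It is an assumption-laden illustration: the class name `RecentKeyDeduper`, the key fields (taken from the Metric Streams JSON record shape), the TTL, and the placement of the logic are all hypothetical, and the sketch does not assume the stock collector provides this behavior. It also only works within a single process, which is part of why the problem is non-trivial across a load-balanced fleet.

```python
from collections import OrderedDict


class RecentKeyDeduper:
    """Remember recently seen (metric, dimensions, timestamp) keys for ttl_ms."""

    def __init__(self, ttl_ms: int, max_keys: int = 100_000):
        self._ttl_ms = ttl_ms
        self._max_keys = max_keys
        # Maps key -> time first seen; insertion order doubles as age order
        # as long as now_ms never goes backwards.
        self._seen = OrderedDict()

    @staticmethod
    def _key(record: dict) -> tuple:
        dimensions = tuple(sorted(record.get("dimensions", {}).items()))
        return (
            record["namespace"],
            record["metric_name"],
            dimensions,
            record["timestamp"],
        )

    def _evict_expired(self, now_ms: int) -> None:
        while self._seen:
            oldest_key = next(iter(self._seen))
            if now_ms - self._seen[oldest_key] <= self._ttl_ms:
                break
            self._seen.popitem(last=False)

    def is_duplicate(self, record: dict, now_ms: int) -> bool:
        """Return True if this record's key was seen within the TTL window."""
        self._evict_expired(now_ms)
        key = self._key(record)
        if key in self._seen:
            return True
        self._seen[key] = now_ms
        if len(self._seen) > self._max_keys:
            # Bound memory by forgetting the oldest key.
            self._seen.popitem(last=False)
        return False
```

The TTL has to exceed the longest plausible retry delay, and a retried batch that lands on a different instance will not be caught, so a shared store or backend-side idempotency would be needed for a stronger guarantee.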
Seen in¶
- sources/2026-05-13-aws-streaming-cloudwatch-metrics-to-vpc-based-opentelemetry-collectors-using-lambda — first canonical wiki home. The customer (unnamed) had a data-privacy requirement that "the metric data and the OpenTelemetry collector" remain VPC-internal, plus a push-based-monitoring requirement driven by Prometheus + AWS CloudWatch exporter throttling under load. The post documents the full five-primitive composition (CloudWatch Metric Streams → Firehose → Lambda transform → internal NLB → EC2-hosted OTel collector → fan-out to Grafana Cloud / X-Ray / Honeycomb / Lightstep) and provides a CloudFormation template + AWS-CLI walkthrough at github.com/aws-samples/sample-cloudwatch-metrics-stream-otel-transformer. No latency / volume / fleet-sizing operational numbers disclosed — this is a reference architecture, not a retrospective.
Related¶
- systems/amazon-cloudwatch-metric-streams — source primitive.
- systems/amazon-data-firehose — delivery substrate.
- systems/aws-lambda — bridge compute.
- systems/aws-nlb — VPC-internal ingress.
- systems/opentelemetry — central-hub framework.
- systems/aws-distro-for-opentelemetry — AWS-bundled OTel build.
- concepts/push-vs-pull-monitoring — driving architectural axis.
- concepts/opentelemetry-collector-three-stage-pipeline — the receiver/processor/exporter abstraction the collector tier provides.
- patterns/firehose-lambda-transform-as-vpc-bridge — the load-bearing building block.