AWS 2026-05-13 Tier 1

AWS — Streaming CloudWatch metrics to VPC-based OpenTelemetry collectors using Lambda

Summary

AWS Architecture Blog post (2026-05-13) describing a customer solution that bridges Amazon CloudWatch metrics into a self-hosted OpenTelemetry collector running inside the customer's own VPC. The architecture swaps a Prometheus-based pull pipeline (which hit API throttling and missed metrics under load) for a push pipeline built from four AWS primitives — CloudWatch Metric Streams → Amazon Data Firehose → Lambda → Network Load Balancer → OpenTelemetry collector on EC2 — and is anchored by one mechanism: the Lambda transform function as a bridge between Firehose's public-HTTP-only delivery and a private VPC endpoint. Because Amazon Data Firehose's HTTP endpoint destination only accepts public endpoints (it cannot deliver into a private VPC), the team uses Firehose's data-transformation feature with Lambda — synchronously invoked per record batch, the Lambda function pushes the metric payload through an internal NLB to OpenTelemetry collectors running on EC2 inside the VPC. The post's secondary contribution is naming the OpenTelemetry collector's three-stage pipeline — receivers → processors → exporters — as the central hub abstraction that lets the customer fan metrics out to multiple downstream backends (Grafana Cloud, AWS X-Ray, Lightstep, Honeycomb) without code changes. The post concludes that the push-based approach delivers sub-minute latency for real-time alerting while eliminating third-party licensing fees and removing the API-throttling failure mode of the prior Prometheus + CloudWatch-exporter pull pipeline.

Key takeaways

  • Pull-based monitoring throttled at scale; push-based fixed it. Verbatim: "Our customer's current monitoring solution with Prometheus and Amazon CloudWatch exporter using a pull-based approach resulted in higher API throttling. This caused metric loss and created gaps in observability data for business-critical systems. The frequent polling approach in this model also resulted in higher costs from API calls. This polling solution did not satisfy their requirement of sub-minute latency for real-time alerting." The architectural failure mode of pull-based observability at scale is named explicitly: CloudWatch-exporter polling generates per-metric API calls that hit throttling caps and drop data. Push-based metric streams, by contrast, deliver near real-time without API-call amplification — "reducing frequent polling and API calls, enabling near real-time data transmission, and potentially eliminating licensing costs from using third-party solutions." See concepts/push-vs-pull-monitoring. (Source: sources/2026-05-13-aws-streaming-cloudwatch-metrics-to-vpc-based-opentelemetry-collectors-using-lambda)

  • Firehose's HTTP-endpoint destination is public-only — Lambda is the bridge. The structural constraint that drives the whole architecture is named explicitly: "Amazon Data Firehose natively supports data delivery to HTTP endpoints, but these endpoints must be public — they cannot be private endpoints inside a VPC. To overcome this limitation, we use the Amazon Data Firehose transform configuration, that invokes a Lambda function synchronously, which then securely pushes the metrics through the NLB endpoint to the collector running within the VPC." The customer required "the metric data and the OpenTelemetry collector to be within their VPC" for data-privacy reasons; Lambda fills the public-to-private gap by performing the push to an internal NLB on Firehose's behalf, which presumably requires VPC-attached networking (the post does not spell out the function's network configuration). See patterns/firehose-lambda-transform-as-vpc-bridge. (Source: sources/2026-05-13-aws-streaming-cloudwatch-metrics-to-vpc-based-opentelemetry-collectors-using-lambda)

  • Firehose data transformation is a synchronous, batched Lambda invocation. "Amazon Data Firehose buffers incoming data before synchronously invoking the Lambda function that streams the metrics to the internal HTTP endpoint." The synchronous invocation contract means Firehose waits for the function's response before it moves on with the batch, so failures surface to Firehose's retry and at-least-once handling rather than being lost silently. This is a different Lambda role from the typical event-source-mapping consumer (Kinesis / SQS / DynamoDB Streams): for Firehose transforms, the function returns the records themselves, each with a per-record result status, so the return value is the delivery payload, not just success/fail. (Source: sources/2026-05-13-aws-streaming-cloudwatch-metrics-to-vpc-based-opentelemetry-collectors-using-lambda)

  • OpenTelemetry collector is the three-stage central hub. "The OpenTelemetry collector operates through three primary components that work together in a processing flow: Receivers accept data in specified formats (like Prometheus or OpenTelemetry Protocol (OTLP)) and translate it into OpenTelemetry's internal format; Processors manipulate and enrich the data as it flows through (filtering unnecessary data, batching for performance, transforming to mask sensitive information, or adding metadata like Kubernetes attributes); and Exporters send the processed data to destination backends such as Grafana Cloud, AWS X-Ray, Lightstep or Honeycomb." The receiver-processor-exporter triad is what turns the collector into a vendor-agnostic central hub: receivers normalise to the internal model, exporters fan out to N backends, and the in-between processor stage carries filtering, batching, and enrichment. See concepts/opentelemetry-collector-three-stage-pipeline. (Source: sources/2026-05-13-aws-streaming-cloudwatch-metrics-to-vpc-based-opentelemetry-collectors-using-lambda)

  • NLB as the VPC ingress for the OpenTelemetry collector fleet. Inside the VPC, the OpenTelemetry collector runs as a container on EC2 instances in private subnets; an internal Network Load Balancer distributes TCP traffic across the collector instances. "The NLB distributes TCP traffic to the OpenTelemetry collectors running on EC2 Instances in the internal subnet within the VPC." This is the VPC-internal sibling of the public-NLB-as-static-ingress pattern already canonicalised at systems/aws-nlb: same primitive (Layer-4 TCP load balancer with stable IP), different side of the boundary (here, internal subnet). (Source: sources/2026-05-13-aws-streaming-cloudwatch-metrics-to-vpc-based-opentelemetry-collectors-using-lambda)

  • CloudWatch Metric Streams support OTel 0.7 / 1.0 and JSON output formats. "With CloudWatch Metric Streams, you can stream metrics in OpenTelemetry 0.7, 1.0, and JSON formats. This architecture uses JSON format as the stream output." The native OTel-format support is significant: streams can in principle deliver directly to an OTel-format-receiving endpoint without any transformation. The architecture in this post chooses JSON instead, presumably because the Lambda transform inserts itself between Firehose and the collector and the function's logic is simpler over a self-described JSON envelope. (Source: sources/2026-05-13-aws-streaming-cloudwatch-metrics-to-vpc-based-opentelemetry-collectors-using-lambda)

  • S3 is a redundant, no-cost backup destination. "The S3 bucket is a redundant destination for the CloudWatch Streams. Because our Lambda transform function sends the data directly to OpenTelemetry endpoint, no metrics are sent to the S3 destination, and it does not incur any cost." Firehose requires a destination configuration regardless; S3 is wired in as the failover sink but never used in the steady state. This is the canonical Firehose deployment shape — every delivery stream nominally writes to S3, even when the primary destination is something else. (Source: sources/2026-05-13-aws-streaming-cloudwatch-metrics-to-vpc-based-opentelemetry-collectors-using-lambda)

  • Push-architecture benefits enumerated explicitly. The post enumerates the architectural advantages of push-based observability over pull: (1) "Event-driven architecture: The push approach transmits data in near real-time by triggering collection based on events, not periodic polling"; (2) "Cost efficiency: Push models are significantly more cost-effective than pull models. Instead of continuously scanning large datasets, systems only process and transmit data when relevant events occur, reducing both computational overhead and data transfer costs"; (3) "Scalability: The OpenTelemetry collector serves as a central hub that can scale horizontally to handle varying traffic volumes while providing at-least-once delivery guarantees with automatic retry mechanisms"; (4) "No licensing costs: The Apache 2.0 license is free and royalty-free"; (5) "No vendor lock-in: The permissive nature of Apache 2.0 means you're not tied to any specific vendor's implementation or support model." The Apache-2.0 + vendor-agnostic angle is the wedge against expensive third-party SaaS observability platforms. (Source: sources/2026-05-13-aws-streaming-cloudwatch-metrics-to-vpc-based-opentelemetry-collectors-using-lambda)
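
The Firehose-to-Lambda bridge described in the takeaways above can be illustrated with a minimal sketch. It is not the function from the post: the collector URL, the `post_to_collector` and `transform` helper names, and the choice to mark forwarded records as "Dropped" (so that nothing lands in the S3 destination, consistent with the post's statement that no metrics reach S3) are all assumptions made for illustration. The raw record bytes are forwarded unchanged; a real function would likely need to reshape them into whatever format the collector's receiver expects.

```python
import base64
import os
import urllib.request

# Hypothetical internal NLB address; not taken from the post.
COLLECTOR_URL = os.environ.get(
    "COLLECTOR_URL", "http://collector-nlb.internal.example:8080/metrics"
)


def post_to_collector(payload: bytes, url: str = COLLECTOR_URL, timeout: float = 5.0) -> None:
    """POST one decoded record to the collector; raises on network or HTTP errors."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response.read()


def transform(event: dict, send=post_to_collector) -> dict:
    """Forward each buffered record, then report a per-record result to Firehose."""
    results = []
    for record in event["records"]:
        try:
            # Firehose hands record data to the function base64-encoded.
            send(base64.b64decode(record["data"]))
            # Assumption: "Dropped" keeps forwarded records out of the S3 destination.
            result = "Dropped"
        except Exception:
            # Signals to Firehose that this record was not processed successfully.
            result = "ProcessingFailed"
        results.append(
            {"recordId": record["recordId"], "result": result, "data": record["data"]}
        )
    return {"records": results}


def lambda_handler(event, context):
    """Entry point invoked synchronously by the Firehose transform configuration."""
    return transform(event)
```

Passing the sender as a parameter keeps the per-record bookkeeping testable without a network; the response must echo every `recordId` Firehose sent, whatever the outcome.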

Systems extracted

  • systems/amazon-cloudwatch-metric-streams — CloudWatch's push-based metric streaming primitive; supports OTel 0.7 / 1.0 / JSON output formats; sub-minute latency; targets Firehose delivery streams. First canonical wiki naming.
  • systems/amazon-data-firehose — fully managed capture/transform/deliver service for streaming data; Lambda-based data transformation feature is load-bearing in this architecture; HTTP endpoint destination is public-only. First canonical wiki naming. (Formerly called Amazon Kinesis Data Firehose; rebranded to Amazon Data Firehose in February 2024.)
  • systems/aws-distro-for-opentelemetry — AWS's open-source distribution of OpenTelemetry; named in passing as the recommended on-ramp for AWS-native OTel adoption. First canonical wiki naming.
  • systems/aws-cloudwatch — extended: metric-streams primitive added to the page's surface inventory.
  • systems/aws-lambda — extended: Firehose-transform usage pattern (synchronous, batched, return-value-is-the-delivery-payload) added to the seen-in roster.
  • systems/aws-nlb — extended: internal-subnet ingress for VPC-private OTel collector fleet (TCP Layer-4 across EC2-hosted collectors).
  • systems/opentelemetry — extended: collector-internals disclosure (receiver / processor / exporter three-stage pipeline) and the VPC-self-host deployment shape.
  • systems/amazon-ec2 — collector-fleet substrate; runs the OTel collector containers in private subnets.
  • systems/aws-s3 — redundant Firehose destination (zero-cost in this architecture because no metrics are actually delivered to S3).
  • systems/prometheus — named as the prior pull-based pipeline component that hit API throttling.
  • systems/grafana-cloud, systems/honeycomb, systems/aws-x-ray — named exporter-destination examples showing the OTel-collector-as-fan-out role.

Concepts extracted

  • concepts/push-vs-pull-monitoring — the observability-altitude pull-vs-push trade-off; pull (Prometheus shape) gives query-frequency control but generates per-metric API calls and hits throttling at scale; push (Metric Streams shape) is event-driven, sub-minute, and avoids the API-call amplification failure mode. First canonical wiki home for the observability altitude. Distinct from concepts/pull-vs-push-streams which is the JS streams API altitude.
  • concepts/opentelemetry-collector-three-stage-pipeline — receivers → processors → exporters; the load-bearing internal abstraction that turns the OTel collector into a vendor-agnostic central hub. First canonical wiki home.
  • concepts/observability — extended; this source contributes the push-based metric-streams branch of the AWS observability surface.
  • concepts/api-throttling — extended; named as the failure mode of pull-based monitoring at scale.
  • concepts/vendor-lock-in — extended; OpenTelemetry's Apache-2.0 + open-format properties are explicitly named as the lock-in escape hatch.
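
To make the receivers → processors → exporters flow concrete, here is a toy model in Python. It is only a conceptual sketch: the real OpenTelemetry collector is a separate binary configured declaratively, and the `Pipeline` class, the `name=value` input format, and the helper functions below are all hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metric = Dict[str, object]  # stand-in for the collector's internal data model


@dataclass
class Pipeline:
    receiver: Callable[[str], List[Metric]]               # wire format -> internal model
    processors: List[Callable[[List[Metric]], List[Metric]]]  # filter / enrich in order
    exporters: List[Callable[[List[Metric]], None]]       # fan out to N backends

    def ingest(self, raw: str) -> None:
        batch = self.receiver(raw)
        for process in self.processors:
            batch = process(batch)
        for export in self.exporters:
            export(batch)


def parse_lines(raw: str) -> List[Metric]:
    """Toy receiver: one 'name=value' metric per line."""
    metrics = []
    for line in raw.splitlines():
        name, value = line.split("=", 1)
        metrics.append({"name": name, "value": float(value)})
    return metrics


def drop_debug(batch: List[Metric]) -> List[Metric]:
    """Toy filtering processor: discard metrics whose name starts with 'debug.'."""
    return [m for m in batch if not str(m["name"]).startswith("debug.")]


def add_environment(batch: List[Metric]) -> List[Metric]:
    """Toy enrichment processor: attach a metadata attribute to every metric."""
    return [{**m, "env": "prod"} for m in batch]


# Two in-memory "backends" stand in for real exporter destinations.
backend_a: List[Metric] = []
backend_b: List[Metric] = []
pipeline = Pipeline(
    receiver=parse_lines,
    processors=[drop_debug, add_environment],
    exporters=[backend_a.extend, backend_b.extend],
)
pipeline.ingest("cpu=0.5\ndebug.x=1\nmem=2")
```

Adding a third backend in this model means appending one more exporter; the receiver and processors are untouched, which is the property the hub abstraction is credited with.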

Patterns extracted

  • patterns/firehose-lambda-transform-as-vpc-bridge — the load-bearing pattern of this post: use Firehose's data-transformation hook (synchronous Lambda invocation per record batch) as a bridge from Firehose's public-only HTTP delivery into a private VPC endpoint. First canonical wiki home. Generalises beyond CloudWatch metrics — any case where you want Firehose-shaped streaming delivery into a VPC-internal HTTP service.
  • patterns/cloudwatch-metric-stream-to-vpc-otel-collector — the composite reference architecture: CloudWatch Metric Streams → Firehose (with S3 redundant sink) → Lambda transform → internal NLB → OTel collector on EC2 → fan-out to Grafana Cloud / X-Ray / Honeycomb / etc. First canonical wiki home.
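
One step the composite architecture implies is decoding the JSON-format Metric Streams payload inside the Lambda function before it is pushed on. The sketch below assumes the payload is newline-delimited JSON with `namespace`, `metric_name`, `dimensions`, `timestamp`, `unit`, and a `value` object holding `count`, `sum`, `max`, and `min`; that field layout is my reading of the CloudWatch documentation, not something the post shows, so verify it against a real stream. The function name and the output shape are hypothetical.

```python
import json
from typing import List


def parse_metric_stream_payload(payload: bytes) -> List[dict]:
    """Turn one decoded Firehose record into a list of flat datapoints."""
    datapoints = []
    for line in payload.decode("utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines between JSON documents
        doc = json.loads(line)
        stats = doc["value"]
        count = stats["count"]
        datapoints.append(
            {
                "name": f'{doc["namespace"]}/{doc["metric_name"]}',
                "dimensions": doc.get("dimensions", {}),
                "timestamp_ms": doc["timestamp"],
                # The stream carries summary statistics, so the mean is derived.
                "average": stats["sum"] / count if count else None,
                "max": stats["max"],
                "min": stats["min"],
                "unit": doc.get("unit"),
            }
        )
    return datapoints


# Illustrative payload with made-up values.
sample = (
    b'{"namespace":"AWS/EC2","metric_name":"CPUUtilization",'
    b'"dimensions":{"InstanceId":"i-0abc"},"timestamp":1700000000000,'
    b'"value":{"count":3,"sum":90.0,"max":40.0,"min":20.0},"unit":"Percent"}\n'
)
points = parse_metric_stream_payload(sample)
```

Flattening to one dictionary per datapoint keeps the later translation into a collector-facing format a simple mapping step.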

Operational numbers

  • "sub-minute latency for real-time alerting" — the customer requirement that drove the rejection of the pull-based Prometheus solution. The post does not disclose the actual measured end-to-end latency of the new push pipeline, only that it satisfies this bound.
  • "at-least-once delivery guarantees with automatic retry mechanisms" — the OpenTelemetry collector's delivery contract, cited as a scalability property.
  • No throughput / volume / cost numbers disclosed. The customer is not named, the metric volume is not disclosed, the per-region collector-fleet sizing is not disclosed. The post is a reference-architecture description, not an operational retrospective.

Caveats and what the post doesn't say

  • No latency numbers. Sub-minute is asserted as the requirement and as the satisfied outcome, but the actual end-to-end p50 / p99 from metric-emission to OTel-collector ingestion is not disclosed.
  • No collector-fleet sizing or scaling discussion. The collector "scales horizontally" is asserted; the scaling signal, the sizing math, and the cross-AZ topology are not disclosed.
  • No Lambda concurrency / cold-start discussion. The Firehose-transform Lambda is the synchronous critical path per record batch; cold-start latency on the synchronous-invocation path is plausibly load-bearing but not named.
  • No discussion of Firehose's at-least-once buffer behaviour on Lambda failures. Synchronous-invocation failures back-pressure into Firehose's buffer; the back-pressure envelope and the eventual S3-fallback behaviour aren't enumerated.
  • No security discussion of the metric-data-in-Lambda hop. The customer's data-privacy requirement was that "the metric data and the OpenTelemetry collector" be VPC-internal — but the Lambda is the synchronous bridge; whether the Lambda runs inside the VPC or just has VPC attachments isn't specified.
  • JSON-format choice rationale not made explicit. Metric Streams support OTel 0.7 / 1.0 / JSON; the post chooses JSON but doesn't say why over native OTel 1.0 (which would in principle let the collector ingest directly without the Lambda transform translating).
  • No vendor-lock-in cost numbers. "Eliminating third-party licensing fees" is asserted; the actual licence-fee delta is not quantified.
  • The post is a reference architecture + CloudFormation walkthrough, not a production retrospective. This source's signal is the architectural pattern (Firehose-Lambda bridge, push-vs-pull monitoring rationale, OTel collector internals) rather than measured production outcomes.
