CLOUDFLARE 2026-06-12

Scaling Security Insights: how we achieved a 10x increase in global scanning capacity

Summary

Cloudflare's Security Insights team needed to raise account/zone scanning throughput from 10 scans/second to more than 100 scans/second, both to enable automatic scanning for all free accounts and to double scanning frequency. The system consists of a scheduler that publishes to Apache Kafka and Go microservice checkers that consume from it and write results to a Postgres database via an internal API. It was bottlenecked by four things: Kafka's in-order consumption constraint, database round-trip overhead, cross-region API latency that exhausted connection pools, and spiky scheduling. The team reached >120 scans/second through five targeted architectural fixes, without adding Kafka partitions or other infrastructure.

Key Takeaways

  1. Batch-parallel consumption within partition ordering — although Kafka enforces per-partition ordering for consumption, nothing prevents consuming a batch of messages and processing them concurrently via goroutines. Trade-off: more re-work on crash, slightly higher memory. (Source: sources/2026-06-12-cloudflare-scaling-security-insights)

  2. Fast-lane / slow-lane consumer-group split — splitting a single consumer group into two (one skips slow messages, the other handles them) eliminates head-of-line blocking without adding Kafka partitions. Determination of fast-vs-slow is cheap (message metadata inspection). (Source: sources/2026-06-12-cloudflare-scaling-security-insights)

  3. Bulk INSERT with hybrid strategy — naive per-row INSERT with ON CONFLICT caused up to 500,000 round trips per API call. COPY into a temp table caused Postgres system-table bloat. The winning hybrid: COPY for large sets (seconds), batched INSERT for small sets (milliseconds). (Source: sources/2026-06-12-cloudflare-scaling-security-insights)

  4. Active-passive API follows primary database — running the API active-active across Portland and Amsterdam while the Postgres primary lived only in Portland meant 50ms+ RTT per query from Amsterdam, exhausting client connection pools. Switching to active-passive (API collocated with primary) eliminated latency and timeouts overnight. (Source: sources/2026-06-12-cloudflare-scaling-security-insights)

  5. Connection pool exhaustion caused by cross-region latency — high per-query latency from Amsterdam kept connections occupied longer, draining the client-side pool. Average API call: 10ms in Portland vs. ~3 seconds in Amsterdam. Load-balanced persistent connections meant exactly half the Kafka partitions were starved. (Source: sources/2026-06-12-cloudflare-scaling-security-insights)

  6. Adaptive rate-limited scheduling — a naive fixed rate limit can't accommodate growth (more accounts → overdue scans). The rate limit is recomputed every 30 minutes from current account/zone counts and per-tier scanning intervals, plus a buffer factor for downtime recovery. (Source: sources/2026-06-12-cloudflare-scaling-security-insights)

  7. Schedule zones independently of accounts — large accounts with many zones cascaded all zone scans into a single burst, saturating Kafka partitions. Decoupling zone scheduling from account scheduling and randomizing last_scheduled_at across existing entities eliminated spikes. (Source: sources/2026-06-12-cloudflare-scaling-security-insights)

  8. Understand before you scale — the team's core lesson: deeply understand existing system behaviour (logs, metrics, SQL queries, latency distributions) before adding resources. All five fixes were code/architecture changes, not capacity adds. (Source: sources/2026-06-12-cloudflare-scaling-security-insights)
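The batch-parallel idea from takeaway 1 can be sketched as follows. This is a minimal illustration, not the team's implementation: `Message`, `processBatch`, and `errScanFailed` are hypothetical names, and no real Kafka client is used. A batch taken from one partition is processed concurrently, and the offset is only advanced when every message in the batch succeeded, which is why a crash or failure means re-doing the whole batch.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Message is a hypothetical stand-in for a record read from one Kafka partition.
type Message struct {
	Offset  int64
	Payload string
}

// errScanFailed is a hypothetical error used by the demo handler.
var errScanFailed = errors.New("scan failed")

// processBatch runs handle on every message of the batch concurrently and
// waits for all of them. It returns the first error in offset order, or nil
// if every message succeeded. The caller commits the offset only on nil.
func processBatch(batch []Message, handle func(Message) error) error {
	var wg sync.WaitGroup
	errs := make([]error, len(batch)) // one slot per goroutine, so no locking is needed
	for i, msg := range batch {
		wg.Add(1)
		go func(i int, msg Message) {
			defer wg.Done()
			errs[i] = handle(msg)
		}(i, msg)
	}
	wg.Wait()
	for i, err := range errs {
		if err != nil {
			return fmt.Errorf("offset %d: %w", batch[i].Offset, err)
		}
	}
	return nil
}

func main() {
	batch := []Message{
		{Offset: 40, Payload: "zone-a"},
		{Offset: 41, Payload: "zone-b"},
		{Offset: 42, Payload: "zone-c"},
	}
	scan := func(m Message) error {
		if m.Payload == "" {
			return errScanFailed
		}
		return nil
	}
	if err := processBatch(batch, scan); err != nil {
		fmt.Println("batch failed, offset not committed:", err)
		return
	}
	// Commit one past the last offset only after the whole batch is done.
	fmt.Println("commit offset", batch[len(batch)-1].Offset+1)
}
```

Per-partition ordering of consumption is preserved because the next batch is not fetched until the current one has finished; only the processing inside a batch is unordered.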

Operational Numbers

Metric                         Before          After
-----------------------------  --------------  ---------------------------
Scan throughput                ~10 scans/sec   >120 scans/sec (peak)
Target improvement             -               10x
Kafka partitions per checker   30              30 (unchanged)
API latency (Portland)         ~10 ms          ~10 ms
API latency (Amsterdam)        ~3,000 ms       eliminated (active-passive)
Max insights per API call      500,000         500,000 (now fast)
Rate limit recalculation       -               every 30 min
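The periodic rate limit recalculation can be illustrated with a little arithmetic. The sketch below assumes the required rate is the sum, over tiers, of entity count divided by scan interval, multiplied by a buffer factor. The source confirms the inputs (account/zone counts, per-tier intervals, a buffer) but not the exact formula, so `Tier`, `requiredRate`, the tier names, the counts, and the 1.5 buffer are all hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Tier is a hypothetical description of one plan tier.
type Tier struct {
	Name     string
	Entities int           // accounts plus zones in this tier (made-up numbers below)
	Interval time.Duration // how often each entity should be scanned
}

// requiredRate returns the scans per second needed to scan every entity once
// per interval, scaled by buffer to leave headroom for catching up after downtime.
func requiredRate(tiers []Tier, buffer float64) float64 {
	var rate float64
	for _, t := range tiers {
		if t.Interval <= 0 {
			continue // skip misconfigured tiers rather than divide by zero
		}
		rate += float64(t.Entities) / t.Interval.Seconds()
	}
	return rate * buffer
}

func main() {
	tiers := []Tier{
		{Name: "free", Entities: 6048000, Interval: 7 * 24 * time.Hour},
		{Name: "paid", Entities: 864000, Interval: 24 * time.Hour},
	}
	fmt.Printf("target rate: %.1f scans/sec\n", requiredRate(tiers, 1.5))
}
```

In a scheduler, a function like this would be re-run on a timer (every 30 minutes in the source) with fresh counts, so the limit grows with the number of accounts and zones instead of leaving scans overdue.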

Architecture

┌───────────┐      ┌───────────┐      ┌──────────────┐      ┌──────────┐
│ Scheduler │─────▶│   Kafka   │─────▶│ Go Checkers  │─────▶│ API      │──▶ Postgres
│ (adaptive │      │ (30 parts)│      │ (fast + slow │      │ (active- │   (Portland)
│  rate lim)│      │           │      │  lanes, batch│      │  passive)│
└───────────┘      └───────────┘      │  goroutines) │      └──────────┘
                                      └──────────────┘
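The fast and slow lanes in the diagram can be sketched as a routing predicate. Both consumer groups read the same topic, and each consumer inspects cheap message metadata to decide whether the message belongs to its lane; otherwise it skips the message and moves on. The `Heavy` flag, `Lane` type, and `shouldHandle` function are hypothetical, since the source does not say which metadata field is inspected.

```go
package main

import "fmt"

// Lane identifies which consumer group a checker instance belongs to.
type Lane int

const (
	FastLane Lane = iota
	SlowLane
)

// ScanRequest is a hypothetical message; Heavy stands in for whatever
// metadata marks a scan as slow.
type ScanRequest struct {
	ID    string
	Heavy bool
}

// shouldHandle reports whether a consumer in the given lane should process
// req. Every request is claimed by exactly one of the two lanes.
func shouldHandle(lane Lane, req ScanRequest) bool {
	if req.Heavy {
		return lane == SlowLane
	}
	return lane == FastLane
}

func main() {
	topic := []ScanRequest{
		{ID: "a", Heavy: false},
		{ID: "b", Heavy: true},
		{ID: "c", Heavy: false},
	}
	names := map[Lane]string{FastLane: "fast", SlowLane: "slow"}
	// Each lane iterates over the same messages, as two consumer groups would.
	for _, lane := range []Lane{FastLane, SlowLane} {
		for _, req := range topic {
			if shouldHandle(lane, req) {
				fmt.Printf("%s lane handles %s\n", names[lane], req.ID)
			}
		}
	}
}
```

Because the fast lane never waits on a heavy message, light scans behind it in the same partition are no longer blocked, and no extra partitions are required.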

Caveats

  • The article does not discuss exactly-once semantics or how duplicate scans are handled when a batch crashes mid-processing.
  • The hybrid bulk-insert strategy's threshold between COPY and batched INSERT is not disclosed.
  • Adaptive rate limiting assumes relatively uniform scan cost; highly variable scan durations could still cause queue buildup.
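To make the hybrid bulk-insert caveat concrete, here is a minimal sketch of how such a switch might look. Since the source does not disclose the threshold, `copyThreshold` below is an arbitrary placeholder, and `chooseStrategy` and `chunks` are hypothetical helpers; no database calls are made.

```go
package main

import "fmt"

// Strategy names the write path chosen for one API call.
type Strategy string

const (
	BatchedInsert Strategy = "batched INSERT"
	Copy          Strategy = "COPY"
)

// copyThreshold is a placeholder value. The real cut-over point is not
// disclosed in the source.
const copyThreshold = 10000

// chooseStrategy picks COPY for large row sets and batched INSERT otherwise.
func chooseStrategy(rows int) Strategy {
	if rows >= copyThreshold {
		return Copy
	}
	return BatchedInsert
}

// chunks splits n rows into half-open [start, end) ranges of at most size
// rows, so each range can be sent as one multi-row INSERT statement.
func chunks(n, size int) [][2]int {
	if size <= 0 {
		return nil
	}
	var out [][2]int
	for start := 0; start < n; start += size {
		end := start + size
		if end > n {
			end = n
		}
		out = append(out, [2]int{start, end})
	}
	return out
}

func main() {
	for _, n := range []int{250, 500000} {
		fmt.Println(n, "rows ->", chooseStrategy(n))
	}
	fmt.Println("ranges for 250 rows in batches of 100:", chunks(250, 100))
}
```

The point of the split is that each path is used only where it is cheap: a few multi-row statements for small payloads, and a single bulk load for the rare very large ones.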
