CLOUDFLARE 2026-06-12

Scaling Security Insights: how we achieved a 10x increase in global scanning capacity

Summary

Cloudflare's Security Insights team needed to scale account and zone scanning throughput from 10 scans/second to more than 100 scans/second, both to enable automatic scanning for all free accounts and to double scanning frequency. The system consists of a scheduler that publishes to Apache Kafka and Go microservice checkers that consume from Kafka and write results to a Postgres database through an internal API. It was bottlenecked by Kafka's in-order consumption constraint, database round-trip overhead, cross-region API latency that exhausted connection pools, and spiky scheduling. The team reached more than 120 scans/second through five targeted architectural fixes, without adding Kafka partitions or other infrastructure.

Key Takeaways

Batch-parallel consumption within partition ordering — although Kafka enforces per-partition ordering for consumption, nothing prevents consuming a batch of messages and processing them concurrently via goroutines. Trade-off: more re-work on crash, slightly higher memory. (Source: sources/2026-06-12-cloudflare-scaling-security-insights)
Fast-lane / slow-lane consumer-group split — splitting a single consumer group into two (one skips slow messages, the other handles them) eliminates head-of-line blocking without adding Kafka partitions. Determination of fast-vs-slow is cheap (message metadata inspection). (Source: sources/2026-06-12-cloudflare-scaling-security-insights)
Bulk INSERT with hybrid strategy — naive per-row INSERT with ON CONFLICT caused up to 500,000 round trips per API call. COPY into a temp table caused Postgres system-table bloat. The winning hybrid: COPY for large sets (seconds), batched INSERT for small sets (milliseconds). (Source: sources/2026-06-12-cloudflare-scaling-security-insights)
Active-passive API follows primary database — running the API active-active across Portland and Amsterdam while the Postgres primary lived only in Portland meant 50ms+ RTT per query from Amsterdam, exhausting client connection pools. Switching to active-passive (API collocated with primary) eliminated latency and timeouts overnight. (Source: sources/2026-06-12-cloudflare-scaling-security-insights)
Connection pool exhaustion caused by cross-region latency — high per-query latency from Amsterdam kept connections occupied longer, draining the client-side pool. Average API call: 10ms in Portland vs. ~3 seconds in Amsterdam. Load-balanced persistent connections meant exactly half the Kafka partitions were starved. (Source: sources/2026-06-12-cloudflare-scaling-security-insights)
Adaptive rate-limited scheduling — a naive fixed rate limit can't accommodate growth (more accounts → overdue scans). The rate limit is recomputed every 30 minutes from current account/zone counts and per-tier scanning intervals, plus a buffer factor for downtime recovery. (Source: sources/2026-06-12-cloudflare-scaling-security-insights)
Schedule zones independently of accounts — large accounts with many zones cascaded all zone scans into a single burst, saturating Kafka partitions. Decoupling zone scheduling from account scheduling and randomizing last_scheduled_at across existing entities eliminated spikes. (Source: sources/2026-06-12-cloudflare-scaling-security-insights)
Understand before you scale — the team's core lesson: deeply understand existing system behaviour (logs, metrics, SQL queries, latency distributions) before adding resources. All five fixes were code/architecture changes, not capacity adds. (Source: sources/2026-06-12-cloudflare-scaling-security-insights)
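
The batch-parallel consumption takeaway can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's implementation: `Message`, `processBatch`, and `maxInFlight` are hypothetical names, the batch is an in-memory slice rather than a real Kafka fetch, and committing the returned offset is left to the caller.

```go
package main

import (
	"fmt"
	"sync"
)

// Message is a hypothetical stand-in for one Kafka record from a partition.
type Message struct {
	Offset int64
	Key    string
}

// processBatch runs handle on every message in batch concurrently, with at
// most maxInFlight goroutines active at once, and waits for all of them.
// On success it returns the next offset to commit (last offset + 1).
// If any handler fails it returns one of the errors and the caller should
// not commit, so the whole batch is redelivered (the "re-work" trade-off).
func processBatch(batch []Message, maxInFlight int, handle func(Message) error) (int64, error) {
	if len(batch) == 0 {
		return -1, nil // nothing to commit
	}
	if maxInFlight < 1 {
		maxInFlight = 1
	}
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		failure error
	)
	sem := make(chan struct{}, maxInFlight) // bounds concurrency
	for _, m := range batch {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxInFlight handlers are running
		go func(m Message) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := handle(m); err != nil {
				mu.Lock()
				if failure == nil {
					failure = err
				}
				mu.Unlock()
			}
		}(m)
	}
	wg.Wait()
	if failure != nil {
		return -1, failure
	}
	return batch[len(batch)-1].Offset + 1, nil
}

func main() {
	batch := []Message{{100, "acct-a"}, {101, "acct-b"}, {102, "acct-c"}, {103, "acct-d"}}
	var mu sync.Mutex
	scanned := 0
	next, err := processBatch(batch, 2, func(m Message) error {
		mu.Lock()
		scanned++
		mu.Unlock()
		return nil
	})
	fmt.Println(scanned, next, err) // prints "4 104 <nil>"
}
```

Because the offset advances only after the whole batch finishes, per-partition ordering of commits is preserved while the work inside a batch runs in parallel. A crash mid-batch means every message in that batch is processed again, which is the re-work cost the takeaway mentions.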

Operational Numbers

Metric	Before	After
Scan throughput	~10 scans/sec	>120 scans/sec peak
Target improvement	—	10x
Kafka partitions per checker	30	30 (unchanged)
API latency (Portland)	~10 ms	~10 ms
API latency (Amsterdam)	~3,000 ms	eliminated (active-passive)
Max insights per API call	500,000	500,000 (now fast)
Rate limit recalculation	—	every 30 min
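
The rate-limit recalculation in the last row can be illustrated with a short arithmetic sketch. It assumes the required rate is the sum, over tiers, of entity count divided by scan interval, multiplied by a buffer factor; the source describes these inputs but not the exact formula. The `Tier` type, `requiredRate` function, tier names, counts, intervals, and the 1.25 buffer are all hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Tier describes one scanning tier. All values used below are illustrative.
type Tier struct {
	Name     string
	Entities int           // accounts plus zones currently in this tier
	Interval time.Duration // how often each entity should be scanned
}

// requiredRate returns the scans/second needed to scan every entity once per
// its tier interval, multiplied by buffer (for example 1.25 for 25% headroom
// to catch up after downtime). Tiers with a non-positive interval are skipped.
func requiredRate(tiers []Tier, buffer float64) float64 {
	var rate float64
	for _, t := range tiers {
		if t.Interval <= 0 {
			continue
		}
		rate += float64(t.Entities) / t.Interval.Seconds()
	}
	return rate * buffer
}

func main() {
	// Assumed numbers: 6,048,000 entities weekly is 10 scans/sec,
	// 864,000 entities daily is another 10 scans/sec.
	tiers := []Tier{
		{Name: "free", Entities: 6_048_000, Interval: 7 * 24 * time.Hour},
		{Name: "paid", Entities: 864_000, Interval: 24 * time.Hour},
	}
	fmt.Printf("%.1f scans/sec\n", requiredRate(tiers, 1.25)) // prints "25.0 scans/sec"
}
```

In a scheduler, a function like this would be re-run on a timer (every 30 minutes, per the table) with fresh counts, so the limit tracks growth instead of falling behind it.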

Architecture

┌───────────┐      ┌───────────┐      ┌──────────────┐      ┌──────────┐
│ Scheduler │─────▶│   Kafka   │─────▶│ Go Checkers  │─────▶│ API      │──▶ Postgres
│ (adaptive │      │ (30 parts)│      │ (fast + slow │      │ (active- │   (Portland)
│  rate lim)│      │           │      │  lanes, batch│      │  passive)│
└───────────┘      └───────────┘      │  goroutines) │      └──────────┘
                                      └──────────────┘
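
The fast and slow lanes in the checker box can be modeled as two consumer groups that both read every message but each handle only their own lane. The sketch below is a simulation under assumptions: `ScanRequest`, `ZoneCount`, `classify`, `consume`, and the threshold of 1000 are hypothetical, since the source says only that the fast-versus-slow decision comes from cheap message metadata.

```go
package main

import "fmt"

// Lane identifies which consumer group should handle a message.
type Lane int

const (
	Fast Lane = iota
	Slow
)

// ScanRequest is a hypothetical message. ZoneCount stands in for whatever
// cheap metadata distinguishes slow scans from fast ones.
type ScanRequest struct {
	Offset    int64
	ZoneCount int
}

// classify inspects only message metadata, so skipping a message is cheap.
func classify(r ScanRequest, slowThreshold int) Lane {
	if r.ZoneCount >= slowThreshold {
		return Slow
	}
	return Fast
}

// consume models one consumer group: it reads every message in the partition
// in order, but only does work for messages in its own lane. It returns the
// offsets it handled.
func consume(lane Lane, partition []ScanRequest, slowThreshold int) []int64 {
	var handled []int64
	for _, r := range partition {
		if classify(r, slowThreshold) != lane {
			continue // the other group will handle this one
		}
		handled = append(handled, r.Offset)
	}
	return handled
}

func main() {
	partition := []ScanRequest{{0, 3}, {1, 40000}, {2, 1}, {3, 12}}
	fmt.Println("fast lane:", consume(Fast, partition, 1000)) // prints "fast lane: [0 2 3]"
	fmt.Println("slow lane:", consume(Slow, partition, 1000)) // prints "slow lane: [1]"
}
```

Each group keeps its own offsets, so the fast group moves past offset 1 immediately while the slow group works on it, which is how head-of-line blocking disappears without extra partitions.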

Caveats

The article does not discuss exactly-once semantics or how duplicate scans are handled when a batch crashes mid-processing.
The hybrid bulk-insert strategy's threshold between COPY and batched INSERT is not disclosed.
Adaptive rate limiting assumes relatively uniform scan cost; highly variable scan durations could still cause queue buildup.
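
Since the COPY-versus-INSERT threshold is not disclosed, the sketch below treats it as an explicit tuning parameter. It shows only the decision and the batching arithmetic, with no database calls. `chooseStrategy`, `chunks`, and the values 10,000 and 500 are assumptions for illustration, not figures from the article.

```go
package main

import "fmt"

// Strategy names the write path chosen for one API call.
type Strategy string

const (
	BatchedInsert Strategy = "batched INSERT"
	Copy          Strategy = "COPY"
)

// chooseStrategy picks the write path for one API call. copyThreshold is an
// assumed tuning knob; the source does not disclose the real cut-over point.
func chooseStrategy(rows, copyThreshold int) Strategy {
	if rows >= copyThreshold {
		return Copy
	}
	return BatchedInsert
}

// chunks splits rows [0, n) into half-open ranges of at most size rows, one
// range per multi-row INSERT statement.
func chunks(n, size int) [][2]int {
	if size < 1 {
		size = 1
	}
	var out [][2]int
	for start := 0; start < n; start += size {
		end := start + size
		if end > n {
			end = n
		}
		out = append(out, [2]int{start, end})
	}
	return out
}

func main() {
	const copyThreshold, batchSize = 10_000, 500 // illustrative values only
	for _, rows := range []int{1_200, 500_000} {
		s := chooseStrategy(rows, copyThreshold)
		if s == BatchedInsert {
			// prints "1200 rows: batched INSERT in 3 statements"
			fmt.Printf("%d rows: %s in %d statements\n", rows, s, len(chunks(rows, batchSize)))
		} else {
			// prints "500000 rows: COPY"
			fmt.Printf("%d rows: %s\n", rows, s)
		}
	}
}
```

With these assumed values, a 1,200-row call costs 3 round trips instead of 1,200, while the 500,000-row worst case takes the COPY path. Keeping small calls off the COPY path also limits how often temp tables are created, which the source identifies as the cause of system-table bloat.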

Source

systems/kafka — the streaming platform at the center of this architecture
concepts/kafka-partition — the partition-ordering constraint that motivates the batch-parallel and lane-split patterns
concepts/consumer-group — one consumer per partition per group; the split into fast/slow creates two groups
concepts/connection-pool-exhaustion — the failure mode triggered by cross-region latency
concepts/head-of-line-blocking — slow messages blocking fast messages within a partition
concepts/backpressure — the scheduler's adaptive rate limit is a form of admission control
patterns/fast-lane-slow-lane-consumer-split — new pattern from this article
patterns/adaptive-rate-limited-scheduling — new pattern from this article
patterns/batch-goroutine-parallel-consumption — new pattern from this article