Scaling Security Insights: how we achieved a 10x increase in global scanning capacity
Summary
Cloudflare's Security Insights team needed to scale its account/zone scanning throughput from 10 scans/second to 100+ scans/second to enable automatic scanning for all free accounts and to double scanning frequency. The system (a scheduler publishing to Apache Kafka, consumed by Go microservice checkers that write results to a Postgres database via an internal API) was bottlenecked by Kafka's in-order consumption constraint, database round-trip overhead, cross-region API latency that exhausted connection pools, and spiky scheduling. The team reached >120 scans/second through five targeted architectural fixes, without adding Kafka partitions or other infrastructure.
Key Takeaways
- Batch-parallel consumption within partition ordering: Kafka enforces per-partition ordering for consumption, but nothing prevents a consumer from reading a batch of messages and processing them concurrently in goroutines. Trade-off: more rework after a crash and slightly higher memory use. (Source: sources/2026-06-12-cloudflare-scaling-security-insights)
- Fast-lane / slow-lane consumer-group split: splitting a single consumer group into two (one skips slow messages, the other handles them) eliminates head-of-line blocking without adding Kafka partitions. Deciding fast versus slow is cheap because it only requires inspecting message metadata. (Source: sources/2026-06-12-cloudflare-scaling-security-insights)
- Bulk INSERT with hybrid strategy: naive per-row INSERT with ON CONFLICT caused up to 500,000 round trips per API call, and COPY into a temp table caused Postgres system-table bloat. The winning hybrid uses COPY for large sets (seconds) and batched INSERT for small sets (milliseconds). (Source: sources/2026-06-12-cloudflare-scaling-security-insights)
- Active-passive API follows primary database: running the API active-active across Portland and Amsterdam while the Postgres primary lived only in Portland meant 50 ms+ RTT per query from Amsterdam, exhausting client connection pools. Switching to active-passive (API colocated with the primary) eliminated the cross-region latency and the timeouts overnight. (Source: sources/2026-06-12-cloudflare-scaling-security-insights)
- Connection pool exhaustion caused by cross-region latency: high per-query latency from Amsterdam kept connections occupied longer, draining the client-side pool. An average API call took about 10 ms in Portland versus about 3 seconds in Amsterdam. Load-balanced persistent connections meant exactly half the Kafka partitions were starved. (Source: sources/2026-06-12-cloudflare-scaling-security-insights)
- Adaptive rate-limited scheduling: a naive fixed rate limit can't accommodate growth, because more accounts lead to overdue scans. The rate limit is recomputed every 30 minutes from current account/zone counts and per-tier scanning intervals, plus a buffer factor for downtime recovery. (Source: sources/2026-06-12-cloudflare-scaling-security-insights)
- Schedule zones independently of accounts: large accounts with many zones cascaded all zone scans into a single burst, saturating Kafka partitions. Decoupling zone scheduling from account scheduling and randomizing `last_scheduled_at` across existing entities eliminated spikes. (Source: sources/2026-06-12-cloudflare-scaling-security-insights)
- Understand before you scale: the team's core lesson is to deeply understand existing system behaviour (logs, metrics, SQL queries, latency distributions) before adding resources. All five fixes were code/architecture changes, not capacity adds. (Source: sources/2026-06-12-cloudflare-scaling-security-insights)
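The batch-parallel consumption takeaway can be illustrated with a minimal Go sketch. It does not use a real Kafka client: `Message`, `processBatch`, and the `maxWorkers` bound are hypothetical names introduced here, and the source does not say how the checkers bound concurrency or commit offsets. The sketch assumes the consumer commits only after every message in the batch has been handled.

```go
package main

import (
	"fmt"
	"sync"
)

// Message is a hypothetical stand-in for a record read from one Kafka partition.
type Message struct {
	Offset int64
	Key    string
}

// processBatch runs handle on every message in the batch concurrently, with at
// most maxWorkers goroutines in flight. It returns the offset that is safe to
// commit (the last offset in the batch) only if every handler succeeded, and
// -1 otherwise or when the batch is empty.
func processBatch(batch []Message, maxWorkers int, handle func(Message) error) (int64, error) {
	if len(batch) == 0 {
		return -1, nil
	}
	if maxWorkers < 1 {
		maxWorkers = 1
	}
	sem := make(chan struct{}, maxWorkers) // bounds concurrency
	errs := make([]error, len(batch))      // one slot per message, so no locking is needed
	var wg sync.WaitGroup
	for i, m := range batch {
		wg.Add(1)
		sem <- struct{}{}
		go func(i int, m Message) {
			defer wg.Done()
			defer func() { <-sem }()
			errs[i] = handle(m)
		}(i, m)
	}
	wg.Wait()
	for i, err := range errs {
		if err != nil {
			return -1, fmt.Errorf("offset %d: %w", batch[i].Offset, err)
		}
	}
	return batch[len(batch)-1].Offset, nil
}

func main() {
	batch := []Message{
		{Offset: 40, Key: "zone-a"},
		{Offset: 41, Key: "zone-b"},
		{Offset: 42, Key: "acct-c"},
	}
	commit, err := processBatch(batch, 2, func(m Message) error {
		return nil // a real checker would run the scan here
	})
	fmt.Println(commit, err) // prints "42 <nil>"
}
```

Committing only after the whole batch succeeds is what produces the rework trade-off noted above: if the process crashes mid-batch, every message in that batch is delivered again.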
Operational Numbers
| Metric | Before | After |
|---|---|---|
| Scan throughput | ~10 scans/sec | >120 scans/sec peak |
| Target improvement | — | 10x |
| Kafka partitions per checker | 30 | 30 (unchanged) |
| API latency (Portland) | ~10 ms | ~10 ms |
| API latency (Amsterdam) | ~3,000 ms | eliminated (active-passive) |
| Max insights per API call | 500,000 | 500,000 (now fast) |
| Rate limit recalculation | — | every 30 min |
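The 30-minute rate limit recalculation can be sketched as a simple sum over tiers. This is an assumption about the shape of the computation, not the team's actual formula: the `Tier` type, the `requiredRate` function, the tier sizes, and the buffer value of 1.5 are all invented for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Tier is a hypothetical grouping of entities that share one scan interval.
type Tier struct {
	Name     string
	Entities int           // accounts plus zones currently in this tier
	Interval time.Duration // how often each entity should be scanned
}

// requiredRate returns the scans per second needed to keep every tier on
// schedule, multiplied by a buffer factor that leaves headroom for catching
// up after downtime. Tiers with a non-positive interval are skipped.
func requiredRate(tiers []Tier, buffer float64) float64 {
	var perSecond float64
	for _, t := range tiers {
		if t.Interval <= 0 {
			continue
		}
		perSecond += float64(t.Entities) / t.Interval.Seconds()
	}
	return perSecond * buffer
}

func main() {
	// Made-up numbers: 864,000 entities scanned daily need 10 scans/sec,
	// and 86,400 entities scanned every 12 hours need 2 scans/sec.
	tiers := []Tier{
		{Name: "free", Entities: 864000, Interval: 24 * time.Hour},
		{Name: "paid", Entities: 86400, Interval: 12 * time.Hour},
	}
	fmt.Printf("%.1f scans/sec\n", requiredRate(tiers, 1.5)) // prints "18.0 scans/sec"
}
```

Rerunning a function like this on a timer lets the limit track account growth, which a fixed constant cannot do.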
Architecture
┌───────────┐      ┌───────────┐      ┌──────────────┐      ┌──────────┐
│ Scheduler │─────▶│   Kafka   │─────▶│ Go Checkers  │─────▶│   API    │──▶ Postgres
│ (adaptive │      │ (30 parts)│      │ (fast + slow │      │ (active- │    (Portland)
│  rate lim)│      │           │      │  lanes, batch│      │  passive)│
└───────────┘      └───────────┘      │  goroutines) │      └──────────┘
                                      └──────────────┘
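The fast and slow lanes in the checker box can be sketched as two consumer groups that read the same partitions and apply opposite filters. The source says only that the fast-versus-slow decision comes from message metadata. The `ZoneCount` field, the `slowThreshold` value, and the assumption that the slow lane skips fast messages are hypothetical choices made for this example.

```go
package main

import "fmt"

// Lane identifies which consumer group a consumer belongs to.
type Lane int

const (
	Fast Lane = iota
	Slow
)

// ScanRequest is a hypothetical message shape. ZoneCount stands in for
// whatever metadata the real system inspects to predict scan cost.
type ScanRequest struct {
	ID        string
	ZoneCount int
}

// slowThreshold is an invented cutoff, not a value from the source.
const slowThreshold = 1000

// classify decides the lane from metadata alone, without running the scan.
func classify(r ScanRequest) Lane {
	if r.ZoneCount >= slowThreshold {
		return Slow
	}
	return Fast
}

// shouldHandle reports whether a consumer in the given lane processes r.
// In this sketch each message is handled by exactly one lane, and the other
// lane simply advances past it.
func shouldHandle(lane Lane, r ScanRequest) bool {
	return classify(r) == lane
}

func main() {
	stream := []ScanRequest{
		{ID: "a", ZoneCount: 3},
		{ID: "b", ZoneCount: 5000},
		{ID: "c", ZoneCount: 12},
	}
	for _, lane := range []Lane{Fast, Slow} {
		for _, r := range stream {
			if shouldHandle(lane, r) {
				fmt.Printf("lane=%d handles %s\n", lane, r.ID)
			}
		}
	}
}
```

Because both groups track their own offsets, a long scan delays only the slow group's progress on that partition while the fast group keeps moving.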
Caveats
- The article does not discuss exactly-once semantics or how duplicate scans are handled when a batch crashes mid-processing.
- The hybrid bulk-insert strategy's threshold between COPY and batched INSERT is not disclosed.
- Adaptive rate limiting assumes relatively uniform scan cost; highly variable scan durations could still cause queue buildup.
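Since the COPY-versus-INSERT threshold is not disclosed, the hybrid bulk-insert decision can only be sketched with placeholder values. In the Go example below, `copyThreshold`, `batchSize`, `chooseStrategy`, and `chunks` are hypothetical, and no database calls are made: the code shows only how a row count might select a strategy and how a small set might be split into multi-row statements.

```go
package main

import "fmt"

// Strategy names the write path chosen for one API call.
type Strategy string

const (
	BatchedInsert Strategy = "batched-insert"
	Copy          Strategy = "copy"
)

// Both values are invented for illustration; the source gives no numbers.
const (
	copyThreshold = 10000 // at or above this row count, use COPY
	batchSize     = 1000  // rows per multi-row INSERT statement
)

// chooseStrategy picks the write path from the number of rows to persist.
func chooseStrategy(rowCount int) Strategy {
	if rowCount >= copyThreshold {
		return Copy
	}
	return BatchedInsert
}

// chunks splits n rows into [start, end) index ranges of at most size rows,
// one range per INSERT statement.
func chunks(n, size int) [][2]int {
	if size < 1 {
		return nil
	}
	var out [][2]int
	for start := 0; start < n; start += size {
		end := start + size
		if end > n {
			end = n
		}
		out = append(out, [2]int{start, end})
	}
	return out
}

func main() {
	for _, n := range []int{2500, 500000} {
		s := chooseStrategy(n)
		if s == BatchedInsert {
			fmt.Printf("%d rows: %s in %d statements\n", n, s, len(chunks(n, batchSize)))
		} else {
			fmt.Printf("%d rows: %s\n", n, s)
		}
	}
}
```

With these placeholder values, 2,500 rows become three INSERT statements instead of 2,500 round trips, while the 500,000-row case goes through COPY.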
Source
- Original: https://blog.cloudflare.com/scaling-security-scans/
- Raw markdown: raw/cloudflare/2026-06-12-scaling-security-insights-how-we-achieved-a-10x-increase-in-9fe2cb91.md
Related
- systems/kafka — the streaming platform at the center of this architecture
- concepts/kafka-partition — the partition-ordering constraint that motivates the batch-parallel and lane-split patterns
- concepts/consumer-group — one consumer per partition per group; the split into fast/slow creates two groups
- concepts/connection-pool-exhaustion — the failure mode triggered by cross-region latency
- concepts/head-of-line-blocking — slow messages blocking fast messages within a partition
- concepts/backpressure — the scheduler's adaptive rate limit is a form of admission control
- patterns/fast-lane-slow-lane-consumer-split — new pattern from this article
- patterns/adaptive-rate-limited-scheduling — new pattern from this article
- patterns/batch-goroutine-parallel-consumption — new pattern from this article