
Snap: Snapchat-on-AWS architecture

Snap's Snapchat backend runs almost entirely on AWS, at a scale disclosed at AWS re:Invent 2022 and summarized in the December 2022 High Scalability roundup.

Scale (disclosed)

  • 300M+ daily active users
  • 5B+ snaps/day
  • 10M QPS
  • 400 TB stored in DynamoDB, with nightly scans running at ~2 billion rows/minute (friend suggestions + ephemeral-data deletion)
  • 900+ EKS clusters × 1000+ instances per cluster
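A quick back-of-envelope pass over the disclosed figures (a rough sketch; all inputs are the rounded numbers above, and the instance count is a lower bound):

```python
# Back-of-envelope arithmetic on Snap's disclosed scale figures.

DAU = 300_000_000                  # daily active users
SNAPS_PER_DAY = 5_000_000_000
SCAN_ROWS_PER_MIN = 2_000_000_000  # nightly DynamoDB scan rate
EKS_CLUSTERS = 900
INSTANCES_PER_CLUSTER = 1_000

snaps_per_user_per_day = SNAPS_PER_DAY / DAU           # ~16.7
avg_snaps_per_second = SNAPS_PER_DAY / 86_400          # ~57,870 average
scan_rows_per_second = SCAN_ROWS_PER_MIN / 60          # ~33.3 million
min_total_instances = EKS_CLUSTERS * INSTANCES_PER_CLUSTER  # 900,000+

print(f"{snaps_per_user_per_day:.1f} snaps/user/day")
print(f"{avg_snaps_per_second:,.0f} snaps/s (daily average)")
print(f"{scan_rows_per_second:,.0f} rows/s scanned")
print(f"{min_total_instances:,} EC2 instances (lower bound)")
```

Note the gap between the ~58K snaps/s daily average and the 10M QPS figure: total request load (fan-out, reads, presence, sync) dwarfs the raw snap-send rate.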

Send path

client (iOS/Android)
  ├──> GW (Gateway service, on EKS)
  │     └──> MEDIA service
  │           └──> CloudFront + S3
  │                 (persist media close to recipient)
  └──> MCS (Core Orchestration Service)
        ├──> Friend Graph service   (permission check)
        └──> SnapDB                  (metadata)
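The send path above can be sketched as a single orchestration function. Everything here is a hypothetical stand-in (plain callables for the Friend Graph, SnapDB, and media-store services); it illustrates the split between media persistence and metadata writes, not Snap's actual code:

```python
# Minimal sketch of the send path: permission check, media persisted to
# S3/CloudFront keyed by media-ID, metadata (not media) written to SnapDB.
# All service clients are hypothetical stand-ins.

import uuid
from dataclasses import dataclass


@dataclass
class SendResult:
    media_id: str
    accepted: bool


def send_snap(sender, recipient, media_bytes,
              friend_graph, snapdb, media_store):
    """Orchestrate one snap send, mirroring the GW -> MEDIA / MCS split."""
    # MCS: permission check against the Friend Graph service.
    if not friend_graph.can_send(sender, recipient):
        return SendResult(media_id="", accepted=False)

    # MEDIA service: persist the media blob close to the recipient,
    # addressable later by media-ID via CloudFront.
    media_id = str(uuid.uuid4())
    media_store.put(media_id, media_bytes)

    # MCS: write only the message metadata to SnapDB.
    snapdb.put_metadata(media_id=media_id, sender=sender, recipient=recipient)
    return SendResult(media_id=media_id, accepted=True)
```

The key design point the diagram encodes: the media blob and the message metadata take different paths, so the metadata store never carries media bytes.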

SnapDB is Snap's in-house database layer, using DynamoDB as its backing store. It adds:

  • transactions,
  • TTL handling,
  • an efficient ephemeral-data + state-synchronization model on top of DynamoDB's native primitives.

The cost-control dimension is explicit in the talk: SnapDB's abstractions over DynamoDB are "what helps control costs" at 400 TB of storage and ~2 billion rows/minute of scan load.
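The DynamoDB primitives SnapDB layers over can be illustrated with a transactional write carrying a TTL attribute for ephemeral deletion. The table names, key schema, and counter update below are hypothetical; the sketch only builds the `TransactWriteItems` request payload (in the shape boto3 accepts) without calling AWS:

```python
# Sketch: a transactional metadata write with a DynamoDB TTL attribute.
# Table/attribute names are hypothetical illustrations, not Snap's schema.

import time


def snap_metadata_txn(conversation_id, snap_id, sender, recipient,
                      ttl_seconds=30 * 24 * 3600):
    # DynamoDB TTL expects an epoch-seconds number attribute; items past
    # expires_at become eligible for background deletion.
    expires_at = int(time.time()) + ttl_seconds
    return {
        "TransactItems": [
            {   # Write the snap metadata item together with its expiry.
                "Put": {
                    "TableName": "snap_metadata",
                    "Item": {
                        "pk": {"S": f"CONV#{conversation_id}"},
                        "sk": {"S": f"SNAP#{snap_id}"},
                        "sender": {"S": sender},
                        "recipient": {"S": recipient},
                        "expires_at": {"N": str(expires_at)},
                    },
                }
            },
            {   # Bump the recipient's unread counter atomically in the
                # same transaction.
                "Update": {
                    "TableName": "inbox_counters",
                    "Key": {"pk": {"S": f"USER#{recipient}"}},
                    "UpdateExpression": "ADD unread :one",
                    "ExpressionAttributeValues": {":one": {"N": "1"}},
                }
            },
        ]
    }
```

Native TTL deletion is best-effort and lazy, which is one reason a layer like SnapDB would own its own ephemeral-data deletion scans rather than relying on TTL alone.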

Receive path (latency-sensitive)

sender's MCS write
  ──> MCS looks up recipient's persistent connection in ElastiCache
  ──> forward message via connection-owning server
  ──> client retrieves media by media-ID from CloudFront

Snap reported a 24% reduction in P50 latency for this design versus the predecessor path.
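The receive path's hot lookup can be sketched as follows. A plain dict stands in for ElastiCache (in production this would be a cache GET keyed by user ID), and the server objects are hypothetical stand-ins for the connection-owning servers:

```python
# Sketch of the receive path: look up which server holds the recipient's
# persistent connection (ElastiCache lookup), forward the notification
# through it, and let the client fetch the media from CloudFront by
# media-ID. All names here are illustrative stand-ins.

def deliver(recipient, media_id, connection_registry, servers):
    """Route a new-snap notification to the connection-owning server."""
    server_id = connection_registry.get(recipient)  # ElastiCache lookup
    if server_id is None:
        return "queued"        # recipient offline; deliver on reconnect
    # Forward via the server that owns the recipient's connection; the
    # payload carries only the media-ID, not the media bytes.
    servers[server_id].push(recipient, {"media_id": media_id})
    return "pushed"
```

Keeping the connection registry in a cache rather than a database is what makes this lookup cheap enough to sit on the latency-sensitive path.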

Cost-optimization levers

  • Auto-scaling (EKS-level) keeps compute aligned with the send/receive request rate.
  • Instance-type optimization — explicit migration to Graviton ARM-based EC2 for the dominant services, with CPU pricing below comparable x86 SKUs.
  • SnapDB abstraction over DynamoDB — lets Snap serve hot-path reads (such as the persistent-connection lookups) from ElastiCache instead of issuing repeated DynamoDB GetItem calls.
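The third lever is the classic cache-aside pattern: check the cache first and pay for a DynamoDB GetItem only on a miss. A minimal sketch, with a plain dict standing in for ElastiCache and a hypothetical `dynamo_get` callable standing in for the real GetItem call:

```python
# Sketch of amortizing hot-path reads via cache-aside. The call counter
# is purely illustrative of the GetItem cost being avoided.

class CachedReader:
    def __init__(self, dynamo_get):
        self.cache = {}            # stands in for ElastiCache
        self.dynamo_get = dynamo_get
        self.getitem_calls = 0     # illustrative cost counter

    def read(self, key):
        if key in self.cache:
            return self.cache[key]       # hot path: no DynamoDB read
        self.getitem_calls += 1
        value = self.dynamo_get(key)     # cold path: one GetItem
        self.cache[key] = value
        return value
```

At a 10M QPS request rate, even a modest cache hit ratio on hot keys translates directly into avoided per-request DynamoDB read costs.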

Why it shows up on this wiki

Canonical example of the DynamoDB-as-scale-out-OLTP pattern at hyperscale, and of EKS + DynamoDB + CloudFront + ElastiCache as a complete architecture for millions-of-QPS ephemeral messaging. It is also a counterpoint to the Twitter / Roblox bare-metal-is-cheaper narrative circulating in the same period (sources/2022-12-02-highscalability-stuff-the-internet-says-on-scalability-for-december-2nd-2022): Snap publicly argued that the cloud-native architecture is itself the cost-control strategy at their scale, citing Graviton optimization for cost and the ElastiCache hot-connection lookup for latency.
