DROPBOX 2025-08-08 Tier 2


Seventh-generation server hardware at Dropbox: our most efficient and capable architecture yet

Summary

Dropbox's seventh-generation in-house server hardware — replacing the 2020-era sixth-gen Cartman platform — rolled out across five named tiers: Crush (compute), Dexter (database), Sonic (storage), Gumby (mixed-workload GPU), and Godzilla (dense multi-GPU). The refresh is built around three forcing functions: embrace the 2024-era step changes in CPU core counts, 200/400G networking, and HDD areal density; co-develop with suppliers instead of buying off the shelf; and bring software teams into hardware decisions at the requirements stage.

The headline moves:

  • A 48-core → 84-core AMD EPYC (Rome → Genoa) CPU swap with DDR4 → DDR5 and a 25G → 100G NIC (Crush: ~40% SPECintrate gain, +75% cores/socket, 2× RAM/server, same 1U "pizza box", 46 servers/rack).
  • A dual-socket → single-socket Dexter shift with +30% IPC and a 2.1 → 3.25 GHz base clock, driving up to 3.57× less replication lag on Dynovault and Edgestore.
  • A vibration- and acoustically-tuned storage chassis co-designed with drive vendors that enabled early adoption of Western Digital Ultrastar HC690 32 TB 11-platter SMR drives (~10%+ capacity bump per generation).
  • A SAS topology rework to hit >200 Gbps per chassis against an internal floor of 30 Gbps/PB.
  • At the facility level, 2 → 4 PDUs per rack (reusing existing busways plus new receptacles) to lift the per-rack power envelope from 15 kW to >16 kW real-world draw and open headroom for future accelerators.

GPUs arrive as two new tiers: Gumby (Crush-derived, 75–600 W TDP envelope, HHHL + FHFL PCIe, for video transcoding / embeddings / inference) and Godzilla (up to 8 interconnected GPUs, for LLM fine-tuning / high-throughput ML training).

Dropbox operates roughly tens of thousands of servers with millions of drives, has grown from 40 PB (2012) → 600 PB (2016) into the "exabyte era", runs >99% of its storage fleet on SMR, and has kept >90% of stored data in self-operated datacenters since the 2015 Magic Pocket migration. Looking forward, the post names HAMR (heat-assisted magnetic recording) and liquid cooling as the next technology steps.

Key takeaways

  1. Software workloads are a forcing function for hardware design, not the other way around. Every section of the post names which software team's workload shaped which hardware decision: Dynovault and Edgestore drove the dual-→single-socket Dexter shift (replication lag dominates their tail); containerized services drove Crush's core-count doubling (bin-packing efficiency); systems/dropbox-dash and video processing drove the creation of Gumby and Godzilla; storage software drove the 30 Gbps/PB → 200 Gbps/chassis SAS-topology rework. Stated explicitly: "we weren't just designing servers, but building platforms that elevated our services." Formalized here as concepts/hardware-software-codesign. (Source: sources/2025-08-08-dropbox-seventh-generation-server-hardware)

  2. Supplier co-development is a strategic lever, not a procurement detail. Across storage (vibration/acoustic chassis), compute (firmware tuning, heatsink/airflow), and the SMR 32 TB drive (first-mover on Ultrastar HC690), Dropbox's story is the same: give suppliers your workload, get early access and firmware customization back. This is a different primitive from "buy off the shelf" and a different primitive from "build it ourselves": it's a long-term co-investment that converts supplier-roadmap position into earlier hardware capability. Formalized as patterns/supplier-codevelopment. (Source: sources/2025-08-08-dropbox-seventh-generation-server-hardware)

  3. Thermals and power, not silicon, are the new bottlenecks. Quoted near-verbatim: "No matter where we looked — compute, storage, or GPU platforms — one thing was clear: power demands are going up." Dropbox's solution was to cap processor TDP per server so they could pack maximum cores into the existing rack power envelope, model real-world draw not nameplate (nameplate overestimates), and — when real-world still exceeded 15 kW/rack — double PDUs from 2 → 4 using existing busways + new receptacles rather than rebuild the facility. Power consumption per petabyte and per core still decreased. Formalized as concepts/rack-level-power-density and patterns/pdu-doubling-for-power-headroom. (Source: sources/2025-08-08-dropbox-seventh-generation-server-hardware)

  4. Higher-capacity drives tighten the acoustic/vibration envelope, not just the IOPS-per-TB envelope. The concepts/hard-drive-physics framing from the 2025 Warfield/S3 post argues capacity-per-drive scales exponentially while IOPS-per-drive stays roughly constant (~120 IOPS/drive, flat since 2006). Dropbox adds a second structural constraint: as drives hit 30+ TB, the read/write head's nanometer precision leaves vanishing margin against the vibration of 10k-RPM fans packed into a denser chassis. Vibration induces position error signal (PES) events; worst case, a write fault → drive retry → latency spike + IOPS drop. Meanwhile drives age fastest above ~40 °C, so you can't just slow the fans. The co-developed chassis explicitly trades fan-curve tuning + airflow redirection + acoustic damping against this axis. (Source: sources/2025-08-08-dropbox-seventh-generation-server-hardware)

  5. SMR is now the dominant format of Dropbox's storage fleet — >99%. This is a major industry data point: shingled magnetic recording started as an experimental tier for cold data and has, at Dropbox specifically, eaten the generalist storage workload entirely. The post cites their 2022 four-years-of-SMR retrospective which charted the 25% → 99% migration. SMR's higher density is what made the 32 TB Ultrastar HC690 viable; SMR is also what narrows vibration tolerance, linking this takeaway directly to (4). (Source: sources/2025-08-08-dropbox-seventh-generation-server-hardware)

  6. Compute and database platforms converged onto a single system vendor platform. Early in Crush/Dexter design Dropbox realized the requirements overlapped enough to reuse one vendor platform for both tiers, simplifying components, firmware, drivers, and OS updates. Dexter differentiates via a single-socket (vs Crush's dual-socket) SKU — eliminating inter-socket communication latency for databases where replication lag is the dominant tail driver — while sharing everything else. A consolidation move framed as operational-complexity reduction at fleet scale, not as a hardware-architecture novelty. (Source: sources/2025-08-08-dropbox-seventh-generation-server-hardware)

  7. Storage throughput engineering is a bandwidth-per-capacity problem, not a per-drive problem. Internal floor: 30 Gbps per PB of data. Expected future systems: >100 Gbps per PB. Design target: >200 Gbps per chassis. This inverts the usual per-drive IOPS/throughput framing; Dropbox cares about whether the aggregate drives in a chassis can deliver proportional bandwidth as capacity climbs, which means the SAS topology (how drives attach to the HBA/expander) becomes the scaling axis, not the drive interface per se. Paired with a new 400G-ready datacenter fabric on the network side, described in an earlier Dropbox post. (Source: sources/2025-08-08-dropbox-seventh-generation-server-hardware)

  8. GPU tiers are split into "flexibility" and "density". Gumby is Crush + PCIe GPU slots with an intentionally wide TDP envelope (75–600 W) and both HHHL and FHFL form factors — optimized for mixed inference / embeddings / transcoding workloads that vary widely in accelerator sizing. Godzilla is dense multi-GPU (up to 8, interconnected) for LLM training and fine-tuning. The split encodes a general design principle: accelerator platforms should be planned as product tiers keyed to workload shape, not as a single "GPU server" SKU. (Source: sources/2025-08-08-dropbox-seventh-generation-server-hardware)

  9. "Real-world modeling beats nameplate budgeting" is an actionable methodology. Dropbox models actual server draw (~16 kW/cabinet under the new workload mix) rather than the manufacturer nameplate (routinely overestimates). That number is what triggered the 4-PDU move. Generalizes: any capacity-planning step that consumes a nameplate max as a hard budget number systematically over-provisions facility power; a workload-shape-aware model unlocks 10–20% headroom. (Source: sources/2025-08-08-dropbox-seventh-generation-server-hardware)

  10. The next-gen roadmap names two forcing functions: HAMR and liquid cooling. Heat-assisted magnetic recording will push areal density further but will tighten the acoustic/thermal envelope still more — reinforcing the co-developed-chassis direction. Liquid cooling moves from "niche" to "necessity" as compute densities climb past the ~600 W TDP point Gumby already supports. Both signal that facility-level primitives (cooling medium, power density, rack form factor) become first-class design variables for future generations, not just variables downstream of chip choice. (Source: sources/2025-08-08-dropbox-seventh-generation-server-hardware)
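The bandwidth-per-capacity floor from takeaway 7 reduces to a simple check. A minimal sketch, assuming a hypothetical drives-per-chassis count (the post does not give one):

```python
def gbps_per_pb(chassis_gbps: float, capacity_pb: float) -> float:
    """Aggregate chassis bandwidth normalized by stored capacity."""
    return chassis_gbps / capacity_pb

def meets_floor(chassis_gbps: float, capacity_pb: float,
                floor: float = 30.0) -> bool:
    """Dropbox's stated internal floor is 30 Gbps per PB."""
    return gbps_per_pb(chassis_gbps, capacity_pb) >= floor

# Hypothetical chassis: 100 x 32 TB drives ~= 3.2 PB raw (drive count is an
# assumption for illustration).
capacity = 100 * 32 / 1000            # PB
print(gbps_per_pb(200.0, capacity))   # 62.5 Gbps/PB at the >200 Gbps design target
print(meets_floor(200.0, capacity))               # clears the 30 Gbps/PB floor
print(meets_floor(200.0, capacity, floor=100.0))  # not yet at the >100 future bar
```

The point of the framing: as drives per chassis grow denser, the check fails unless the SAS topology scales aggregate bandwidth along with capacity.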

Named platforms and numbers

Crush (compute)
  Changes vs gen-6: AMD EPYC 7642 Rome 48c → EPYC 9634 Genoa 84c; DDR4 256 GB → DDR5 512 GB; 25G → 100G NIC; NVMe gen5; same 1U chassis, 46 servers/rack
  Cited numbers: +75% cores/socket, 2× RAM, ~40% SPECintrate gain

Dexter (database)
  Changes vs gen-6: same core count as gen-6; dual-socket → single-socket; +30% IPC; 2.1 → 3.25 GHz base clock
  Cited numbers: up to 3.57× less replication lag on Dynovault and Edgestore

Sonic (storage)
  Changes vs gen-6: co-developed chassis with vibration/acoustic damping, redirected-airflow fan design, SAS topology rework; first-mover on Ultrastar HC690 32 TB SMR (11 platters, 3.5")
  Cited numbers: >200 Gbps/chassis (design target); >10% capacity gain per generation

Gumby (GPU, mixed)
  Changes vs gen-6: new tier; Crush-based + PCIe GPU slots
  Cited numbers: 75–600 W TDP envelope; HHHL + FHFL form factors

Godzilla (GPU, dense)
  Changes vs gen-6: new tier
  Cited numbers: up to 8 interconnected GPUs

Facility-level:
  • Per-rack power: 15 kW → ~16+ kW real-world draw supported; PDUs 2 → 4 per rack using existing busways
  • Power consumption per petabyte and per core decreased even as total rack power increased
  • Network: new 400G-ready DC architecture
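The gap between nameplate budgeting and real-world modeling behind these facility numbers can be sketched. Only the 46 servers/rack and the roughly 16 kW modeled draw come from the post; the per-server wattages below are hypothetical:

```python
SERVERS_PER_RACK = 46     # Crush 1U density, from the post
NAMEPLATE_W = 450.0       # hypothetical per-server PSU nameplate
MEASURED_W = 350.0        # hypothetical measured draw under the real workload mix

nameplate_kw = SERVERS_PER_RACK * NAMEPLATE_W / 1000   # what a nameplate budget reserves
measured_kw = SERVERS_PER_RACK * MEASURED_W / 1000     # what the rack actually pulls

headroom_pct = (nameplate_kw - measured_kw) / nameplate_kw * 100
print(f"nameplate {nameplate_kw:.1f} kW vs measured {measured_kw:.1f} kW: "
      f"{headroom_pct:.0f}% of the facility budget reclaimed by modeling")
```

With these assumed wattages the measured figure lands near the ~16 kW/cabinet the post cites, which is the number that triggered the 4-PDU move.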

Scale context:
  • Tens of thousands of servers, millions of drives
  • 40 PB (2012) → 600 PB (2016) → exabyte era (2025)
  • Since the 2015 Magic Pocket migration, >90% of stored data on Dropbox-managed hardware
  • >99% of storage fleet on SMR

Architecture bits worth extracting

The CPU-selection loop

Dropbox evaluated 100+ processors, filtered by four criteria: maximum system-level throughput, minimum latency for individual processes, best price/performance for Dropbox-specific workloads, and balanced I/O + memory bandwidth. They ran SPECintrate and compared performance per watt and per core. "Balanced" is load-bearing: it's not just raw core counts; an 84-core chip that starves for memory bandwidth would fail criterion four. The 84-core Genoa won on both the max-throughput and per-core-performance axes.
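A minimal sketch of this down-select loop, with illustrative (not real) SKU numbers; the bandwidth-per-core floor stands in for the "balanced" criterion:

```python
from dataclasses import dataclass

@dataclass
class Cpu:
    name: str
    cores: int
    specint_rate: float    # system-level throughput proxy
    tdp_w: float
    mem_bw_gbs: float      # memory bandwidth, GB/s

MIN_BW_PER_CORE = 4.0      # GB/s per core; illustrative floor, not a Dropbox number

def down_select(candidates: list[Cpu]) -> Cpu:
    # Criterion 4: drop chips whose cores would starve for memory bandwidth.
    viable = [c for c in candidates if c.mem_bw_gbs / c.cores >= MIN_BW_PER_CORE]
    # Rank survivors on performance per watt (proxy for criteria 1 and 3).
    return max(viable, key=lambda c: c.specint_rate / c.tdp_w)

# Illustrative entries, not real SKU data:
field = [
    Cpu("chip-84c", 84, 1500.0, 290.0, 460.0),
    Cpu("chip-96c", 96, 1550.0, 360.0, 460.0),
]
print(down_select(field).name)
```

The filter-then-rank shape is the key idea: a chip can win raw throughput and still be eliminated by the balance criterion before perf/watt is ever compared.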

The single-socket database argument

Dual-socket systems pay inter-socket latency on every cache coherence miss that crosses sockets. For OLTP-shaped databases (write on primary, replicate to secondary), that latency shows up directly as replication lag. Going single-socket, combined with the +30% IPC and the higher base clock, compounded into the 3.57× replication-lag reduction. Named beneficiaries: Dynovault and Edgestore.
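The silicon-side factors compound multiplicatively; a quick check shows they explain roughly a 2× speedup on their own, leaving the rest of the cited 3.57× plausibly attributable to the eliminated cross-socket latency (the post does not decompose the figure):

```python
# Per-core throughput factors from the post:
ipc_gain = 1.30            # +30% IPC
clock_gain = 3.25 / 2.1    # base clock ratio, GHz

per_core_speedup = ipc_gain * clock_gain
print(f"{per_core_speedup:.2f}x from IPC x clock alone")
# The cited "up to 3.57x less replication lag" exceeds this, which is
# consistent with the removed cross-socket latency doing the remaining work.
```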

The vibration/acoustic design brief

Drive head operates with nanometer precision over a flying gap roughly two sheets of paper thick (cf concepts/hard-drive-physics — Warfield's 747-over-grass analogy). Fan RPM >10k and a denser chassis mean tighter vibration coupling. PES = position error signal; cumulative PES → write fault → drive retry → latency + IOPS degradation. Drive temperature sweet spot is ~40 °C: below it, extra cooling buys no reliability, so any fan speed spent getting there is wasted vibration; above it, drives age faster and error rates rise. Co-developed chassis addresses: (1) vibration control via acoustical isolation and damping, (2) thermals via fan control + airflow redirection, (3) future-proofing for next-gen drive form factors. This is an instance of concepts/heat-management applied at the chassis/mechanical level rather than at the multi-tenant-placement level S3 operates at — complementary framings.
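The fan-curve tradeoff can be sketched as picking the lowest RPM that keeps drives at or under the ~40 °C sweet spot, since slower fans mean less vibration coupled into the heads. The cooling model below is a toy linear assumption, not vendor data:

```python
def choose_fan_rpm(ambient_c: float,
                   rpm_options=(6000, 8000, 10000, 12000)) -> int:
    """Pick the lowest fan RPM whose modeled drive temp is <= 40 C."""
    def drive_temp(rpm: int) -> float:
        # Toy linear model (assumption): drives sit 20 C over ambient at
        # rest, and each extra 1000 RPM removes ~1.5 C.
        return ambient_c + 20.0 - 1.5 * (rpm / 1000)
    for rpm in sorted(rpm_options):
        if drive_temp(rpm) <= 40.0:
            return rpm
    return max(rpm_options)   # thermal limit wins if nothing qualifies

print(choose_fan_rpm(25.0))   # cool room: slow fans suffice
print(choose_fan_rpm(35.0))   # hot room: the vibration budget shrinks
```

The shape of the function is the point: hotter ambient forces higher RPM, which is exactly the vibration the damped chassis has to absorb.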

The PDU-doubling trick

Conventional move: higher power budget requires facility rework (bigger busways, new wiring, possibly new cabinets). Dropbox's move: keep the existing busways, add more receptacles, run more PDUs per rack. 2 → 4. Effectively doubles deliverable power without rebuilding the facility. Tradeoffs not fully enumerated in the post — presumably rack density and cable management both got tighter — but the outcome is 16 kW served from infrastructure that was nominally a 15 kW facility. Generalizes as patterns/pdu-doubling-for-power-headroom.
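A sketch of the deliverable-power arithmetic, assuming single-phase 30 A / 208 V PDUs with the standard 80% continuous-load derate — the post gives none of these electrical specifics:

```python
def rack_power_kw(pdus: int, pdu_amps: float = 30.0, volts: float = 208.0,
                  derate: float = 0.8) -> float:
    """Deliverable rack power in kW. 0.8 is the usual NEC continuous-load
    derate; amperage and voltage here are assumptions for illustration."""
    return pdus * pdu_amps * volts * derate / 1000

print(rack_power_kw(2))   # baseline with 2 PDUs
print(rack_power_kw(4))   # doubled PDUs double deliverable power
```

Whatever the actual PDU ratings, the relationship is linear: doubling PDU count on existing busways doubles deliverable power without touching the facility.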

Systems introduced or surfaced

  • systems/magic-pocket — Dropbox's in-house block storage, the 2015 Amazon S3 exit destination; operates the hardware described here at exabyte scale, >99% on SMR.
  • systems/smr-drives — Shingled magnetic recording; >99% of Dropbox's storage fleet; enables higher-density drives like the 32 TB Ultrastar HC690, at the cost of track-overlap-driven write-amplification for random writes (hence the filesystem+workload shaping on top).
  • systems/crush — 7th-gen compute platform; 84-core Genoa in 1U.
  • systems/dexter — 7th-gen database platform; single-socket Genoa; same vendor platform as Crush.
  • systems/sonic — 7th-gen storage platform; co-developed vibration/thermal chassis for 30+ TB SMR.
  • systems/gumby — 7th-gen flexible GPU tier.
  • systems/godzilla — 7th-gen dense multi-GPU tier.
  • systems/dropbox-dash — The AI product whose workload shape forced the GPU tiers to exist.
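The random-write penalty mentioned for systems/smr-drives can be illustrated with a toy shingled-zone model (track counts assumed, not real drive geometry):

```python
def smr_rewrite_cost(zone_tracks: int, target_track: int) -> int:
    """Toy shingled-zone model: tracks overlap like roof shingles, so
    overwriting track k forces rewriting k and every track shingled on
    top of it through the end of the zone. Tracks are 0-indexed."""
    return zone_tracks - target_track

zone = 100   # tracks per zone: an assumed number for illustration
print(smr_rewrite_cost(zone, zone - 1))   # sequential append: 1 track written
print(smr_rewrite_cost(zone, 0))          # worst-case random overwrite: whole zone
```

This asymmetry is why the filesystem and workload shaping sit on top: keep writes sequential within zones and SMR's density comes nearly free.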

Concepts surfaced

  • concepts/hardware-software-codesign — Naming the practice: hardware requirements gathered from software teams before silicon selection, software workload shape fed into chassis/firmware tuning. Dropbox's 7th-gen rollout is an end-to-end instance.
  • concepts/performance-per-watt — Explicit selection criterion in Dropbox's CPU-down-select, not raw performance; paired with per-core perf to avoid picking an energy-efficient chip that underperforms per-thread.
  • concepts/rack-level-power-density — The actual scarce resource: kW/rack, not kW/server. Dropbox models real-world draw against the facility-level budget and adapts the power-distribution topology to fit.
  • concepts/hard-drive-physics (existing) — Second source confirming the Warfield/S3 framing; Dropbox adds the vibration-envelope constraint at 30+ TB that Warfield's IOPS/capacity framing doesn't cover.
  • concepts/heat-management (existing) — Extends the concept from S3's multi-tenant-placement framing to Dropbox's mechanical/chassis framing: same concept applied at a different layer of the stack.

Patterns surfaced

  • patterns/supplier-codevelopment — Long-horizon supplier relationship as a hardware-capability lever: workload telemetry → supplier firmware/hardware customization → early access. Dropbox on storage chassis, on SMR 32 TB drive, on compute firmware. Complements — does not replace — patterns/hackathon-to-platform (own-everything) or off-the-shelf procurement.
  • patterns/pdu-doubling-for-power-headroom — When per-rack power budget is the bottleneck and facility rebuild isn't on the table, duplicate PDU count per rack on existing busways. Concrete 2 → 4 PDU move at Dropbox; reusable template.

Caveats

  • Self-reported, with a promotional tone — dropbox.tech publishes post hoc; there are no independent benchmarks of Crush/Dexter/Sonic vs gen-6, just the 40% SPECintrate / 3.57× replication lag / >10% capacity gain figures Dropbox publishes.
  • No cost numbers. "Better performance per watt" and "lowers cost per terabyte" are stated qualitatively. TCO modeling not disclosed.
  • GPU tier details are sparse. Named SKUs behind Gumby and Godzilla (H100? MI300? L40S?) aren't given; the 75–600 W envelope suggests broad NVIDIA SKU coverage but doesn't pin it.
  • Benchmarking methodology not disclosed. SPECintrate is an industry standard but the "3.57× less replication lag" figure is Dropbox-internal; workload and measurement window not specified.
  • The "maintenance burden" side of consolidating Crush + Dexter onto one vendor platform is asserted, not quantified. Simpler ops is plausible; specific incident-reduction / MTTR numbers not given.
  • "PDU doubling" generalization untested — Dropbox's facility had unused busway capacity and receptacle density to absorb 2 → 4 PDUs. Not every datacenter does. The move generalizes; the precondition (existing headroom in the power distribution) limits its applicability.