PLANETSCALE 2022-09-08

PlanetScale — TAOBench: Running social media workloads on PlanetScale

Summary

Liz van Dijk (PlanetScale Tech Solutions, 2022-09-08) introduces TAOBench — Audrey Cheng (UC Berkeley) and the Meta engineering team's open-source benchmark that synthesises Meta's real TAO (social-graph store) workload — and publishes PlanetScale's Vitess-on-MySQL results against it. The post's load-bearing contribution to the wiki is introducing a social-graph-shaped benchmark as a first-class complement to the OLTP-shaped sysbench-tpcc used in the sibling 1M-QPS post. Van Dijk explicitly positions TAOBench as filling a gap TPC-C doesn't cover: "The TPC-C benchmark has had a very long life, and has remained remarkably relevant until this day, but there are scenarios it doesn't cover. Audrey Cheng and her team at University of California, Berkeley identified a real gap when it comes to available synthetic benchmarks for a more recent, but highly pervasive workload type: social media networks."

Three new canonical wiki primitives drop out of this post: (1) the objects + edges data model for the social graph (two tables; edges is a many-to-many relationship table linking rows in objects to other rows in objects) — the canonical relational encoding of the social-graph pattern, distinct from Meta's graph-native TAO API surface; (2) the three-phase benchmark protocol (load → bulk reads → experiments) that separates data loading, cache/infrastructure warm-up, and the measured experiment — a methodology refinement over single-phase sysbench-style runs; (3) the constrained-resource benchmark with an explicit multi-tenant-overhead split — Cheng's team set a 48-CPU-core limit; PlanetScale under-provisioned the query path to 44 cores, reserving 4 cores for multi-tenant infrastructure (edge load balancers). This 44+4 = 48-core split is a new canonical datum for how a multi-tenant serverless database presents capability benchmarks without claiming resources the tenant wouldn't actually own in production.
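
A minimal relational sketch of primitive (1), for concreteness. This is not TAOBench's actual DDL: the column set and the edges primary key are assumptions, and SQLite stands in for MySQL so the snippet is self-contained.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the pattern, not the dialect, is the point.
db = sqlite3.connect(":memory:")

# objects: every entity (user, post, picture, ...) is one row.
# edges: every relation (like, share, friendship, ...) is one row linking
# two objects rows -- a classic many-to-many relationship table.
db.executescript("""
CREATE TABLE objects (
    id   INTEGER PRIMARY KEY,
    type TEXT NOT NULL,   -- 'user', 'post', 'picture', ...
    data TEXT             -- opaque payload
);
CREATE TABLE edges (
    id1  INTEGER NOT NULL REFERENCES objects(id),  -- source object
    id2  INTEGER NOT NULL REFERENCES objects(id),  -- target object
    type TEXT NOT NULL,   -- 'like', 'share', 'friendship', ...
    data TEXT,
    PRIMARY KEY (id1, type, id2)                   -- one plausible key choice
);
""")

# A user liking a post: one row per entity, one edge for the relation.
db.execute("INSERT INTO objects VALUES (1, 'user', 'alice')")
db.execute("INSERT INTO objects VALUES (2, 'post', 'hello world')")
db.execute("INSERT INTO edges VALUES (1, 2, 'like', NULL)")

# Traverse the graph: what has user 1 liked?
print(db.execute(
    "SELECT o.type, o.data FROM edges e "
    "JOIN objects o ON o.id = e.id2 "
    "WHERE e.id1 = 1 AND e.type = 'like'"
).fetchall())  # [('post', 'hello world')]
```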

Van Dijk's key architectural framing is graceful saturation vs. congestive collapse at resource exhaustion: "Most systems have some stretch, even while running at what looks like 100% CPU. With ever increasing workload pressure, though, every piece of software eventually starts experiencing failures, by way of thrashing, congestive collapse or other effects." This becomes the canonical wiki framing for why behaviour beyond saturation is a more important substrate property than peak throughput — peak QPS is a feature-list number, graceful saturation is the architectural property that matters in production. PlanetScale's published takeaway is explicitly this property rather than the QPS number: "Our key takeaway from the initial results as published is the sustained stability of PlanetScale clusters under even the most extreme resource pressure." This reframes what a benchmark is measuring for — substrate maturity (behaviour at the ceiling and beyond) rather than substrate capability (where the ceiling is).
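
The distinction is easy to state as a check on the measured throughput curve. The sketch below classifies what happens past the saturation point; the thresholds and sample numbers are invented for illustration, not drawn from the post.

```python
def classify_past_saturation(samples, tolerance=0.10):
    """Classify post-saturation behaviour from (concurrency, qps) samples.

    `samples` must be sorted by increasing concurrency and should cover the
    region at and beyond 100% CPU. Thresholds are illustrative assumptions.
    """
    peak = max(qps for _, qps in samples)
    final = samples[-1][1]
    if final >= peak * (1 - tolerance):
        return "graceful saturation: throughput plateaus at the ceiling"
    if final >= peak * 0.5:
        return "degradation: throughput sags as pressure keeps rising"
    return "congestive collapse: throughput falls off past the ceiling"

# Two hypothetical runs, concurrency 64..512 (numbers invented):
plateau  = [(64, 9_000), (128, 11_800), (256, 12_000), (512, 11_900)]
collapse = [(64, 9_000), (128, 11_800), (256, 9_000), (512, 4_000)]
print(classify_past_saturation(plateau))   # graceful saturation
print(classify_past_saturation(collapse))  # congestive collapse
```

The production-relevant property is which branch a substrate lands in once the ceiling is reached, not the height of the ceiling itself.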

Tier-3 clear via (1) the new benchmark workload shape (social-graph objects + edges) that generalises to any relational system handling social-media-shaped workloads; (2) the VLDB-paper-grade methodology reference with reproducible benchmark code; (3) the architectural framing of graceful-saturation-vs-congestive-collapse as the substrate property that matters more than peak throughput; (4) the canonical 44+4-core multi-tenant overhead disclosure. Architecture density ~55% — the post is partly marketing framing ("why enterprise organizations choose PlanetScale") but the benchmark methodology disclosure, the three-phase protocol, the hot-row/viral-content framing, and the multi-tenant overhead split are all substantive.

Key takeaways

  • TAOBench fills a TPC-C gap for social-media workloads. Verbatim: "The TPC-C benchmark has had a very long life, and has remained remarkably relevant until this day, but there are scenarios it doesn't cover. Audrey Cheng and her team at University of California, Berkeley identified a real gap when it comes to available synthetic benchmarks for a more recent, but highly pervasive workload type: social media networks." Canonical wiki datum: sysbench-tpcc represents OLTP-shaped online-retail-adjacent workloads; TAOBench represents social-graph-shaped workloads, and the distinction matters because "the behavior of these workloads may come across as somewhat synthetic and reductionist" but "the two formulas end up very closely resembling the real world behavior observed at Meta". Pairs with the sibling sysbench-tpcc post — the two posts intentionally cover complementary benchmark shapes so PlanetScale's capability disclosure spans both.

  • Two-table social-graph schema: objects + edges as many-to-many. Verbatim: "The schema for TAOBench is straightforward: it consists of 2 tables, one called objects and one called edges, concepts that loosely translate to the social graph of entities (think 'users', 'posts', 'pictures', etc.) and to the various types of relationships they have with each other (think 'likes', 'shares', 'friendships', etc.). In simple relational database terms: The edges table can be viewed as a 'many-to-many' relationship table that links rows in objects to other rows in objects." Canonical wiki concept: concepts/social-graph-objects-and-edges — the relational encoding of Meta's graph-native TAO model as two tables, where any entity becomes an objects row and any relation becomes an edges row. Relational databases get a canonical shape for social-graph workloads that's orthogonal to Meta's TAO API abstraction.

  • Two workload profiles: Workload A (application/transactional subset) vs Workload O (overall). Verbatim: "Workload A (short for Application) focuses specifically on a transactional subset of the queries. Workload O (short for Overall) encompasses a more generalized profile of the TAO workload." Canonical wiki datum: TAOBench ships two pre-configured workload shapes, with distinct statistical distributions of data in both objects and edges tables depending on the chosen workload. Operational consequence: "data should be reloaded when switching between them" — the workload shape is baked into the load phase, not just the query phase. This is a stronger representation-of-Meta coupling than sysbench-tpcc's Lua-script-level workload shape.

  • Hot-row / thundering-herd pressure as explicit workload-design target. Verbatim: "Focusing the workload around these two simplified concepts allows the benchmark to simulate typical 'hot row' scenarios that can be particularly challenging for relational databases to handle. Think of what happens when something goes viral: a thundering herd of users comes through to interact with a specific piece of content posted somewhere. On the database level, beyond a sudden surge in connections, this can also translate into various types of locks centered around the backing rows for that piece, which can have rippling effects that ultimately translate to slower content access times for the users on the platform." Canonical wiki enrichment: the benchmark explicitly stresses hot-row behaviour and thundering-herd response — the viral-content scenario is a named design target, not an accidental property of the workload. This makes TAOBench the first named benchmark on the wiki that explicitly measures substrate behaviour under viral-content skew, distinct from sysbench-tpcc whose access patterns are shard-key-aligned (i.e., no hot rows by construction).

  • Three-phase benchmark protocol: load → bulk reads (warmup) → experiments. Verbatim: "The TAOBench workload executes in three distinct phases: During the 'load' phase, it performs bulk inserts, populating the objects and edges tables according to the chosen workload scenario. The 'bulk reads' phase (which is initiated at the start of any real benchmark run) performs very aggressive range scans across the entire dataset to serve as general 'warmup' to whichever caching mechanisms may be in place, and also aggregates the necessary statistical information to feed into the experiments themselves. This phase is not measured, but can be extremely punishing to the underlying infrastructure. The 'experiments' phase accepts a set of predefined concurrency levels and runtime operation targets to help scale the chosen workload to various sizes of infrastructure." Canonical wiki datum: TAOBench separates warmup into its own explicit unmeasured phase, which is a methodology improvement over single-phase benchmarks that conflate cold-cache ramp with measured steady-state. The bulk-reads phase is explicitly "extremely punishing to the underlying infrastructure" — it tests range-scan capacity, which is a different substrate axis than the experiments phase's concurrency-driven point-op load. Two independent substrate axes are exercised in one benchmark run. (A minimal driver-loop sketch of the three phases appears after this list.)

  • 48-core resource limit with explicit 44+4 multi-tenant overhead split. Verbatim: "The limit Cheng's team set for their tests was 48 CPU cores. Since the test was performed against PlanetScale's multi-tenant serverless database offering where certain infrastructural resources are shared across multiple tenants, we underprovisioned the 'query path' of the cluster itself to use a maximum of 44 CPU cores out of the requested 48 maximum. The other 4 cores would be used for multi-tenant aspects of the infrastructure, such as edge load balancers." Canonical wiki datum: for multi-tenant serverless database offerings, the tenant-accessible compute budget is smaller than the headline number because some cores are reserved for shared infrastructure (edge load balancers in PlanetScale's case). The 44+4 = 48 split becomes the canonical disclosure pattern for capability-vs-allocation transparency when benchmarking a multi-tenant substrate; a single-tenant substrate would have no such split. New concept: concepts/constrained-resource-benchmark — benchmark with an explicit resource cap (not "as much as we can give it") chosen to represent a production tier rather than the substrate ceiling.

  • Graceful saturation vs. congestive collapse: the ceiling-behaviour property. Verbatim: "Our key takeaway from the initial results as published is the sustained stability of PlanetScale clusters under even the most extreme resource pressure. As is to be expected in an artificially constrained environment, TAOBench's 'experiments' phase uses gradually increasing concurrency pressure to bring the target database to its knees, and once 44 cores are all running at 100%, throughput (measured in requests per second) is expected to hit a ceiling while average response times increase. Most systems have some stretch, even while running at what looks like 100% CPU. With ever increasing workload pressure, though, every piece of software eventually starts experiencing failures, by way of thrashing, congestive collapse or other effects. Distributed database systems are not magically protected from these failure scenarios. If anything, increased infrastructural complexity and the potential for competition amongst different types of resources generally translates to many more interesting ways things can break down." Canonical wiki concept: concepts/graceful-saturation-vs-congestive-collapse — the substrate property that distinguishes a system that plateaus at the ceiling from one that falls over past the ceiling (classic thrashing on a paging subsystem; congestive collapse on a network; cascading timeout/retry storms on a distributed DB). Van Dijk's architectural claim: "Finding the balance between resource efficiency and graceful failure handling requires equal parts of software maturity and ongoing infrastructural engineering excellence." The number reported by a benchmark is the peak QPS; the property that matters for production is the shape of the degradation curve past 100% CPU — whether it plateaus, degrades linearly, or collapses.

  • Published benchmark as independently measured + internally verified. Verbatim: "The results published in Cheng's white paper were independently measured by the TAOBench team against PlanetScale infrastructure. Since the benchmark code has been made publicly available, PlanetScale has been able to verify them internally." Canonical wiki datum: TAOBench's open-source benchmark code enables independent-party benchmarking with reproducible methodology — the Berkeley/Meta team measured PlanetScale externally, then PlanetScale replicated internally. Pairs with the reproducible-benchmark-publication pattern: when benchmark code is public and runnable, third-party numbers become verifiable rather than vendor-claims-only.
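
As referenced in the three-phase bullet above, here is a minimal driver-loop sketch of the load → bulk reads → experiments protocol, with a Zipf-skewed key chooser standing in for the hot-row pressure described in the viral-content bullet. Every name and number (the phase functions, the skew exponent, the concurrency schedule) is an illustrative assumption, not TAOBench's actual driver; the real driver is the open-source code the post references.

```python
import itertools
import random
import time

def load_phase():
    """Phase 1 (unmeasured): bulk inserts into objects and edges. The
    statistical distributions depend on the chosen workload (A or O),
    which is why data must be reloaded when switching workloads."""
    ...  # elided: bulk INSERTs shaped by the workload profile

def bulk_reads_phase():
    """Phase 2 (unmeasured): aggressive range scans across the whole
    dataset -- warms caches and aggregates statistics for phase 3."""
    ...  # elided: full-dataset range scans

def make_hot_key_chooser(n_keys, skew=1.2):
    """Zipf-skewed key chooser: a handful of 'viral' rows absorb most of
    the traffic, producing the hot-row contention the benchmark targets."""
    weights = [1.0 / rank ** skew for rank in range(1, n_keys + 1)]
    cum = list(itertools.accumulate(weights))
    return lambda: random.choices(range(n_keys), cum_weights=cum, k=1)[0]

def experiments_phase(concurrency_levels, ops_per_level):
    """Phase 3 (measured): concurrency ramps upward to find the ceiling
    and, more importantly, the shape of the curve past it."""
    choose_key = make_hot_key_chooser(n_keys=10_000)
    for workers in concurrency_levels:
        start = time.monotonic()
        for _ in range(ops_per_level):
            key = choose_key()
            ...  # elided: issue a point read/write for `key` at `workers` concurrency
        elapsed = time.monotonic() - start
        # Printed rates are meaningless here (the DB op is elided);
        # the ramp structure is the point.
        print(f"concurrency={workers}: {ops_per_level / elapsed:.0f} ops/s")

# Protocol order: load once per chosen workload, warm up unmeasured, measure.
load_phase()
bulk_reads_phase()
experiments_phase(concurrency_levels=[64, 128, 256, 512], ops_per_level=10_000)
```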

Operational numbers

  • 48 CPU cores — total resource limit set by Cheng's team for the benchmark test.
  • 44 CPU cores — allocated to PlanetScale's query path (VTGate + VTTablet + backing MySQL) for this benchmark.
  • 4 CPU cores — reserved for multi-tenant serverless infrastructure (edge load balancers, shared components).
  • 2 tables in the TAOBench schema: objects + edges.
  • 2 workload scenarios shipped: Workload A (application-transactional subset) + Workload O (overall TAO profile).
  • 3 benchmark phases: load (bulk inserts, populated per workload scenario) → bulk reads (aggressive range scans, warmup + statistics aggregation, unmeasured) → experiments (predefined concurrency levels + operation targets, measured).
  • 2 VLDB papers — workload-characterisation paper (VLDB 2021) + TAOBench benchmark paper (VLDB 2022).

Caveats

  • Benchmark results not directly compared to the 1M-QPS sibling post. Van Dijk explicitly positions the TAOBench numbers as deliberately constrained — "the numbers themselves are by no means approaching the limits of what can be accomplished on our infrastructure. Rather, they represent what is achieved by imposing an explicit resource limit to the cluster." The sysbench-tpcc 1M-QPS post runs on a single-tenant enterprise deployment without a 48-core cap; these two benchmarks exercise different axes of substrate disclosure and should not be compared as a simple ratio.

  • Multi-tenant serverless deployment is a separate environment class from single-tenant enterprise. The 44+4 overhead split is specific to the multi-tenant serverless tier. Single-tenant enterprise deployments (like the 1M-QPS benchmark's environment) would have no such reserved overhead. Readers comparing PlanetScale benchmarks across posts should not assume the same capacity envelope.

  • Detailed per-phase numbers not in this post. Van Dijk explicitly defers: "Stay tuned for a more in-depth blog post of how the resources were allocated to the various Vitess components we provisioned and for an exploration of how the various OS-level metrics looked." This post is the capability + methodology framing; the full resource-allocation breakdown is a follow-up. As of the 2026-04-21 crawl, the follow-up remains a separate post.

  • "Somewhat synthetic and reductionist" — workload is a model, not a replay. Van Dijk acknowledges: "In practice, the behavior of these workloads may come across as somewhat synthetic and reductionist, but taking some time to read the white papers will clarify how the two formulas end up very closely resembling the real world behavior observed at Meta." TAOBench is representative-by-statistical-modelling, not a shadow replay of Meta's actual traffic. The representativeness argument relies on the papers' characterisation work; readers who want the full rigour should read the two VLDB papers.

  • Tier-3 source with marketing framing. The post closes with a direct sales ask: "That is why enterprise organizations are increasingly choosing to have their database workloads managed by PlanetScale. Don't hesitate to contact us..." The architecture content is substantive (~55% of body) but the framing is capability-disclosure-for-sales rather than post-mortem-or-internals. Read alongside the two VLDB papers for the methodology-first view.
