PlanetScale — Performant database tree traversal with Rails¶
Summary¶
Mike Coutermarsh (PlanetScale application-tier Rails engineer; fifth wiki ingest after 2022-01-18 Rails-CI, 2022-02-17 self-healing-Sidekiq, 2022-06-29 SQLCommenter, 2022-08-16 FastPage, and 2024-04-04 PlanetScale schema-change dogfooding) walks through how PlanetScale's Rails API solved a classic N+1 query problem in its three-way-merge schema-diff pipeline. Each schema-change branch is represented as a schema snapshot: a git-commit-like record with one or two parents. Finding the common-ancestor merge base between two branches requires a breadth-first search over the parent DAG. Small databases have a handful of snapshots, and the BFS runs in milliseconds; large databases have thousands, and the naïve per-step `SchemaSnapshot.find(id)` degenerates into a tight network-bound loop of sequential `SELECT * FROM schema_snapshots WHERE id = N` queries. Rails' standard `includes` eager loading doesn't help because the traversal shape is dynamic: each step's next node depends on the previous node's parents, which are unknown until fetched. The fix is an ad-hoc `SnapshotMergeBaseCache`, a plain Hash-backed in-memory cache scoped to one merge-base calculation and instrumented with `@hits` / `@misses` counters, paired with a bulk-preload step that loads the `FROM_PRELOAD_COUNT` and `INTO_PRELOAD_COUNT` most recent snapshots of each branch into the cache before the BFS begins. The preload covers most traversals; the per-step lookup still falls back to the database for deep histories. The rollout used a correctness cross-check against thousands of real databases (the old and new methods must produce the same merge base), then a feature-flagged gradual ramp, with the per-request hit/miss counters surfaced as telemetry to calibrate the preload counts before the 100% rollout. The post closes with a two-alternative comparison: recursive CTEs (push the traversal into MySQL) and the materialized-paths technique (store the ancestor chain in a single column like `20/19/15/10/5/3/1`). Both are viable and more elegant in the abstract, but both require invasive schema changes PlanetScale had structural reasons to avoid: recursive CTEs would push work onto the database server at an already-hot multi-tenant tier, and materialized paths scale poorly when relationship histories reach thousands of entries. The in-memory cache was chosen specifically because it layers onto existing code without schema changes, which minimised rollout risk on a customer-impacting code path.
Key takeaways¶
- N+1 query problem, canonical form on the application tier: a loop that does one DB lookup per record ("select * from schema_snapshots where id = 20, select * from schema_snapshots where id = 21, select * from schema_snapshots where id = 22, … thousands more queries"), where individual queries are fast but the cumulative network round-trip cost dominates. "Even though each query was fast, the network time alone added up quickly." Canonical first wiki source framing the N+1 problem at the generic application-tier altitude, beyond the two existing narrow mentions (cross-shard NL join in concepts/nested-loop-join + batching-bottleneck instance in concepts/network-round-trip-cost).
- Rails `includes` eager loading doesn't solve dynamic traversals. "Usually, fixing these in Rails is quite simple. There is an `includes` method for preloading the data we need. Unfortunately, in this case, due to the data structure, the normal preloading techniques do not work." `includes` works for statically declared association sets ("load all `posts` + their `comments`") but not for graph walks where step N+1's set of nodes depends on step N's fetched data. The workaround is to predict the likely-accessed set before traversal begins; in this case, "the snapshot history is pretty predictable when finding the merge base. We could preload the X most recent snapshots for each branch."
- BFS as the merge-base algorithm on a 2-parent DAG: PlanetScale models schema-change history as "a git commit that stores the state of the schema at a specific time. Each snapshot can have one or two parents. When merging branches, we perform a breadth-first search on the history of each change until we find the common ancestor between both branches. This is the merge base." Same algorithmic shape as git's `merge-base` on a commit DAG with up to 2 parents per merge commit: textbook lowest-common-ancestor on a DAG via simultaneous BFS from both branch tips.
- Ephemeral in-memory cache scoped to one traversal, hash-backed (keyed by `id`), with explicit hit/miss counters. Canonical in-memory cache for tree traversal pattern: no TTL, no invalidation, no cross-request sharing; born at the start of one merge-base calculation, dies at the end. Scope-matched to the work unit. The counters are the point: they make the cache's efficacy measurable in production.

  ```ruby
  class SnapshotMergeBaseCache
    attr_reader :store, :hits, :misses

    def initialize
      @store = {}
      @misses = 0
      @hits = 0
    end

    def get(id)
      if @store[id]
        @hits += 1
        return @store[id]
      end

      @misses += 1
      add(SchemaSnapshot.find(id))
    end

    def add(node)
      @store[node.id] = node
    end
  end
  ```
- Bulk preload before the dynamic walk: load "the X most recent snapshots for each branch" into the cache in two SQL queries (one per branch, each with a `LIMIT`) before kicking off the BFS. The BFS then almost always hits cache during traversal: "Now, when running our breadth-first search, we can use `cache.get(id)` to find the next node. It hits the cache in most cases, avoiding the network request and solving our performance problem." Two upper-case constants, `FROM_PRELOAD_COUNT` and `INTO_PRELOAD_COUNT`, externalise the tuning knobs, which matters for item 6.
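A sketch of the preload step's shape. The post only says two queries, one per branch, each with a LIMIT; the constant values, row shape, and helper below are all illustrative stand-ins.

```ruby
# Illustrative shape only: FROM_PRELOAD_COUNT / INTO_PRELOAD_COUNT are
# named in the post, but their values are not disclosed; rows here are
# plain hashes where the real code loads SchemaSnapshot rows via LIMIT.
FROM_PRELOAD_COUNT = 50
INTO_PRELOAD_COUNT = 50

# Two bulk loads (one per branch) replace thousands of per-step finds.
def preload(cache, from_branch_rows, into_branch_rows)
  from_branch_rows.last(FROM_PRELOAD_COUNT).each { |row| cache[row[:id]] = row }
  into_branch_rows.last(INTO_PRELOAD_COUNT).each { |row| cache[row[:id]] = row }
  cache
end

# Linear histories standing in for each branch's snapshot history.
from_rows = (1..200).map { |i| { id: i, parent_id: i - 1 } }
into_rows = (150..260).map { |i| { id: i, parent_id: i - 1 } }

cache = preload({}, from_rows, into_rows)
cache.size # => 100 (the 50 most recent ids from each branch)
```

The BFS then consults this cache first and only falls back to the database for ids older than the preload window, which is the long-tail case the summary mentions.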
- Cache-stats-guided feature rollout: correctness was validated first ("We ran a few tests where we calculated the merge base using both the old and new methods for thousands of databases. This made us confident the new code was returning the correct results."), then a feature-flagged gradual rollout instrumented the in-production hit/miss counters: "We then used feature flags to test rolling out the new code path and recorded data on how it performed. The `hits` and `misses` data proved useful for fine-tuning the number of snapshots we preloaded. After a couple iterations, we released it to 100% of our customers." The cache counters from item 4 are the per-request telemetry that drives the preload-count tuning loop; the feature-flag ramp is the safety envelope. Canonical wiki instance of "the counters you added for debugging become the rollout-calibration signal."
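The tuning loop just described reduces to per-request hit-rate arithmetic; a minimal sketch (the function name and numbers are made up, the post discloses none):

```ruby
# Hit rate per merge-base calculation: the signal for whether the
# preload LIMIT covers real traversals. Persistently low rates suggest
# raising the preload count; rates near 1.0 suggest it is large enough.
def cache_hit_rate(hits, misses)
  total = hits + misses
  total.zero? ? 0.0 : hits.to_f / total
end

cache_hit_rate(3, 1) # => 0.75
```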
- Two rejected alternatives with explicit reasoning; both involve invasive schema changes that PlanetScale avoided. (a) Push the traversal into MySQL via a recursive CTE: "This can be accomplished with a recursive common table expression. With this, the database could follow the pointer to each record until it finds the common merge base." Rejected implicitly: moving work onto a multi-tenant database server is a structural cost the application-tier cache avoids. (b) The materialized-paths technique: "It stores the relationship history in a single column, such as `20/19/15/10/5/3/1`. By doing this, you can then look at the history of two nodes and find their common parent. This is a great option that works well for tree structures with a known limit. In our case, storing thousands of relationships didn't make this feasible." Rejected on the scale axis: materialized paths balloon to kilobyte-scale strings when ancestor chains reach thousands of entries, and every insert has to update every descendant's path.
Layered-on-existing-code as a load-bearing design criterion: "Adding an in-memory cache is just one way of solving this problem. This worked out best for us due to the high number of snapshots we needed to traverse for some databases. It was also simple to layer this solution onto our existing code without many major changes. This reduced the risk when rolling it out." Three-line framing of why application-tier cache was chosen over structurally cleaner alternatives: (a) existing code worked correctly at small scale; (b) schema-change alternatives (CTE, materialized path) would have required data-model changes on a hot customer-impacting table; (c) ephemeral in-memory cache attaches at one call site and changes no data. Rollout risk is the multiplier on architectural elegance.
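For contrast, the materialized-paths lookup rejected above is genuinely simple, which is why the post calls it "a great option" for bounded trees. A toy sketch (the path format follows the post's `20/19/15/10/5/3/1` example; everything else is illustrative):

```ruby
# Materialized path: each node stores its whole ancestor chain in one
# column. The common ancestor of two nodes is the first id they share,
# reading each path tip-to-root.
def common_ancestor(path_a, path_b)
  seen = path_b.split("/").to_h { |id| [id, true] }
  path_a.split("/").find { |id| seen[id] }
end

common_ancestor("20/19/15/10/5/3/1", "22/18/15/10/5/3/1") # => "15"

# The scale problem is visible in the column itself: thousands of
# ancestors means a multi-kilobyte path string per row, rewritten for
# every descendant on each insert.
deep_path = (1..5000).to_a.join("/") # roughly 24 KB of ids and slashes
```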
Systems / concepts extracted¶
Systems¶
- systems/planetscale — the Rails API under discussion.
- systems/ruby-on-rails — the application framework; ActiveRecord `includes` is the eager-loading primitive that doesn't help for dynamic traversals.
- systems/mysql — the database storing schema snapshots. The post is engine-agnostic but the author's context is MySQL/Vitess.
Concepts¶
- concepts/n-plus-one-query-problem — classic database anti-pattern: one query per iteration of a loop over N items. Canonical first wiki page.
- concepts/breadth-first-search-merge-base — finding the lowest common ancestor in a 2-parent DAG via simultaneous BFS from both branch tips.
- concepts/schema-snapshot — git-commit-like record of a database schema state at a point in time, with one or two parent snapshots to support branch+merge.
- concepts/network-round-trip-cost — the unit cost that dominates when individual queries are fast but many.
- concepts/cache-hit-rate — the load-bearing metric for tuning the preload count.
- concepts/feature-flag — the rollout safety envelope.
- concepts/recursive-cte — rejected alternative: database-side traversal.
Patterns¶
- patterns/in-memory-cache-for-tree-traversal — ephemeral, work-scoped, hash-backed, hit/miss-counted.
- patterns/bulk-preload-before-traversal — load the predictable recent-history set upfront, fall back to per-step lookups for the long tail.
- patterns/cache-stats-guided-feature-rollout — instrument the cache with counters, roll out behind a feature flag, use the counters as the calibration signal.
Operational numbers¶
- Query shape at degradation: "thousands more queries" for one merge-base calculation on a customer database with a long schema-change history. Pre-fix, each `SchemaSnapshot.find(id)` is one round trip; the aggregate wall-clock is dominated by N × network-round-trip time, not by MySQL execution.
- Preload constants: `FROM_PRELOAD_COUNT` and `INTO_PRELOAD_COUNT`. Concrete values not disclosed, but externalised as constants specifically so the counter-driven rollout could tune them without code changes. "The `hits` and `misses` data proved useful for fine-tuning the number of snapshots we preloaded."
- Correctness validation: "we calculated the merge base using both the old and new methods for thousands of databases". N is unspecified but explicitly "thousands", which is operationally meaningful (it rules out a handful of hand-picked examples).
- Rollout iteration count: "After a couple iterations, we released it to 100%" — canonical example of the feature-flag ramp taking 2-3 adjustments before reaching full rollout.
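The post's qualitative claim that "the network time alone added up quickly" is just multiplication; with illustrative numbers (none are from the post):

```ruby
# Hypothetical: even at 1 ms per round trip, a few thousand sequential
# single-row SELECTs is seconds of pure network wait, regardless of how
# fast MySQL executes each one.
round_trip_ms = 1.0
queries = 5_000
total_ms = queries * round_trip_ms # => 5000.0, i.e. five seconds
```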
Caveats¶
- No quantitative before/after numbers. The post frames the problem as "thousands of queries → one round-trip dominated loop" and the fix as "hits cache in most cases", but gives no millisecond measurements, p99 numbers, or query-count reductions. Compare to the same author's 2022-08-16 FastPage post (sources/2026-04-21-planetscale-introducing-fastpage-faster-offset-pagination-for-rails-apps) which quantifies the 2.7× speedup; this post is qualitative only.
- Preload constant values undisclosed. `FROM_PRELOAD_COUNT` and `INTO_PRELOAD_COUNT` are left as named constants; the values Coutermarsh landed on (and the distribution of snapshot-history lengths they're tuned against) are not published.
- Cache invalidation is trivial by construction (the cache lives for one merge-base calculation and is discarded), but this means the post doesn't address the separate question of cross-request schema-snapshot caching, which would face the usual invalidation problems on a hot-write table.
- The three alternatives (in-memory cache, recursive CTE, materialized paths) aren't benchmarked against each other. The post's rejection of CTE is implicit ("letting the database do the work" offered as an option, no further analysis); the rejection of materialized paths is on a scale axis ("storing thousands of relationships didn't make this feasible") but the crossover point isn't quantified.
- BFS-as-merge-base algorithmic details elided. The post names the algorithm but doesn't walk through termination conditions, duplicate-ancestor handling, or the specific dual-queue shape needed to handle the 2-parent DAG case. The reader is assumed to know how git's `merge-base` works.
- No discussion of worst-case preload miss. If a customer has a schema history deep enough that the preload doesn't cover the merge base, the cache offers no speedup and the N+1 loop runs in full. The post doesn't disclose what fraction of traversals hit this case in production, or whether the preload count grows with observed miss rate.
- Not database-engine-specific content. This is an application-tier Rails post: ActiveRecord plus an ad-hoc cache class. The same pattern applies to any ORM with eager-load support that doesn't handle dynamic traversals (Django, SQLAlchemy, Hibernate, Eloquent). PlanetScale-specific substrate (MySQL, Vitess, branching) is background context, not the subject.
- Publication-date ambiguity. The byline reads "Mike Coutermarsh | July 12, 2023"; raw frontmatter shows `published: 2026-04-21` (the re-fetch date). Dating in this source page uses the 2023-07-12 byline as the authoritative publication date.
Source¶
- Original: https://planetscale.com/blog/performant-database-tree-traversal-with-rails
- Raw markdown:
raw/planetscale/2026-04-21-performant-database-tree-traversal-with-rails-098ed913.md
Related¶
- systems/planetscale
- systems/ruby-on-rails
- systems/mysql
- concepts/n-plus-one-query-problem
- concepts/breadth-first-search-merge-base
- concepts/schema-snapshot
- concepts/network-round-trip-cost
- concepts/cache-hit-rate
- concepts/feature-flag
- concepts/recursive-cte
- patterns/in-memory-cache-for-tree-traversal
- patterns/bulk-preload-before-traversal
- patterns/cache-stats-guided-feature-rollout
- sources/2026-04-21-planetscale-non-blocking-schema-changes — same branching / three-way-merge substrate this post builds on
- sources/2026-04-21-planetscale-how-we-made-planetscales-background-jobs-self-healing — same Mike Coutermarsh Rails-backend voice
- sources/2026-04-21-planetscale-introducing-fastpage-faster-offset-pagination-for-rails-apps — same author, same Rails-backend altitude, quantitative twin