PlanetScale — Performant database tree traversal with Rails¶
Summary¶
Mike Coutermarsh (PlanetScale application-tier Rails engineer; fifth wiki ingest after 2022-01-18 Rails-CI, 2022-02-17 self-healing-Sidekiq, 2022-06-29 SQLCommenter, 2022-08-16 FastPage, and 2024-04-04 PlanetScale schema-change dogfooding) walks through how PlanetScale's Rails API solved a classic N+1 query problem in its three-way-merge schema-diff pipeline. Each schema-change branch is represented as a schema snapshot: a git-commit-like record with one or two parents. Finding the common-ancestor merge base between two branches requires a breadth-first search over the parent DAG. Small databases have a handful of snapshots, and the BFS runs in milliseconds; large databases have thousands, and the naïve per-step `SchemaSnapshot.find(id)` degenerates into a tight network-bound loop of sequential `SELECT * FROM schema_snapshots WHERE id = N` queries. Rails' standard `includes` eager loading doesn't help because the traversal shape is dynamic: each step's next node depends on the previous node's parents, which are unknown until fetched. The fix is an ad-hoc `SnapshotMergeBaseCache`, a plain Hash-backed in-memory cache scoped to one merge-base calculation and instrumented with `@hits` / `@misses` counters, paired with a bulk-preload step that loads the `FROM_PRELOAD_COUNT` and `INTO_PRELOAD_COUNT` most recent snapshots of each branch into the cache before the BFS begins. The preload covers most traversals; the per-step lookup still falls back to the database for deep histories. The rollout used a correctness cross-check against thousands of real databases (the old and new methods must produce the same merge base), then a feature-flagged gradual ramp, with the per-request hit/miss counters surfaced as telemetry to calibrate the preload counts before the 100% rollout. The post closes with a two-alternative comparison: recursive CTEs (push the traversal into MySQL) and the materialized-paths technique (store the ancestor chain in a single column like `20/19/15/10/5/3/1`). Both are viable and more elegant in the abstract, but both require invasive schema changes PlanetScale had structural reasons to avoid: recursive CTEs would push work onto the database server at an already-hot multi-tenant tier, and materialized paths scale poorly when relationship histories reach thousands of entries. The in-memory cache was chosen specifically because it layers onto existing code without schema changes, which minimised rollout risk on a customer-impacting code path.
Key takeaways¶
- N+1 query problem, canonical form on the application tier: a loop that does one DB lookup per record ("select * from schema_snapshots where id = 20, select * from schema_snapshots where id = 21, select * from schema_snapshots where id = 22, … thousands more queries"), where individual queries are fast but the cumulative network round-trip cost dominates. "Even though each query was fast, the network time alone added up quickly." Canonical first wiki source framing the N+1 problem at the generic application-tier altitude, beyond the two existing narrow mentions (cross-shard NL join in concepts/nested-loop-join + batching-bottleneck instance in concepts/network-round-trip-cost).
- Rails `includes` eager loading doesn't solve dynamic traversals. "Usually, fixing these in Rails is quite simple. There is an `includes` method for preloading the data we need. Unfortunately, in this case, due to the data structure, the normal preloading techniques do not work." `includes` works for statically declared association sets ("load all `posts` + their `comments`") but not for graph walks where step N+1's set of nodes depends on step N's fetched data. The workaround is to predict the likely-accessed set before traversal begins; in this case, "the snapshot history is pretty predictable when finding the merge base. We could preload the X most recent snapshots for each branch."
- BFS as the merge-base algorithm on a 2-parent DAG: PlanetScale models schema-change history as "a git commit that stores the state of the schema at a specific time. Each snapshot can have one or two parents. When merging branches, we perform a breadth-first search on the history of each change until we find the common ancestor between both branches. This is the merge base." Same algorithmic shape as git's `merge-base` on a commit DAG with up to 2 parents per merge commit: textbook lowest-common-ancestor on a DAG via simultaneous BFS from both branch tips.
- Ephemeral in-memory cache scoped to one traversal, hash-backed (keyed by `id`), with explicit hit/miss counters. Canonical in-memory cache for tree traversal pattern: no TTL, no invalidation, no cross-request sharing; born at the start of one merge-base calculation, dies at the end. Scope-matched to the work unit. The counters are the point: they make the cache's efficacy measurable in production.

  ```ruby
  class SnapshotMergeBaseCache
    attr_reader :store, :hits, :misses

    def initialize
      @store = {}
      @misses = 0
      @hits = 0
    end

    def get(id)
      if @store[id]
        @hits += 1
        return @store[id]
      end

      @misses += 1
      add(SchemaSnapshot.find(id))
    end

    def add(node)
      @store[node.id] = node
    end
  end
  ```
- Bulk preload before the dynamic walk: load "the X most recent snapshots for each branch" into the cache in two SQL queries (one per branch, each with a `LIMIT`) before kicking off the BFS. The BFS then almost always hits cache during traversal: "Now, when running our breadth-first search, we can use `cache.get(id)` to find the next node. It hits the cache in most cases, avoiding the network request and solving our performance problem." Two upper-case constants, `FROM_PRELOAD_COUNT` and `INTO_PRELOAD_COUNT`, externalise the tuning knobs, which matters for item 6.
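A sketch of the preload step's shape. The post only says two queries, one per branch, each with a LIMIT; the constant values, row shape, and helper below are all illustrative stand-ins.

```ruby
# Illustrative shape only: FROM_PRELOAD_COUNT / INTO_PRELOAD_COUNT are
# named in the post, but their values are not disclosed; rows here are
# plain hashes where the real code loads SchemaSnapshot rows via LIMIT.
FROM_PRELOAD_COUNT = 50
INTO_PRELOAD_COUNT = 50

# Two bulk loads (one per branch) replace thousands of per-step finds.
def preload(cache, from_branch_rows, into_branch_rows)
  from_branch_rows.last(FROM_PRELOAD_COUNT).each { |row| cache[row[:id]] = row }
  into_branch_rows.last(INTO_PRELOAD_COUNT).each { |row| cache[row[:id]] = row }
  cache
end

# Linear histories standing in for each branch's snapshot history.
from_rows = (1..200).map { |i| { id: i, parent_id: i - 1 } }
into_rows = (150..260).map { |i| { id: i, parent_id: i - 1 } }

cache = preload({}, from_rows, into_rows)
cache.size # => 100 (the 50 most recent ids from each branch)
```

The BFS then consults this cache first and only falls back to the database for ids older than the preload window, which is the long-tail case the summary mentions.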
- Cache-stats-guided feature rollout: correctness was validated first ("We ran a few tests where we calculated the merge base using both the old and new methods for thousands of databases. This made us confident the new code was returning the correct results."), then a feature-flagged gradual rollout instrumented the in-production hit/miss counters: "We then used feature flags to test rolling out the new code path and recorded data on how it performed. The `hits` and `misses` data proved useful for fine-tuning the number of snapshots we preloaded. After a couple iterations, we released it to 100% of our customers." The cache counters from item 4 are the per-request telemetry that drives the preload-count tuning loop; the feature-flag ramp is the safety envelope. Canonical wiki instance of "the counters you added for debugging become the rollout-calibration signal."
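The tuning loop just described reduces to per-request hit-rate arithmetic; a minimal sketch (the function name and numbers are made up, the post discloses none):

```ruby
# Hit rate per merge-base calculation: the signal for whether the
# preload LIMIT covers real traversals. Persistently low rates suggest
# raising the preload count; rates near 1.0 suggest it is large enough.
def cache_hit_rate(hits, misses)
  total = hits + misses
  total.zero? ? 0.0 : hits.to_f / total
end

cache_hit_rate(3, 1) # => 0.75
```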
- Two rejected alternatives with explicit reasoning; both involve invasive schema changes that PlanetScale avoided. (a) Push the traversal into MySQL via a recursive CTE: "This can be accomplished with a recursive common table expression. With this, the database could follow the pointer to each record until it finds the common merge base." Rejected implicitly: moving work onto a multi-tenant database server is a structural cost the application-tier cache avoids. (b) The materialized-paths technique: "It stores the relationship history in a single column, such as `20/19/15/10/5/3/1`. By doing this, you can then look at the history of two nodes and find their common parent. This is a great option that works well for tree structures with a known limit. In our case, storing thousands of relationships didn't make this feasible." Rejected on the scale axis: materialized paths balloon to kilobyte-scale strings when ancestor chains reach thousands of entries, and every insert has to update every descendant's path.
Layered-on-existing-code as a load-bearing design criterion: "Adding an in-memory cache is just one way of solving this problem. This worked out best for us due to the high number of snapshots we needed to traverse for some databases. It was also simple to layer this solution onto our existing code without many major changes. This reduced the risk when rolling it out." Three-line framing of why application-tier cache was chosen over structurally cleaner alternatives: (a) existing code worked correctly at small scale; (b) schema-change alternatives (CTE, materialized path) would have required data-model changes on a hot customer-impacting table; (c) ephemeral in-memory cache attaches at one call site and changes no data. Rollout risk is the multiplier on architectural elegance.
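For contrast, the materialized-paths lookup rejected above is genuinely simple, which is why the post calls it "a great option" for bounded trees. A toy sketch (the path format follows the post's `20/19/15/10/5/3/1` example; everything else is illustrative):

```ruby
# Materialized path: each node stores its whole ancestor chain in one
# column. The common ancestor of two nodes is the first id they share,
# reading each path tip-to-root.
def common_ancestor(path_a, path_b)
  seen = path_b.split("/").to_h { |id| [id, true] }
  path_a.split("/").find { |id| seen[id] }
end

common_ancestor("20/19/15/10/5/3/1", "22/18/15/10/5/3/1") # => "15"

# The scale problem is visible in the column itself: thousands of
# ancestors means a multi-kilobyte path string per row, rewritten for
# every descendant on each insert.
deep_path = (1..5000).to_a.join("/") # roughly 24 KB of ids and slashes
```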
Systems / concepts extracted¶
Systems¶
- systems/planetscale — the Rails API under discussion.
- systems/ruby-on-rails — the application framework; ActiveRecord `includes` is the eager-loading primitive that doesn't help for dynamic traversals.
- systems/mysql — the database storing schema snapshots. The post is engine-agnostic but the author's context is MySQL/Vitess.
Concepts¶
- concepts/n-plus-one-query-problem — classic database anti-pattern: one query per iteration of a loop over N items. Canonical first wiki page.
- concepts/breadth-first-search-merge-base — finding the lowest common ancestor in a 2-parent DAG via simultaneous BFS from both branch tips.
- concepts/schema-snapshot — git-commit-like record of a database schema state at a point in time, with one or two parent snapshots to support branch+merge.
- concepts/network-round-trip-cost — the unit cost that dominates when individual queries are fast but many.
- concepts/cache-hit-rate — the load-bearing metric for tuning the preload count.
- concepts/feature-flag — the rollout safety envelope.
- concepts/recursive-cte — rejected alternative: database-side traversal.
Patterns¶
- patterns/in-memory-cache-for-tree-traversal — ephemeral, work-scoped, hash-backed, hit/miss-counted.
- patterns/bulk-preload-before-traversal — load the predictable recent-history set upfront, fall back to per-step lookups for the long tail.
- patterns/cache-stats-guided-feature-rollout — instrument the cache with counters, roll out behind a feature flag, use the counters as the calibration signal.
Operational numbers¶
- Query shape at degradation: "thousands more queries" for one merge-base calculation on a customer database with a long schema-change history. Pre-fix, each `SchemaSnapshot.find(id)` is one round trip; the aggregate wall-clock is dominated by N × network-round-trip time, not by MySQL execution.
- Preload constants: `FROM_PRELOAD_COUNT` and `INTO_PRELOAD_COUNT`. Concrete values not disclosed, but externalised as constants specifically so the counter-driven rollout could tune them without code changes. "The `hits` and `misses` data proved useful for fine-tuning the number of snapshots we preloaded."
- Correctness validation: "we calculated the merge base using both the old and new methods for thousands of databases". N is unspecified but explicitly "thousands", which is operationally meaningful (it rules out a handful of hand-picked examples).
- Rollout iteration count: "After a couple iterations, we released it to 100%" — canonical example of the feature-flag ramp taking 2-3 adjustments before reaching full rollout.
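The post's qualitative claim that "the network time alone added up quickly" is just multiplication; with illustrative numbers (none are from the post):

```ruby
# Hypothetical: even at 1 ms per round trip, a few thousand sequential
# single-row SELECTs is seconds of pure network wait, regardless of how
# fast MySQL executes each one.
round_trip_ms = 1.0
queries = 5_000
total_ms = queries * round_trip_ms # => 5000.0, i.e. five seconds
```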
Caveats¶
- No quantitative before/after numbers. The post frames the problem as "thousands of queries → one round-trip dominated loop" and the fix as "hits cache in most cases", but gives no millisecond measurements, p99 numbers, or query-count reductions. Compare to the same author's 2022-08-16 FastPage post (sources/2026-04-21-planetscale-introducing-fastpage-faster-offset-pagination-for-rails-apps) which quantifies the 2.7× speedup; this post is qualitative only.
- Preload constant values undisclosed. `FROM_PRELOAD_COUNT` and `INTO_PRELOAD_COUNT` are left as named constants; the values Coutermarsh landed on (and the distribution of snapshot-history lengths they're tuned against) are not published.
- Cache invalidation is trivial by construction (the cache lives for one merge-base calculation and is discarded), but this means the post doesn't address the separate question of cross-request schema-snapshot caching, which would face the usual invalidation problems on a hot-write table.
- The three alternatives (in-memory cache, recursive CTE, materialized paths) aren't benchmarked against each other. The post's rejection of CTE is implicit ("letting the database do the work" offered as an option, no further analysis); the rejection of materialized paths is on a scale axis ("storing thousands of relationships didn't make this feasible") but the crossover point isn't quantified.
- BFS-as-merge-base algorithmic details elided. The post names the algorithm but doesn't walk through termination conditions, duplicate-ancestor handling, or the specific dual-queue shape needed to handle the 2-parent DAG case. The reader is assumed to know how git's `merge-base` works.
- No discussion of worst-case preload miss. If a customer has a schema history deep enough that the preload doesn't cover the merge base, the cache offers no speedup and the N+1 loop runs in full. The post doesn't disclose what fraction of traversals hit this case in production, or whether the preload count grows with observed miss rate.
- Not database-engine-specific content. This is an application-tier Rails post: ActiveRecord plus an ad-hoc cache class. The same pattern applies to any ORM with eager-load support that doesn't handle dynamic traversals (Django, SQLAlchemy, Hibernate, Eloquent). PlanetScale-specific substrate (MySQL, Vitess, branching) is background context, not the subject.
- Publication-date ambiguity. The byline reads "Mike Coutermarsh | July 12, 2023"; raw frontmatter shows `published: 2026-04-21` (the re-fetch date). Dating in this source page uses the 2023-07-12 byline as the authoritative publication date.
Source¶
- Original: https://planetscale.com/blog/performant-database-tree-traversal-with-rails
- Raw markdown:
raw/planetscale/2026-04-21-performant-database-tree-traversal-with-rails-098ed913.md
Related¶
- systems/planetscale
- systems/ruby-on-rails
- systems/mysql
- concepts/n-plus-one-query-problem
- concepts/breadth-first-search-merge-base
- concepts/schema-snapshot
- concepts/network-round-trip-cost
- concepts/cache-hit-rate
- concepts/feature-flag
- concepts/recursive-cte
- patterns/in-memory-cache-for-tree-traversal
- patterns/bulk-preload-before-traversal
- patterns/cache-stats-guided-feature-rollout
- sources/2026-04-21-planetscale-non-blocking-schema-changes — same branching / three-way-merge substrate this post builds on
- sources/2026-04-21-planetscale-how-we-made-planetscales-background-jobs-self-healing — same Mike Coutermarsh Rails-backend voice
- sources/2026-04-21-planetscale-introducing-fastpage-faster-offset-pagination-for-rails-apps — same author, same Rails-backend altitude, quantitative twin