
PLANETSCALE 2026-04-21 Tier 3


PlanetScale — Debugging database errors with Insights

Summary

Rafer Hazen (PlanetScale; originally 2022-09-27, re-surfaced in the 2026-04-21 feed snapshot) announces the error tracking feature of Query Insights and uses a real PlanetScale production debugging session as the case study. The post is a dual disclosure: (a) the mechanism — Insights now surfaces database errors per-query-pattern, each occurrence carrying a timestamp, normalised SQL, and associated tags (most usefully the actor tag that identifies who originated authenticated requests); and (b) a worked example of a genuine check-then-act race condition caught by those tags. PlanetScale staff were seeing occasional upticks in AlreadyExists errors on the database_branch_password table — "a few 10s to a few hundred per day" — against the uniqueness index t.index ["database_branch_id", "display_name"], name: "idx_branch_id_display_name", unique: true defined in the Rails schema. The Rails ActiveRecord DatabaseBranchPassword model declares validates :display_name, uniqueness: { scope: [:database_branch_id] }, and in the dev environment two identical inserts do produce the expected ActiveRecord validation error. Yet in production, two concurrent inserts were racing: each request's uniqueness validation queries the database, finds no conflicting row, and proceeds to insert — then MySQL's unique index rejects one of them at insert time with AlreadyExists. Insights' two observability signals — "Occurrences of these errors come in small batches with nearly identical timestamps" and "the actor tag for each batch of errors is always the same" — made the race hypothesis (plus the single misbehaving caller) self-evident without heap dumps, tracing, or log archaeology. The actor lookup revealed "an internal tool was issuing multiple password create requests in parallel"; the fix was to serialise the tool.
Post closes by canonicalising the tag-search surface: tag:tag_name:tag_value filters queries/errors by a specific tag value; tag:tag_name matches any value of that key; tag capture on queries has one caveat — "to associate tags, a query pattern must have had at least one query that took more than 1 second, read more than 10k rows, or resulted in an error."
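
As a concrete illustration of that search surface, the two forms can be sketched as a tiny filter in plain Ruby — a hypothetical re-implementation for clarity, not PlanetScale's parser; the occurrence shape (`:tags` hash) is assumed:

```ruby
# Hypothetical parser for the two Insights tag-search forms:
#   tag:tag_name:tag_value  -> filter to that exact key/value pair
#   tag:tag_name            -> existence check on the key (any value)
def parse_tag_filter(input)
  parts = input.split(":", 3)
  return nil unless parts.first == "tag" && parts.size >= 2
  { key: parts[1], value: parts[2] } # value is nil for the bare form
end

# Apply a parsed filter to a list of error occurrences, each carrying
# a tags hash (e.g. { "actor" => "internal-tool" }).
def match_filter(occurrences, filter)
  occurrences.select do |occ|
    tags = occ[:tags]
    if filter[:value]
      tags[filter[:key]] == filter[:value]
    else
      tags.key?(filter[:key])
    end
  end
end
```

With `tag:actor:internal-tool` this keeps only occurrences from that caller; with `tag:actor` it keeps any authenticated-request occurrence regardless of who the actor was.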

Key takeaways

  1. Canonical check-then-act race condition disclosed from production. Rails' validates :display_name, uniqueness: { scope: [:database_branch_id] } implements a two-step sequence per request: (i) SELECT to check whether a row with the same scope + display name exists; (ii) INSERT the new row if step (i) returned nothing. Between (i) and (ii) there is no lock — so two concurrent requests each complete step (i) before either has completed step (ii), both see the pre-conflict state, both proceed to (ii), and MySQL's unique index rejects one of them at insert time. The post canonicalises this as "two separate application threads could both query the database at the same time, before either thread had created the record, and then proceed as if there was no issue" — the textbook check-then-act race (or TOCTOU — Time-Of-Check-Time-Of-Use) applied to the concrete Rails-ORM + MySQL-unique-index configuration.

  2. Database as the final arbiter of uniqueness — a load-bearing architectural invariant. Rafer states it verbatim: "The database, as the final arbiter of uniqueness, would then only allow one of these queries to succeed and the other would receive the error we see in Insights." Canonicalised on the wiki as patterns/database-as-final-arbiter-of-uniqueness: application-layer validations (Rails uniqueness:, Django's validate_unique() during full_clean(), Ecto's unsafe_validate_unique/4, etc.) are advisory — they improve UX by returning friendly errors on the common uncontested path — but they cannot guarantee uniqueness under concurrency. The database's unique index (or equivalent row-lock-based conditional insert) is the only primitive that can, because it serialises conflicting inserts at the index itself. This also means the application must handle the database uniqueness error gracefully — a failure path Rails' uniqueness validation alone does not cover. The post doesn't name the fix beyond "modify the script to avoid that behavior" (serialise the caller) but implies the deeper resolution: treat the database error as the authoritative signal, not the ActiveRecord validation.

  3. Actor-tagged errors collapse a cross-request correlation problem into a visual signal. Insights "tags all queries from authenticated requests with information about the actor", so every error occurrence in the feature carries an actor tag. When "the actor tag for each batch of errors is always the same", it is immediately visible in the UI that one caller is responsible — no log joins, no request-ID plumbing, no session cross-referencing required. Canonicalised as concepts/actor-tagged-error: the observability mechanism of attaching principal-identity (or internal-tool-identity) to every error event at capture time. Pairs with patterns/actor-tagged-query-observability — the broader pattern of capturing per-query tags (user/service identity, feature flag, tenant, deploy ID) on the database-observability layer so they're available at debug time without application instrumentation changes.

  4. Temporal co-occurrence as the first debugging signal for concurrency bugs. The diagnostic hierarchy Rafer walks is notable: first the error message + column names pointed at the unique index definition; second the temporal clustering ("occurrences of these errors come in small batches with nearly identical timestamps") told him the bug was concurrency not volume; third the actor tag told him it was a single caller not organic user traffic. The post's implicit claim is that a per-query-error feature that exposes {timestamp, normalised_sql, tags} triples — without any custom telemetry — is sufficient to diagnose concurrency bugs that would otherwise require request-ID tracing or heap dumps. The concrete error volume disclosed is "a few 10s to a few hundred per day" — small enough that a log-grep approach would find and dismiss each error individually; large enough that a rate-graph view ("occasional upticks in the new 'Query errors' graph") surfaces the clustering.

  5. tag:key:value and tag:key as the query-search syntax. Insights canonicalises two tag-search forms: tag:tag_name:tag_value filters to queries or errors where the named tag has the specified value; tag:tag_name matches any value of that key (existence check). Canonicalised as concepts/query-tag-filter. Concrete capture caveat: "a query pattern must have had at least one query that took more than 1 second, read more than 10k rows, or resulted in an error" for tags to be associated with a pattern — i.e. normal cheap queries are pattern-aggregated without tag granularity to keep Insights' storage bounded, but the interesting tail (slow / heavy / failing) is captured with tags intact. An instance of [[concepts/temporal-locality-recency-bias|keep-the-interesting-tail-in-full-detail]] observability economics.

  6. Why the dev-environment reproduction succeeded while production failed is itself the bug. The post's aside — "we verified in a development environment that the uniqueness validation seemed to be working correctly: when we try to create two identical DatabaseBranchPassword rows, we get an ActiveRecord error showing that the name has already been taken" — illustrates the classic concurrency-bug testing trap: sequential tests on a single thread exercise path (i) → path (ii) → path (i) → path (ii) and see the ActiveRecord validation catch the second insert (because by then the first one is in the DB). Only concurrent tests — (i)-thread-A, (i)-thread-B, (ii)-thread-A, (ii)-thread-B in interleaved order — expose the race. Rafer's framing is diplomatic ("with the validation working as expected, what could be going on here?") but the operational lesson is: unit / dev-env tests cannot reproduce a check-then-act race without explicit concurrency injection, and the race is ever-present in production for any parallel-request-capable caller.
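
The dev-vs-prod distinction in takeaways 1 and 6 can be made concrete with a plain-Ruby sketch — `PasswordStore`, `sequential_attempts`, and `interleaved_attempts` are illustrative stand-ins for the validation-then-insert pair, not PlanetScale's code; the "threads" are simulated by explicit interleaving so the race reproduces deterministically:

```ruby
# Plain-Ruby stand-in for the check-then-act pair behind Rails'
# uniqueness validation. `@rows` plays the role of the table; the
# application-level check is `taken?`, the insert is `insert`.
class PasswordStore
  attr_reader :rows

  def initialize
    @rows = []
  end

  # Step (i): the advisory uniqueness check (the validation's SELECT).
  def taken?(name)
    @rows.include?(name)
  end

  # Step (ii): the INSERT. No lock is held between (i) and (ii).
  def insert(name)
    @rows << name
  end
end

# Sequential "dev environment" reproduction: (i)(ii) then (i)(ii).
# The second check sees the first insert, so the validation catches it.
def sequential_attempts(store, name)
  results = []
  2.times do
    if store.taken?(name)
      results << :validation_error
    else
      store.insert(name)
      results << :created
    end
  end
  results
end

# Interleaved "production" schedule: (i)-A, (i)-B, (ii)-A, (ii)-B.
# Both checks run before either insert, so both callers proceed and
# the duplicate lands in the table -- only a database unique index
# would reject the second write.
def interleaved_attempts(store, name)
  a_clear = !store.taken?(name)   # thread A, step (i)
  b_clear = !store.taken?(name)   # thread B, step (i)
  store.insert(name) if a_clear   # thread A, step (ii)
  store.insert(name) if b_clear   # thread B, step (ii)
  store.rows.count(name)          # duplicate count after both commit
end
```

The sequential schedule returns `[:created, :validation_error]` — the dev-env behaviour the team verified — while the interleaved schedule leaves two copies of the row, which is exactly the state the MySQL unique index refuses and surfaces as AlreadyExists.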

Insights error-tracking UI + debug walkthrough

Rafer walks the debugging session through three Insights screens:

  1. Errors tab rate graph — "occasional upticks in the new 'Query errors' graph on our main production database". A per-error-class time series; the clustering / spikiness is the first-order signal for concurrency bugs (organic traffic produces a flatter error rate).
  2. Errors list — one row per error pattern with the error message ("AlreadyExists" on the database_branch_password table), rollup count, and a "last seen" timestamp. Click-through → occurrences.
  3. Error occurrences page — the load-bearing view. "This page shows the full error message and lists individual occurrences of the errors with the timestamp, normalized SQL query, and any associated tags." Triple-keyed: {timestamp, normalised_sql, tags}. This is where Rafer read the three load-bearing signals:
       • "The error is coming from the DatabaseBranchPasswords#create action" — inferred from the controller-name tag or the normalised SQL's matching route.
       • "Occurrences of these errors come in small batches with nearly identical timestamps" — temporal clustering → concurrency-bug hypothesis.
       • "The actor tag for each batch of errors is always the same" — single-caller hypothesis, immediately actionable.
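
The two clustering signals amount to a simple computation over the {timestamp, tags} occurrence rows. A hypothetical plain-Ruby reconstruction (field names, the 1-second batch gap, and the occurrence shape are all assumptions, not disclosed Insights internals):

```ruby
require "time"

# Group occurrences into "batches": consecutive occurrences whose
# timestamps are within `gap` seconds of the previous one.
def batch_occurrences(occurrences, gap: 1.0)
  sorted = occurrences.sort_by { |o| o[:timestamp] }
  batches = []
  sorted.each do |occ|
    if batches.any? && occ[:timestamp] - batches.last.last[:timestamp] <= gap
      batches.last << occ   # nearly identical timestamp: same batch
    else
      batches << [occ]      # gap exceeded: start a new batch
    end
  end
  batches
end

# True when every batch carries exactly one distinct actor tag --
# the "single misbehaving caller" signal.
def single_actor_per_batch?(batches)
  batches.all? { |b| b.map { |o| o[:tags]["actor"] }.uniq.size == 1 }
end
```

Two sub-second occurrences followed by one an hour later yield two batches; if each batch's actor tags are uniform, the single-caller hypothesis falls out of the data directly.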

The post frames this as a repeatable debugging workflow: error-pattern → occurrences → tag-based actor lookup → code-level fix. Insights' contribution is that every step short of the final code edit happens inside the observability UI — no application instrumentation, no log access, no custom telemetry required.

Unique index + ActiveRecord validation (verbatim)

Rafer discloses the exact Rails schema definition:

t.index ["database_branch_id", "display_name"],
    name: "idx_branch_id_display_name", unique: true

and the model-level validation:

class DatabaseBranchPassword < ApplicationRecord
  # ...
  validates :display_name, uniqueness: { scope: [:database_branch_id] }
  # ...
end

Both are present. Both are correct in isolation. The race exists because the ActiveRecord validation is implemented as a SELECT then INSERT without an intervening FOR UPDATE / row-lock / serialisable-transaction / INSERT ... ON DUPLICATE KEY UPDATE / INSERT ... ON CONFLICT DO NOTHING primitive. It is structurally impossible for any advisory-only check-then-insert pair to avoid a race without a lock acquired between the check and the insert.
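
The structural alternative — treat the unique index as authoritative and translate its error into the validation's friendly message — can be sketched in plain Ruby. `UniqueIndexViolation` and `FakeTable` here are stand-ins for MySQL's duplicate-key error and ActiveRecord::RecordNotUnique; in real Rails the equivalent is `create` + `rescue ActiveRecord::RecordNotUnique` (or `create_or_find_by`, which packages the same pattern):

```ruby
# Stand-in for the duplicate-key error the unique index raises
# (ActiveRecord surfaces it as RecordNotUnique).
class UniqueIndexViolation < StandardError; end

# Minimal table with a composite unique index on
# [branch_id, display_name] -- the final arbiter of uniqueness.
class FakeTable
  def initialize
    @index = {}
  end

  def insert(branch_id, display_name)
    key = [branch_id, display_name]
    raise UniqueIndexViolation, "AlreadyExists" if @index.key?(key)
    @index[key] = true
  end
end

# The fix: no advisory pre-check at all. Attempt the insert and
# translate the index violation into the same friendly error the
# validation would have produced. Safe under any interleaving,
# because the conflict is detected atomically at insert time.
def create_password(table, branch_id, display_name)
  table.insert(branch_id, display_name)
  :created
rescue UniqueIndexViolation
  :display_name_taken
end
```

The design point: the rescue path makes the database error a first-class outcome rather than an unhandled 500, which is precisely the failure path the uniqueness validation alone leaves uncovered.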

Concrete signals the post discloses

  • Error rate: "a few 10s to a few hundred per day" — order-of-magnitude for the AlreadyExists volume at PlanetScale internal-staff workload levels.
  • Error-spike shape: "come in small batches with nearly identical timestamps" — sub-second clustering.
  • Actor identity: "the actor tag for each batch of errors is always the same" — single authenticated caller (an internal tool).
  • Fix kind: "modify the script to avoid that behavior" — caller-side serialisation, not a DB schema / locking / application-code fix on the server side.
  • Tag-capture precondition: "a query pattern must have had at least one query that took more than 1 second, read more than 10k rows, or resulted in an error" — the tail-capture threshold Insights applies.

Caveats

  • Publication date ambiguity: raw frontmatter published: is 2026-04-21 (re-surface date in the feed snapshot); the post's body byline reads "Rafer Hazen | September 27, 2022". Architectural content is still current — the check-then-act race class is timeless, and the Insights tag-search surface has been stable since this post.
  • Tier: PlanetScale is Tier-3 on the wiki's feed taxonomy (marketing + education mix). This post clears scope by (a) being a genuine production debug narrative with concrete numbers (error rate, fix kind, tag-capture threshold), (b) canonicalising the Insights error-tracking + tag-search surface, and (c) the check-then-act race being a canonical distributed-systems hazard that merits a wiki page regardless of the source vehicle. Rafer Hazen also authored the AI-powered index suggestions launch post already on the wiki (sources/2026-04-21-planetscale-ai-powered-postgres-index-suggestions) — same byline, same Insights-centric narrative shape.
  • The fix path the post chose vs the structural fix the post did not ship. Caller-side serialisation is a workaround — it removes the symptom without removing the race. A correct structural fix in the DatabaseBranchPassword creation path would wrap the validation + insert in a transaction with either SELECT ... FOR UPDATE on a row keyed on database_branch_id, or rely on the unique index and rescue ActiveRecord::RecordNotUnique to return the same friendly error the validation returns. Rafer acknowledges this implicitly ("because an interactive user is unlikely to issue password creates quickly enough to trigger this behavior, we were content to call this issue solved") — pragmatic but not the load-bearing fix.
  • Insights as the disclosure vehicle vs the general observation. The post reads as an Insights feature announcement — and it is — but the canonical architectural content is the check-then-act race pattern and the actor-tagged-error observability primitive; the Insights UI is just the vehicle that made both visible. A future ingest from any other database-observability vendor describing the same workflow (Datadog DBM, pganalyze, AWS Performance Insights) would reinforce the same concepts.

Source
