DATABRICKS 2026-04-30

Databricks — Backstage with Lakebase (Part 1: Deployment Cycles)

Summary

Thoughtworks ran a proof-of-concept ripping Backstage (Spotify's state-heavy Internal Developer Portal) off its standard Postgres database and pointing it at Databricks Lakebase (Neon-lineage serverless Postgres). The post is Part 1 of a three-part series (Part 1: Deployment Cycles, Part 2: Governance, Part 3: FinOps) and focuses on what happens to database development cycles when creating a copy of the database becomes functionally free. Two operational datapoints anchor the architecture discussion: a 63 MB Backstage catalog branch lands in 1.09 seconds (data plane), and a Point-in-Time Recovery from deleted state completes end-to-end in 3.78 seconds. The thesis is that this collapses two separate engineering practices into the same primitive — branching is PITR with source_branch_time = now — and rearranges the development cycle enough to deprecate 20-30% of test code (mock objects for database interfaces).

Key takeaways

  1. Wire-protocol-Postgres compatibility is the first-order integration property. "Because it speaks wire-protocol Postgres, Backstage doesn't know or care that it isn't talking to RDS." Backstage's application logic, Knex migrations, and PgSearchEngine swap all ran cleanly after pointing app-config.yaml at Lakebase. The only integration friction was at the auth tier, not the protocol tier. This is the operational payoff of Lakebase's Neon-lineage choice to keep upstream Postgres semantics while rewriting the storage layer. (Source: sources/2026-04-30-databricks-backstage-with-lakebase)
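
A minimal sketch of what that compatibility means in practice, assuming a hypothetical Lakebase endpoint (the host and database names below are placeholders, not the POC's values): a stock node-postgres client connects unmodified, and only the JWT password (takeaway 2) differs from a vanilla RDS connection.

```ts
// Stock node-postgres client pointed at a hypothetical Lakebase endpoint.
import { Client } from "pg";

const client = new Client({
  host: "my-instance.database.cloud.databricks.com", // placeholder endpoint
  port: 5432,
  user: "backstage",
  password: process.env.DATABRICKS_TOKEN, // short-lived OAuth JWT, not a PAT
  database: "backstage_plugin_catalog",   // placeholder database name
  ssl: true,
});

await client.connect();
// Plain wire-protocol Postgres from here on; nothing Lakebase-specific.
const { rows } = await client.query("select count(*) from final_entities");
console.log(rows[0].count);
await client.end();
```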

  2. Lakebase rejects classic Databricks PATs; expects OAuth JWTs. Verbatim: "Lakebase rejects classic Databricks Personal Access Tokens, expecting an OAuth JWT instead." The Databricks CLI provides databricks postgres generate-database-credential, which mints a scoped, short-lived JWT for a specific endpoint — "the intended approach for apps and CI." For the Backstage POC, Thoughtworks wrapped the command in a lightweight cron script that rewrote DATABRICKS_TOKEN in the .env file every 50 minutes to handle token expiration. Canonical patterns/credential-refresh-cron-as-auth-compat-shim — a shim bridging the gap between the short-lived-JWT model Lakebase prefers and the long-lived-credential model a legacy integration assumes. See also concepts/oauth-jwt-short-lived-credential. (Source: sources/2026-04-30-databricks-backstage-with-lakebase)
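
A sketch of that shim in TypeScript rather than cron, for concreteness. The CLI command is the one the post names; the flag-less invocation, the JSON output shape (a token field), and the .env-rewrite mechanics are assumptions, not documented behavior.

```ts
// Hypothetical refresh shim: mint a fresh JWT and rewrite DATABRICKS_TOKEN.
import { execFileSync } from "node:child_process";
import { readFileSync, writeFileSync } from "node:fs";

function refreshToken(envPath = ".env"): void {
  // Mint a scoped, short-lived OAuth JWT for the Lakebase endpoint.
  const out = execFileSync(
    "databricks",
    ["postgres", "generate-database-credential"], // endpoint flags omitted
    { encoding: "utf8" },
  );
  const token: string = JSON.parse(out).token; // assumed output shape

  // Rewrite DATABRICKS_TOKEN in place, as the POC's cron script did.
  const env = readFileSync(envPath, "utf8");
  writeFileSync(
    envPath,
    env.replace(/^DATABRICKS_TOKEN=.*$/m, `DATABRICKS_TOKEN=${token}`),
  );
}

refreshToken();
setInterval(refreshToken, 50 * 60 * 1000); // every 50 min, inside the JWT lifetime
```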

  3. Branching is instant because it's a pointer, not a copy. The post names the mechanism explicitly: "Because Lakebase separates storage from compute using a copy-on-write architecture, creating a branch doesn't copy any data, it creates a pointer to the same underlying pages, and only diverges on write." This is the concepts/copy-on-write-storage-fork primitive; the Neon-lineage systems/pageserver-safekeeper is the substrate that makes it possible. What the CMK-era Lakebase page disclosed as a storage architecture is named here as a developer-cycle primitive. (Source: sources/2026-04-30-databricks-backstage-with-lakebase)
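
A toy illustration of the mechanism — copy-on-write forking in general, not Lakebase's implementation: a branch is born as a copy of the parent's page table (pointers only), so creation cost is independent of data volume, and physical pages are duplicated only when a branch writes.

```ts
// Toy copy-on-write fork: branches share physical pages until first write.
type PhysicalPage = number;

class Branch {
  // Logical page number -> physical page. Forking copies only this map.
  constructor(private pages: Map<number, PhysicalPage>) {}

  static fork(parent: Branch): Branch {
    return new Branch(new Map(parent.pages)); // pointers only, no page data
  }

  read(page: number): PhysicalPage | undefined {
    return this.pages.get(page);
  }

  write(page: number, newPhysical: PhysicalPage): void {
    this.pages.set(page, newPhysical); // diverges from the parent on write
  }
}
```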

  4. Branch API requires a spec-nested body with an explicit lifetime. Undocumented gotcha: "the request body must nest everything inside a spec object, and you must specify ttl, expire_time, or no_expiry. Without that, the API returns 'Expiration must be specified.'" This is the first wiki-ingested concrete detail of Lakebase's branch-creation API surface — the lifetime declaration is mandatory, not optional. Canonicalises a design choice: branches are short-lived by default and long-lived-ness requires explicit opt-in. (Source: sources/2026-04-30-databricks-backstage-with-lakebase)
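
A hedged sketch of the call shape. The spec nesting and the mandatory ttl / expire_time / no_expiry choice are from the post; the endpoint path, the field names inside spec, and the auth header are illustrative assumptions.

```ts
// Hypothetical branch-creation request against the Lakebase REST surface.
const res = await fetch(
  "https://my-workspace.cloud.databricks.com/api/2.0/postgres/branches", // assumed path
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.DATABRICKS_TOKEN}`, // OAuth JWT
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      spec: {
        // Everything must nest inside `spec`...
        source_branch: "production", // hypothetical field name
        // ...and a lifetime is mandatory: ttl, expire_time, or no_expiry.
        // Omit all three and the API returns "Expiration must be specified."
        ttl: "3600s",
      },
    }),
  },
);
if (!res.ok) throw new Error(await res.text());
```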

  5. Disclosed branching throughput: ~63 MB Backstage catalog → 1.09-second data-plane clone. "The control plane acknowledged it instantly. The actual data-plane clone of the ~63 MB Backstage catalog landed in 1.09 seconds." First wiki operational datapoint on Lakebase/Neon branch-creation time at MB-scale dataset granularity — prior ingests (LangGuard 2026-04-27, Stripe Projects 2026-04-29) disclosed branching latency only as "seconds" or "sub-350 ms" for cold Postgres provisioning. This post separates the control-plane acknowledgement (instant) from the data-plane clone (1.09 s for 63 MB). At this size the branching cost is dominated by fixed setup, not data volume — predictable from the copy-on-write architecture. (Source: sources/2026-04-30-databricks-backstage-with-lakebase)

  6. Point-in-Time Recovery (PITR) completes end-to-end in 3.78 seconds. The POC wiped final_entities (32 rows → 0), then created a recovery branch from a timestamp captured seconds before the delete. "The elapsed time end-to-end was 3.78 seconds. Verifying the data confirmed the recovered branch had all 32 entities back; production was still at zero, confirming the delete was real and the branches are fully isolated." Canonical concepts/point-in-time-recovery at Lakebase/Neon altitude; completes orders of magnitude faster than the traditional snapshot-restore shape (minutes to hours for RDS). (Source: sources/2026-04-30-databricks-backstage-with-lakebase)
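
The incident cycle, sketched under assumed names — countEntities is a hypothetical helper, and the branch-creation call is the one sketched under takeaway 4:

```ts
import { Client } from "pg";

// Hypothetical helper: row count of final_entities on a given endpoint.
async function countEntities(connectionString: string): Promise<number> {
  const c = new Client({ connectionString });
  await c.connect();
  const { rows } = await c.query(
    "select count(*)::int as n from final_entities",
  );
  await c.end();
  return rows[0].n;
}

// 1. Capture a timestamp just before the incident.
const before = new Date().toISOString();

// 2. ...accidental DELETE FROM final_entities runs against production...

// 3. Create a recovery branch as of the captured timestamp (the takeaway-4
//    call shape, with source_branch_time: before added inside spec).

// 4. Verify isolation: production is empty, the recovery branch has all 32
//    rows, confirming the delete was real and the branches are independent.
console.log(await countEntities(process.env.PROD_URL!));     // 0
console.log(await countEntities(process.env.RECOVERY_URL!)); // 32
```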

  7. WAL-record granularity: recovery snapped backward 12 seconds to the nearest record. "Notably, we asked for 22:56:02Z, but Lakebase snapped to 22:55:50Z, 12 seconds earlier, snapping backward to the nearest WAL record." Canonicalises concepts/wal-record-granularity as a first-class property: PITR granularity is bounded by WAL-record cadence, not by the caller's timestamp precision. Recovery always rounds backward to the nearest known durable state — a structural property, not a bug, but load-bearing for time-sensitive recovery workflows because the caller's chosen target time is best-effort. The incident cycle still ran in under a minute. (Source: sources/2026-04-30-databricks-backstage-with-lakebase)
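
The practical consequence for time-sensitive workflows: the snapped source time should be checked, not trusted. A defensive sketch — that the response reports the actual snapped time is an assumption, and the date portions below are placeholders since the post gives only the times:

```ts
// Fail loudly if PITR snapped back further than the workflow can tolerate.
function assertSnapBackWithin(
  requestedIso: string,
  actualIso: string, // the source time the control plane actually used
  maxSeconds: number,
): void {
  const gapSeconds = (Date.parse(requestedIso) - Date.parse(actualIso)) / 1000;
  if (gapSeconds < 0) {
    throw new Error(`snapped forward? requested ${requestedIso}, got ${actualIso}`);
  }
  if (gapSeconds > maxSeconds) {
    throw new Error(
      `PITR snapped back ${gapSeconds}s, exceeding the ${maxSeconds}s tolerance`,
    );
  }
}

// The POC's numbers (dates are placeholders): a 12-second snap-back.
assertSnapBackWithin("2026-04-29T22:56:02Z", "2026-04-29T22:55:50Z", 30);
```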

  8. Branching is PITR-with-time-now. Architectural unification disclosed in one sentence: "Branching and Point-in-Time Recovery (PITR) are essentially the same primitive: branching is just PITR with source_branch_time = now." Canonical patterns/branching-is-pitr-with-time-now. The two operations are the same control-plane call with a different time parameter; the storage substrate is the same concepts/copy-on-write-storage-fork. This unification is architecturally load-bearing because it means every risky operation gets a dry run and every incident gets an undo — "When database state becomes a cheap, forkable artifact instead of a 2 TB EBS volume, every risky operation gets a dry run, and every incident gets an undo." (Source: sources/2026-04-30-databricks-backstage-with-lakebase)
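
The unification fits in one function. This sketch reuses the takeaway-4 assumptions (hypothetical endpoint path and field names); the PITR timestamp reuses the post's example time with a placeholder date.

```ts
// A branch is a PITR call whose source time is "now".
async function createBranch(spec: Record<string, unknown>): Promise<{ name: string }> {
  const res = await fetch(
    "https://my-workspace.cloud.databricks.com/api/2.0/postgres/branches", // assumed path
    {
      method: "POST",
      headers: {
        Authorization: `Bearer ${process.env.DATABRICKS_TOKEN}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ spec }),
    },
  );
  if (!res.ok) throw new Error(await res.text());
  return res.json() as Promise<{ name: string }>;
}

async function forkBranch(sourceTime?: string): Promise<{ name: string }> {
  return createBranch({
    // Pass now (or omit) and this is plain branching; pass a past timestamp
    // and the identical call is Point-in-Time Recovery.
    source_branch_time: sourceTime ?? new Date().toISOString(),
    ttl: "3600s", // lifetime is mandatory either way (takeaway 4)
  });
}

const dryRun = await forkBranch();                       // branching
const rescue = await forkBranch("2026-04-29T22:55:50Z"); // PITR
```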

  9. Cheap branching deprecates mock objects. The post's sprint-cycle comparison makes the claim concrete: "In our experience across multiple partner teams evaluating this workflow, mock objects account for 20-30% of test code. That's not test coverage — it's test infrastructure. Infrastructure that diverges from production behavior over time, creating false confidence. When branching a production-equivalent database costs nothing, mocking becomes the expensive choice." Canonical concepts/mock-object-maintenance-cost + patterns/database-branch-per-test-over-mocking. The structural insight is that mock objects carry a maintenance cost (divergence from production behavior) plus a correctness cost (false confidence from passing tests that don't reflect production), previously justified only by the unavailability of cheap real-database environments. (Source: sources/2026-04-30-databricks-backstage-with-lakebase)
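
What the pattern looks like in a Jest-style suite. Every helper here (createTestBranch, deleteBranch, knexFor, runCatalogRefresh) is hypothetical scaffolding, stubbed for the sketch, not Backstage or Lakebase API:

```ts
// Hypothetical scaffolding, stubbed for the sketch:
declare function createTestBranch(opts: {
  ttl: string;
}): Promise<{ name: string; connectionString: string }>;
declare function deleteBranch(name: string): Promise<void>;
declare function knexFor(conn: string): any;
declare function runCatalogRefresh(db: any): Promise<void>;

// Each test gets a disposable, production-equivalent branch instead of mocks.
let branch: { name: string; connectionString: string };

beforeEach(async () => {
  // At 1.09 s for a 63 MB catalog, a branch per test is affordable.
  branch = await createTestBranch({ ttl: "600s" });
});

afterEach(async () => {
  await deleteBranch(branch.name); // or just let the ttl expire
});

test("catalog refresh upserts entities", async () => {
  const db = knexFor(branch.connectionString); // real Postgres, no mocks
  await runCatalogRefresh(db);
  const [{ count }] = await db("final_entities").count();
  expect(Number(count)).toBeGreaterThan(0);
});
```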

  10. Developer workflow rearrangement, not just feature addition. The post's before/after comparison names seven steps in the traditional cycle that collapse or disappear with branching, among them staging collisions, "works on my machine but breaks in staging", schema migrations that only surprise against real data, and mock maintenance. Replaced with: per-branch IDE database, per-PR CI branch + schema diff, per-QA-member destructive-test branch, post-merge clean-up. See concepts/integration-tests-against-real-database for the workflow pivot point and patterns/database-branch-per-test-over-mocking for the formalised shape. (Source: sources/2026-04-30-databricks-backstage-with-lakebase)

Systems / concepts / patterns extracted

Systems:

  • systems/lakebase — canonical system this POC deploys; third full-capability demonstration after CMK (2026-04-20) and LangGuard (2026-04-27).
  • systems/backstage — Spotify's open-source Internal Developer Portal; the canonical state-heavy application used as a migration stress-test here.
  • systems/pageserver-safekeeper — the Neon-lineage durable storage tier that makes instant branching + PITR work.
  • systems/databricks-postgres-cli — the databricks postgres generate-database-credential command, Lakebase's intended short-lived-JWT auth path.
  • systems/thoughtworks-technology-radar — the consulting-firm Technology Radar that endorsed Backstage as an IDP foundation, motivating this POC.
  • systems/postgresql — the wire protocol + semantics Lakebase preserves.

Concepts:

  • concepts/copy-on-write-storage-fork — the storage primitive behind both instant branching and PITR.
  • concepts/oauth-jwt-short-lived-credential — the scoped, short-lived credential model Lakebase expects in place of classic PATs.
  • concepts/point-in-time-recovery — canonicalised at Lakebase/Neon altitude by the 3.78-second end-to-end recovery.
  • concepts/wal-record-granularity — PITR targets snap backward to the nearest WAL record, not the caller's timestamp.
  • concepts/mock-object-maintenance-cost — the divergence and false-confidence costs that cheap branching makes avoidable.
  • concepts/integration-tests-against-real-database — the workflow pivot once production-equivalent branches are free.

Patterns:

  • patterns/credential-refresh-cron-as-auth-compat-shim — cron-refreshed JWTs bridging short-lived auth and long-lived-credential integrations.
  • patterns/branching-is-pitr-with-time-now — branching and PITR as one control-plane call differing only in the time parameter.
  • patterns/database-branch-per-test-over-mocking — disposable production-equivalent branches in place of mock objects.

Operational numbers

Metric                        Value                Notes
Backstage catalog size        ~63 MB               Full Backstage metadata graph in the POC
Branch creation (data plane)  1.09 s               Control-plane ack was instant
PITR end-to-end recovery      3.78 s               32 rows deleted, then recovered via branch
WAL-record snap-back          12 s                 Requested 22:56:02Z, got 22:55:50Z
Credential refresh cadence    every 50 min         Cron-based workaround for short-lived JWTs
Test-code savings claim       20-30% of test code  Mock objects across evaluated teams

No latency / throughput / concurrency numbers beyond these; the post is a workflow-transformation showcase, not a capacity benchmark.

Caveats

  • Tier-3 single-vendor POC with Thoughtworks as guest author. The 1.09-second and 3.78-second numbers are single-shot measurements in a development environment, not production-scale benchmarks. The post does not disclose variance across repeated runs, concurrent-branch-creation load, or geographic latency. The 20-30% mock-code claim is attributed to "our experience across multiple partner teams evaluating this workflow" with no count, methodology, or comparison group.
  • 63 MB is a developer-IDP-scale dataset. Whether the 1.09-second branching time scales linearly, sub-linearly, or cliff-edges at GB / TB dataset sizes is not disclosed. The copy-on-write architecture predicts near-constant time (fixed control-plane + metadata pointer work) but the POC does not verify this at larger scales.
  • PITR granularity is WAL-record-bounded and WAL-cadence-dependent. The 12-second snap-back is a function of Lakebase's WAL-write cadence in the POC's configuration; different workload intensities and configurations will produce different granularities. The post explicitly flags this as "an important caveat for time-sensitive recovery workflows."
  • Auth workaround (50-minute cron refresh) is a POC hack, not a production pattern. Databricks would likely guide production integrations to use the CLI directly or embed credential refresh into the application's connection layer. The .env-rewrite approach is a Thoughtworks-POC choice driven by Backstage's configuration-file expectations.
  • Cross-part series — this is Part 1. Part 2 (Governance) and Part 3 (FinOps) are referenced as forthcoming; their content is not in scope here. This ingest captures only the deployment-cycles + branching + PITR surface.
  • "Instant branches for performance tests, disposable branches for functional tests, running branch for UAT" workflow description is aspirational — the POC demonstrated branching and PITR but not the full multi-branch CI/CD topology. Later posts in the series may cover this.

Source

sources/2026-04-30-databricks-backstage-with-lakebase
