CONCEPT Cited by 1 source

Point-in-Time Recovery (PITR)

Definition

Point-in-Time Recovery (PITR) is the database capability of producing a fresh, queryable copy of the database as it was at a chosen past timestamp, typically for:

  • Incident undo: "we dropped the wrong table 4 minutes ago; give us a version from 5 minutes ago."
  • Accidental-delete recovery: restore specific rows wiped by a bad query, pipeline bug, or user error.
  • Forensic / audit: inspect historical state without running a transaction replay.
  • Dev / QA / staging: fork a past production state as a sandbox without the replay cost of a log-shipped standby.

The classical implementation — periodic full snapshots + WAL / binlog replay between the snapshot and the target time — is operationally expensive: restore a 2 TB snapshot to a new EBS volume (minutes to hours), then replay WAL forward to the target (more minutes). For a live incident this makes PITR the option of last resort rather than the first thing tried.
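A back-of-the-envelope sketch of why the classical shape scales with data volume. The throughput figures and the WAL volume are hypothetical, chosen only to illustrate the arithmetic; only the 2 TB snapshot size comes from the text above.

```python
def classical_pitr_seconds(snapshot_bytes, restore_bytes_per_s, wal_bytes, replay_bytes_per_s):
    """Rough wall-clock estimate: copy the snapshot, then replay WAL to the target."""
    restore = snapshot_bytes / restore_bytes_per_s
    replay = wal_bytes / replay_bytes_per_s
    return restore + replay


TB, GB, MB = 10**12, 10**9, 10**6

# Hypothetical inputs: 500 MB/s restore, 20 GB of WAL replayed at 100 MB/s.
estimate = classical_pitr_seconds(
    snapshot_bytes=2 * TB,
    restore_bytes_per_s=500 * MB,
    wal_bytes=20 * GB,
    replay_bytes_per_s=100 * MB,
)
print(f"{estimate / 60:.0f} minutes")  # 70 minutes
```

Both terms grow linearly with the amount of data to move or replay, which is the property the compute-storage-separated approach removes.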

PITR on compute-storage-separated substrates

On a compute-storage-separated substrate where the storage layer already keeps historical page versions + the WAL as durable shared state (for example Pageserver + Safekeeper on Neon / Lakebase), PITR collapses into a copy-on-write fork targeted at a past timestamp. No snapshot restore, no physical copy, no minutes-long wait. The operation becomes:

  1. Ask the storage layer to expose a logical view of the pages as they were at time t.
  2. Point a new compute instance at that logical view.
  3. Connect and query.

None of these steps copies the data, so PITR's wall-clock cost is dominated by the control-plane round-trip + compute boot rather than by the data volume.
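The three steps above can be sketched with a toy in-memory model. This is an illustration only, not how Pageserver or Safekeeper are implemented: PageStore, Branch, and fork_at are hypothetical names, timestamps are plain numbers, and a "page" is just a string.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class PageStore:
    """Toy stand-in for a storage layer that retains every page version."""
    times: dict = field(default_factory=dict)   # page_id -> sorted write times
    values: dict = field(default_factory=dict)  # page_id -> contents, same order

    def write(self, page_id, t, content):
        # Assumes writes for a page arrive in non-decreasing time order.
        self.times.setdefault(page_id, []).append(t)
        self.values.setdefault(page_id, []).append(content)

    def fork_at(self, t):
        # Step 1: expose a logical view as of t. Nothing is copied.
        return Branch(self, t)


@dataclass
class Branch:
    """Copy-on-write fork: reads fall through to shared history up to time t."""
    store: PageStore
    t: float
    overlay: dict = field(default_factory=dict)  # pages written on this branch only

    def read(self, page_id):
        if page_id in self.overlay:
            return self.overlay[page_id]
        ts = self.store.times.get(page_id, [])
        i = bisect.bisect_right(ts, self.t)  # versions written at or before t
        return self.store.values[page_id][i - 1] if i else None

    def write(self, page_id, content):
        self.overlay[page_id] = content


store = PageStore()
store.write("final_entities", 100, "32 rows")
store.write("final_entities", 200, "0 rows")  # the accidental wipe

recovery = store.fork_at(150)    # steps 2 and 3: compute would attach and query here
production = store.fork_at(200)
print(recovery.read("final_entities"))    # 32 rows
print(production.read("final_entities"))  # 0 rows
```

Creating the fork costs the same regardless of how many pages the store holds, and writes on the recovery branch land in its own overlay, so the two branches stay isolated.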

Canonical disclosure (Lakebase, 2026-04-30)

"Branching and Point-in-Time Recovery (PITR) are essentially the same primitive: branching is just PITR with source_branch_time = now." — Thoughtworks Backstage POC.

Measured: 3.78 seconds end-to-end from a wipe of the final_entities table (32 rows → 0) to a recovery branch with all 32 entities restored, while production itself was still at zero (branches are fully isolated). See sources/2026-04-30-databricks-backstage-with-lakebase.

Against the minutes-to-hours cost of the traditional snapshot-restore-plus-WAL-replay PITR shape, this is one to three orders of magnitude faster, and it reframes PITR from "last-resort disaster-recovery operation" to "routine undo-button you hit when a command goes wrong."

Target-time granularity (WAL-bounded)

PITR's target time is not honoured at caller-specified precision: it is bounded by the WAL-record cadence of the underlying store. Ask for 22:56:02Z and you get 22:55:50Z (12 seconds earlier) if the nearest earlier durable WAL record is at 22:55:50Z. This is the concepts/wal-record-granularity property: PITR always snaps backward to the nearest known durable state.

For time-sensitive workflows (e.g. "recover to just before the bad commit at T") this means the caller must request T − ε for sufficient margin, rather than T exactly. The Lakebase POC disclosed a 12-second snap-back as representative; different workload intensities + WAL-write cadences will produce different granularities.
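A minimal sketch of the snap-backward rule, assuming the store resolves a request to the latest WAL record at or before the requested time. The snap_back helper, the calendar date, and every record time other than 22:55:50Z are hypothetical.

```python
import bisect
from datetime import datetime, timedelta, timezone


def snap_back(wal_times, requested):
    """Return the latest WAL record time <= requested, or None if there is none."""
    i = bisect.bisect_right(wal_times, requested)
    return wal_times[i - 1] if i else None


def utc(h, m, s):
    # Arbitrary date; only the time of day matters for the example.
    return datetime(2026, 1, 1, h, m, s, tzinfo=timezone.utc)


# Sorted durable WAL record times (assumed).
wal_times = [utc(22, 55, 30), utc(22, 55, 50), utc(22, 56, 10)]

requested = utc(22, 56, 2)
actual = snap_back(wal_times, requested)
print(actual.time(), (requested - actual).total_seconds())  # 22:55:50 12.0

# If the bad commit is the record at T, asking for exactly T still includes it.
bad_commit = utc(22, 56, 10)
epsilon = timedelta(microseconds=1)
print(snap_back(wal_times, bad_commit) == bad_commit)          # True
print(snap_back(wal_times, bad_commit - epsilon) == actual)    # True
```

Under this inclusive rule, requesting T − ε is what keeps the bad commit out of the recovered state.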

Contrast: branching vs PITR

On a copy-on-write-capable substrate, the two are the same operation with a different source_branch_time:

| | Branch | PITR |
|---|---|---|
| Source time | now | past timestamp |
| Data content | current production state | state at past time |
| Mechanism | COW-fork at current head | COW-fork at historical head |
| Use case | pre-deploy test, dev sandbox, policy testing, agent operation | accidental-delete recovery, incident undo |
| Latency | 1.09 s (63 MB Backstage catalog) | 3.78 s (recovery + verify) |

See patterns/branching-is-pitr-with-time-now for the architectural unification.
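One way to picture the unification is a single fork call whose time argument defaults to now. This is a sketch, not the Lakebase or Neon API: create_fork, branch, and pitr are hypothetical names, and only source_branch_time is taken from the quote above.

```python
import time


def create_fork(source_branch, source_branch_time=None, clock=time.time):
    """Describe a COW fork; omitting the time means 'fork at the current head'."""
    if source_branch_time is None:
        source_branch_time = clock()
    return {"source_branch": source_branch, "source_branch_time": source_branch_time}


def branch(source_branch, clock=time.time):
    # Branching: PITR with source_branch_time = now.
    return create_fork(source_branch, clock=clock)


def pitr(source_branch, target_time, clock=time.time):
    # PITR: the same fork, aimed at a past timestamp.
    if target_time > clock():
        raise ValueError("PITR target must not be in the future")
    return create_fork(source_branch, target_time, clock=clock)


fixed_clock = lambda: 1000.0  # injected clock so the example is deterministic
print(branch("production", clock=fixed_clock))
print(pitr("production", 940.0, clock=fixed_clock))
```

The two wrappers differ only in the timestamp they pass, which is the whole of the Branch vs PITR contrast in the table.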

Prior art on classical substrates

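  • AWS RDS / Aurora: PITR supported via automated backups + transaction log retention. Recovery is restore-to-new-instance (fresh RDS instance provisioned, historical state materialised, minutes to hours depending on data volume). See systems/amazon-aurora.
  • MySQL / Postgres DIY: for Postgres, pg_basebackup + WAL archive; the operator restores the base backup onto a fresh cluster, sets restore_command and recovery_target_time, and starts the server in recovery mode so it replays archived WAL up to the target. For MySQL, the equivalent is a restored full backup plus binlog replay with mysqlbinlog up to the target time.
  • Traditional snapshot systems: storage-tier snapshot (EBS / ZFS / LVM) plus log-replay on top.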

All of these work; none of them is fast enough to be the first thing you try during a live incident. The compute-storage-separated substrate's advantage isn't that PITR becomes possible but that it becomes cheap enough to be routine.
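For the DIY Postgres path, the operator-facing part of a timestamp-targeted recovery is a handful of settings. The sketch below renders them as postgresql.conf lines; the recovery_settings helper and the archive path are hypothetical, and the cp-based restore_command assumes WAL segments were archived as plain files in that directory.

```python
from datetime import datetime, timezone


def recovery_settings(archive_dir, target_time):
    """Render postgresql.conf lines for a timestamp-targeted archive recovery."""
    if target_time.tzinfo is None:
        raise ValueError("target_time must be timezone-aware")
    stamp = target_time.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S+00")
    return [
        # %f and %p are placeholders Postgres fills in (file name, destination path).
        f"restore_command = 'cp {archive_dir}/%f %p'",
        f"recovery_target_time = '{stamp}'",
        "recovery_target_action = 'promote'",
    ]


lines = recovery_settings(
    "/mnt/archive",  # hypothetical archive location
    datetime(2026, 4, 30, 22, 55, 50, tzinfo=timezone.utc),
)
print("\n".join(lines))
```

On PostgreSQL 12 and later, the server enters targeted recovery when it starts with these settings and an empty recovery.signal file in the restored data directory. The time spent restoring the base backup and replaying WAL is what keeps this path in the minutes-to-hours range.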

Seen in

  • sources/2026-04-30-databricks-backstage-with-lakebase — canonical first wiki instance of PITR at Lakebase / Neon altitude. Thoughtworks POC demonstrates 3.78-second end-to-end recovery from a 32-row-deletion incident, with 12-second WAL-snap-back granularity disclosed. Canonicalises the branching ≡ PITR-with-time-now architectural unification + the "every incident gets an undo" framing that follows from PITR at sub-10-second latencies.