CONCEPT

Distributed transactions

Distributed transactions are atomic operations that span multiple database rows, shards, or nodes — committing all parts or none across machine boundaries. They are among the most-requested missing features in NoSQL stores and one of the load-bearing reasons cited in storage-engine deprecations.

Definition

A distributed transaction provides ACID semantics — atomicity, consistency, isolation, durability — across multiple participants that do not share memory or a single write-ahead log. Typical implementations:

  • Two-phase commit (2PC) — a coordinator orchestrates a prepare phase followed by a commit phase; participants promise durability in prepare, then commit on the coordinator's go-ahead.
  • Percolator / Omid — snapshot-isolation-based; uses a central timestamp oracle and per-row primary/secondary lock cells in the data layer. Apache Omid implements this on HBase.
  • Calvin / deterministic — pre-order all transactions globally and execute deterministically on each replica.
  • TrueTime / Spanner — bounded clock uncertainty enables external consistency without a central oracle.
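Of the schemes above, 2PC is the simplest to sketch. The following is a minimal illustration against a hypothetical `Participant` interface (not any real library's API); a production coordinator would also need a durable decision log and timeouts:

```python
from typing import Protocol

class Participant(Protocol):
    """Hypothetical participant interface for illustration."""
    def prepare(self, txn_id: str) -> bool: ...  # persist changes, vote yes/no
    def commit(self, txn_id: str) -> None: ...
    def abort(self, txn_id: str) -> None: ...

def two_phase_commit(txn_id: str, participants: list["Participant"]) -> bool:
    # Phase 1 (prepare): every participant must vote yes and promise
    # durability before anyone is allowed to commit.
    prepared = []
    for p in participants:
        if p.prepare(txn_id):
            prepared.append(p)
        else:
            # A single "no" vote aborts everyone who already prepared.
            for q in prepared:
                q.abort(txn_id)
            return False
    # Phase 2 (commit): the decision is now final; tell everyone.
    for p in participants:
        p.commit(txn_id)
    return True
```

The prepare phase is the extra round-trip 2PC imposes on every transaction, and the gap between the two phases is where coordinator failure hurts: prepared participants cannot unilaterally decide.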

Why missing distributed transactions is load-bearing

NoSQL stores that provide only per-row or per-partition atomicity force the application layer to implement multi-row consistency by hand. The canonical failure mode:

"The lack of distributed transactions in HBase led to a number of bugs and incidents of Zen, our in-house graph service, because partially failed updates could leave a graph in an inconsistent state. Debugging such problems was usually difficult and time-consuming, causing frustration for service owners and their customers." (Source: sources/2024-05-14-pinterest-hbase-deprecation-at-pinterest)

Graph updates — "add edge A → B" — typically touch both vertex rows and an edge row. Without distributed transactions, a crash or partial failure mid-update can leave the graph in an inconsistent state (edge exists but target vertex doesn't, or vice versa). Application code can try to detect and repair, but that becomes a parallel consistency-maintenance codebase — the "consistency-by-application" anti-pattern.
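The failure window can be made concrete with a toy dict-backed store (hypothetical; it stands in for per-row-atomic writes in a store like HBase):

```python
# Toy per-row-atomic store: each assignment below is atomic on its own,
# but nothing spans the three writes.
vertices: dict[str, dict] = {}
edges: dict[tuple[str, str], dict] = {}

def add_edge(src: str, dst: str, crash_after_edge: bool = False) -> None:
    """Add edge src -> dst as three independent per-row writes."""
    vertices.setdefault(src, {})            # write 1: source vertex row
    edges[(src, dst)] = {"created": True}   # write 2: edge row
    if crash_after_edge:                    # simulated mid-update crash
        raise RuntimeError("node died")
    vertices.setdefault(dst, {})            # write 3: target vertex row

def dangling_edges() -> list[tuple[str, str]]:
    """The repair scan that application code ends up writing by hand."""
    return [(s, d) for (s, d) in edges if d not in vertices]
```

A crash between writes 2 and 3 leaves an edge whose target vertex does not exist; `dangling_edges` is exactly the kind of detect-and-repair code that grows into the parallel consistency-maintenance codebase described above.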

Why it's hard

  • Network partitions. Participants can fail, disappear, or be slow; the coordinator has to handle every combination.
  • Clock skew. Snapshot-isolation schemes need either a central oracle (bottleneck) or bounded clock uncertainty (infra-level).
  • Performance. Cross-shard coordination is expensive — 2PC adds at least one extra round-trip (the prepare phase) to every transaction; Percolator-style schemes add lock contention under hot keys.
  • Failure recovery. Coordinator failure mid-commit is the classic 2PC pain point.
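The hot-key contention and orphaned-lock problems show up even in a toy Percolator-style write path (hypothetical in-memory store; a real implementation also needs a timestamp oracle and roll-forward/roll-back of locks left by crashed writers):

```python
# Each row carries its own lock cell; one row per transaction is the
# primary, and secondaries point at it so recovery has a single place
# to look when a writer crashes mid-transaction.
rows: dict[str, dict] = {}  # key -> {"lock": primary_key | None, "data": ...}

def percolator_write(writes: dict[str, str], start_ts: int) -> bool:
    keys = sorted(writes)
    primary = keys[0]
    # Prewrite: take the lock cell on every row. Any row already locked
    # by another transaction is a conflict — hot keys fail here often.
    for k in keys:
        row = rows.setdefault(k, {"lock": None, "data": None})
        if row["lock"] is not None:
            return False
        row["lock"] = primary  # secondaries record who the primary is
    # Commit: write data and clear the locks, primary first — clearing
    # the primary's lock is the atomic commit point.
    for k in keys:
        rows[k]["data"] = (writes[k], start_ts)
        rows[k]["lock"] = None
    return True
```

This is a deliberately simplified sketch: it omits commit timestamps, conflict checks against newer writes, and cleanup of orphaned locks, which is where most of the real complexity lives.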

Canonical bolt-on: Omid on HBase

Pinterest's Sparrow service implemented distributed transactions over HBase by layering Apache Phoenix + Omid on top. This is the canonical example of the complexity-tax axis of patterns/nosql-to-newsql-deprecation — the substrate lacks the feature, so the org builds a bolt-on service to provide it, and that bolt-on is its own maintenance burden.

Seen in

  • sources/2024-05-14-pinterest-hbase-deprecation-at-pinterest — canonical wiki example of the real-world cost of missing distributed transactions: graph-service inconsistency incidents, difficult debugging, bolt-on Sparrow-on-Omid architecture to paper over the gap. Named as axis 2 of the five-reason deprecation framework.