Skip to content

CONCEPT Cited by 1 source

Asynchronous cascade delete

Asynchronous cascade delete is a strategy for removing a node (or other parent record) from a high-fanout graph store where:

  • The node may have thousands to millions of connected edges that must be cleaned up to maintain graph integrity.
  • A synchronous delete that waits for all edge removals would blow the caller's latency budget.

The strategy: delete the node record immediately and queue the connected-edge cleanup for asynchronous processing. The caller sees a fast successful response; the system continues removing the dangling edges in the background.

The catch — and the load-bearing structural argument behind the correctness — is that concurrent updates during the in- flight delete can race with the cascade. Last-Write-Wins is the primitive that makes the race resolve correctly.

The structural problem

If a node N is being deleted asynchronously while a client issues update(N, prop=v) concurrently, two timelines are possible:

  1. The update happened-before the delete in real time. The user expects the update's effect to be wiped (delete won).
  2. The update happened-after the delete in real time. The user expects the update's effect to survive — but actually they expect the delete to win because the update was issued against a node that was already meant to be deleted.

A naive ordering by physical arrival could resurrect the deleted state: the cascade-delete pipeline sweeps the edges, but a late- arriving update at a non-cascaded edge re-creates the relationship.

The fix — LWW with timestamped idempotency tokens

Idempotency tokens carry the timestamp at which the operation was intended; LWW resolves concurrent writes to a record by picking the highest timestamp. The cascade-delete pipeline issues per-edge deletes carrying the delete intent's timestamp, not the wall-clock time of when the cascade fired. Any concurrent update with a later intent timestamp wins (and the user's update is preserved on edges the cascade has not yet touched). Any concurrent update with an earlier intent timestamp loses to the delete.

Verbatim from Netflix Graph Abstraction Part I:

"Deleting nodes in a highly connected graph is more complex than simply removing a KV record as each node may have thousands of connected edges that must be handled to maintain graph integrity. Further, synchronously deleting all such connections would introduce unacceptable latency for the Abstraction callers. The Abstraction employs an asynchronous deletion strategy to manage this issue. The consequence of this approach, however, is that the observed mutated state is only eventually consistent. Further, to ensure correctness of asynchronous deletes during concurrent updates, the Last-Write-Wins (LWW) conflict resolution mechanism is essential."

Mechanism

caller: delete(N, t = T_d)
graph layer:
  1. write delete-intent record for N at timestamp T_d
  2. enqueue cascade-delete job for N at timestamp T_d
  3. respond OK to caller (sub-second)
async cascade worker:
  for each edge e adjacent to N:
    issue delete(e, t = T_d)        # carrying the delete intent timestamp
                                    # not the worker's clock
  for each property p of N:
    issue delete(p, t = T_d)
LWW resolution at storage layer:
  - any concurrent update with t < T_d → loses to delete
  - any concurrent update with t > T_d → wins (operation against
    a node that was already deleted; semantics handled by
    application above)

Observable properties

  • Latency for the caller — fast, near-immediate (Netflix reports "sub-second" typical for the full async-cascade trajectory).
  • Eventual graph integrity — every connected edge eventually removed; the system may briefly serve queries that walk into the to-be-deleted node's neighbourhood.
  • Concurrency safety — LWW resolves all races deterministically.

Why naïve sync-delete fails

Failure mode What it costs
Synchronous delete of a hub with millions of edges Tens of seconds → minutes; calling thread blocked; downstream services time out
Sync delete with quorum ack across regions Quorum-RTT × number-of-edges; impractical at scale
Sync delete that fails midway Partial state; need explicit cleanup or retry; no LWW means resurrection risk on retry

When async cascade is wrong

  • Hard deletion deadlines. Regulatory data deletes (GDPR, CCPA) sometimes require deletion to be complete by a fixed deadline; an unbounded async cascade fails this. Mitigation: upper-bound the async pipeline + alert on stalls.
  • Strong consistency on connected reads. A client that just saw a delete and immediately walks the graph from a neighbour expects to not see the deleted node. Async cascade cannot guarantee this until convergence; mitigation: filter on the delete-intent record at read time, not just on the edge state.

Seen in

Last updated · 542 distilled / 1,571 read