Skip to content

PATTERN Cited by 1 source

Asynchronous cascade delete for high-fanout graph nodes

Asynchronous cascade delete for high-fanout graph nodes is a graph-database deletion pattern: when a node with potentially millions of connected edges is deleted, the system acks the delete to the caller immediately, then asynchronously removes the connected edges and any direction-paired adjacency-list entries. LWW on timestamped idempotency tokens is the load-bearing primitive that makes the pattern correct against concurrent updates.

Why synchronous cascade delete fails

A property graph hub node with millions of connected edges forces a synchronous cascade delete to:

  • Issue millions of edge deletes before responding to the caller.
  • Block the caller for tens of seconds to minutes, blowing every reasonable API SLA.
  • Coordinate atomicity across multi-namespace deletes (forward link, reverse link, edge properties) — without a distributed transaction substrate, partial cascades leave dangling state.
  • Risk timing out anyway, leaving the graph in a partially- deleted state that's worse than no delete at all.

For OLTP-graph workloads the sync option is non-viable. Verbatim from Netflix Graph Abstraction Part I:

"Deleting nodes in a highly connected graph is more complex than simply removing a KV record as each node may have thousands of connected edges that must be handled to maintain graph integrity. Further, synchronously deleting all such connections would introduce unacceptable latency for the Abstraction callers."

Structure

┌─────────────────────────────────────────────────────────────┐
│ caller: delete(node N) at intent time T_d                   │
└────────────────────────┬────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ graph layer (synchronous portion)                           │
│  1. write delete-intent record for N at idempotency_token=T_d│
│  2. enqueue async cascade job for N at T_d                  │
│  3. ack to caller (sub-second)                              │
└────────────────────────┬────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ async cascade worker                                        │
│  for each forward edge e adjacent to N:                     │
│    forward_link.delete(e, idempotency_token=T_d)            │
│    reverse_link.delete(e_reverse, idempotency_token=T_d)    │
│    edge_property.delete(direction_agnostic_id, t=T_d)       │
│  for each property p of N:                                  │
│    node.property.delete(p, idempotency_token=T_d)           │
│                                                             │
│  on any partial failure: republish to Kafka entropy-repair  │
│  queue → consumer retries until convergence                 │
└─────────────────────────────────────────────────────────────┘
        Eventually-consistent fully-cleaned-up state
        (typically sub-second total trajectory per Netflix)

LWW makes the race resolve correctly

The structural challenge: a concurrent update against the to-be- deleted node, racing the in-flight cascade, must not resurrect the deleted state.

Verbatim from the post:

"Further, to ensure correctness of asynchronous deletes during concurrent updates, the Last-Write-Wins (LWW) conflict resolution mechanism is essential."

The mechanism: every per-edge delete in the cascade carries the delete-intent timestamp T_d, not the wall-clock time of when the cascade fired. Three race outcomes resolve as follows:

Race outcome Update timestamp T_u vs T_d LWW result
Update happened-before delete T_u < T_d Delete wins; update is wiped
Update concurrent with delete in real time, but issued before delete intent T_u < T_d Delete wins (deterministic ordering by token timestamp)
Update issued after delete intent (semantically should fail at the application level) T_u > T_d Update wins on the edge; application is responsible for treating "update on deleted node" as an error

The third case is the application's responsibility; the graph substrate provides only the ordering invariant. Netflix doesn't detail how the application surfaces this; LWW only guarantees deterministic convergence, not semantically-correct convergence for every application.

Disclosed observed latency

"Asynchronous operations such as node deletions can be slightly latent, but typically perform with sub-second latency."

So the full async cascade trajectory (caller-ack to fully converged state across all connected edges) typically completes sub-second. The post does not disclose worst-case latency for the most extreme hub nodes (millions of edges).

Composition

The pattern composes with:

When to use

  • Graph workloads with high-fanout hub nodes.
  • Caller-facing APIs with sub-second latency budgets on delete operations.
  • Graphs where eventual consistency is acceptable (the wiki's default OLTP-graph framing).

When not to use

  • Hard regulatory deletion deadlines (GDPR / CCPA complete by T+24h) where unbounded async latency could fail compliance. Mitigation: bound the async pipeline + alert on stalls.
  • Graphs where strong consistency on connected reads is required (a client that just deleted X expects to see no references to X immediately, even on traversals from unrelated starting nodes).
  • Graphs with strict atomicity guarantees from the substrate (rare for KV-substrate property graphs).

Risks

  • Unbounded cascade tails. The post says "typically sub- second"; worst-case fanout is undisclosed. Operationally hub-node deletes need their own latency dashboards.
  • In-flight visibility. A reader that walks into a partially- cascaded subgraph briefly sees half-deleted state. Mitigation: filter on the delete-intent record, not just on the per-edge state, so the reader knows to skip deleted nodes' neighbours.
  • Async pipeline backlog. A spike of high-fanout deletes saturates the cascade workers and lengthens trail latency. Standard back-pressure / queue-depth observability applies.

Seen in

Last updated · 542 distilled / 1,571 read