CONCEPT Cited by 2 sources
Zero-copy data sharing protocol¶
Definition¶
A zero-copy data sharing protocol lets a recipient (another organisation, team, or region) read live tables in the provider's storage without copying the physical data first. The provider does not produce a snapshot, does not ship files, does not run a periodic sync job; the recipient's queries hit the provider's object store directly, mediated by a sharing control plane that handles authorisation, auditing, and format translation.
The canonical modern instance is systems/delta-sharing — Databricks' open protocol (2021) that exposes Delta tables to authorised recipients over a REST API, with the physical data remaining in the provider's object store.
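As a concrete illustration, a minimal recipient-side read with the open-source delta-sharing Python connector might look like the sketch below. The profile file name, share, schema, and table names are hypothetical; the connector API itself (`load_as_pandas`) is real.

```python
# Minimal recipient-side read via the open delta-sharing Python connector.
# Assumes the provider has issued "partner.share", a small profile file
# containing the sharing-server endpoint and a bearer token.
import delta_sharing

# A shared table is addressed as <profile>#<share>.<schema>.<table>.
# These names are placeholders for illustration.
table_url = "partner.share#sales_share.retail.orders"

# The connector calls the provider's REST endpoint, receives presigned
# references to the underlying Parquet files, and reads them directly
# from the provider's object store -- no snapshot is copied first.
df = delta_sharing.load_as_pandas(table_url)
print(df.head())
```

Recent releases of the connector also accept `version` / `timestamp` arguments on `load_as_pandas`, which is how the time-travel property discussed below surfaces to recipients.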
Distinction from intra-process zero-copy¶
Do not confuse with concepts/zero-copy-sharing — that concept covers in-process / on-host zero-copy (Arrow buffers + shared memory between Ray tasks on one node; no serialisation between Python and C++). This concept is at the cross-organisation / cross-cloud boundary — the "zero-copy" property is about avoiding whole-dataset transfers across trust boundaries, not avoiding serialisation between colocated processes.
The two concepts share the naming because they both attack the same underlying tax (the cost of materialising and moving data), but they operate at completely different scales (nanoseconds vs minutes-to-hours) and solve different problems (CPU tax vs storage-duplication + staleness tax).
Why it matters for partner data sharing¶
Before zero-copy protocols, sharing analytical data with external partners typically required one of:
- SFTP / CSV drops — provider generates a snapshot, pushes a file to partner's drop zone. Each drop is a physical copy. Partner has a stale duplicate until next drop.
- S3 buckets with shared keys — similar duplication; shared credentials are a security liability.
- REST APIs — not designed for bulk / analytical workloads; pagination + rate limits turn a 1 TB pull into hours / days.
- Self-service reports / UI downloads — partner-pulled CSVs with the same duplication + staleness tax.
All four paths have a storage-duplication tax (provider-side storage + partner-side storage for the copy) and a synchronisation tax (the copy is stale between refreshes; a sync job bridges the gap and must be operated).
Zero-copy protocols eliminate both taxes: the data stays in the provider's store, partners query it live, and fresh data is available as soon as the provider commits it.
Load-bearing properties¶
- One copy of the truth. The provider's Delta table is what the partner reads. No diverging snapshots.
- Incremental semantics for scale. On the Delta Sharing wire, a query response is a list of presigned Parquet file references (the request can carry predicate hints so the server prunes files before responding). Partners who cache locally pull only new / changed files — not a whole re-snapshot. This is what makes a 60 TB share feasible; without incremental semantics, every sync would be a 60 TB transfer. See the wire-level sketch after this list.
- Natural time travel. Because the underlying format (Delta Lake) supports time travel, zero-copy shares can offer point-in-time reads without re-materialising old snapshots.
- Open client ecosystem. Because the protocol is open, partners can use Spark / pandas / Power BI / Tableau / Excel clients against the same share — no client-ecosystem fork.
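To make the incremental-semantics point concrete, here is a hedged sketch of the wire exchange behind a single table read, following the open Delta Sharing protocol. The endpoint URL, token, share / schema / table names, and predicate are placeholders, and exact response fields may vary by server version.

```python
# Sketch of the Delta Sharing query exchange; names are hypothetical.
import json
import requests

endpoint = "https://sharing.provider.example/delta-sharing"  # placeholder
token = "<recipient-bearer-token>"  # placeholder

# One POST per table read; predicate hints let the server prune files.
resp = requests.post(
    f"{endpoint}/shares/sales_share/schemas/retail/tables/orders/query",
    headers={"Authorization": f"Bearer {token}"},
    json={"predicateHints": ["date >= '2024-01-01'"], "limitHint": 1000},
)

# The response is newline-delimited JSON: a protocol line, a metadata line,
# then one line per Parquet file, each carrying a short-lived presigned URL.
# A client that caches locally diffs this file list against what it already
# holds and fetches only new or changed files.
for line in resp.text.splitlines():
    action = json.loads(line)
    if "file" in action:
        print(action["file"]["url"], action["file"].get("size"))
```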
Access-control model¶
Zero-copy does not mean "open access". The sharing control plane (e.g. systems/unity-catalog for Delta Sharing) enforces:
- Recipient-level identity — each partner has a Recipient (digital identity) with its own credentials.
- Share-level scope — the Recipient is granted access only to specific Shares (logical containers of tables).
- Per-column / per-row predicates — optionally, a Share can expose a filtered view rather than a full table.
- Full audit trail — every read the recipient issues is logged at the control plane.
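From the recipient's side, share-level scoping can be sketched with the same Python connector: the recipient can enumerate only what it has been granted. The profile file name is a placeholder, and this assumes the provider has granted at least one share.

```python
# Recipient-side view of share-level scope via the delta-sharing connector.
# "partner.share" is a hypothetical profile file issued by the provider.
import delta_sharing

client = delta_sharing.SharingClient("partner.share")

# Only shares granted to this Recipient are visible; everything else on
# the provider side simply does not appear.
for share in client.list_shares():
    print("share:", share.name)

# list_all_tables() walks share -> schema -> table across all grants.
for table in client.list_all_tables():
    print(f"{table.share}.{table.schema}.{table.name}")
```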
Trade-offs¶
- Provider pays the read cost. Every query a partner issues lands on the provider's side — sharing-server compute plus object-store reads. Legacy SFTP/CSV pushed that cost to the partner's side after the drop. In managed deployments (e.g. Databricks' Delta Sharing service), the provider can set read quotas / throttling per Recipient.
- Cross-cloud / cross-region egress. Live reads across cloud boundaries pay cross-cloud egress per read. Mitigation pattern: pair the zero-copy share with a bulk replica cache (patterns/cross-cloud-replica-cache) — the partner maintains a local Delta clone refreshed periodically via Deep Clone, and queries hit the local clone. See the sketch after this list.
- Live reads surface provider schema changes immediately. Without coordination, a provider-side schema change breaks partners. Requires versioning discipline (concepts/backward-compatibility) in the share contract.
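The replica-cache mitigation might look like the following PySpark sketch, assuming a Databricks environment (where DEEP CLONE is available) and a Databricks-to-Databricks share mounted as a catalog on the consumer side; catalog and table names are hypothetical.

```python
# Replica-cache sketch: refresh a local Delta clone of the shared table,
# then serve consumer queries from the clone, so cross-cloud egress is
# paid once per refresh instead of once per read.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Re-running CREATE OR REPLACE ... DEEP CLONE is incremental: only files
# added or changed in the source since the last refresh are copied.
spark.sql("""
    CREATE OR REPLACE TABLE local_catalog.retail.orders_clone
    DEEP CLONE provider_share_catalog.retail.orders
""")

# Downstream workloads query the local clone, not the live share.
spark.sql("SELECT COUNT(*) FROM local_catalog.retail.orders_clone").show()
```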
Seen in¶
- sources/2025-07-07-zalando-direct-data-sharing-using-delta-sharing-introduction-our-journey-to-empower-partners — Zalando Partner Tech chooses Delta Sharing explicitly because of zero-copy semantics. Load-bearing framing: "Partners can work with live datasets without the overhead of constant data transfers." Replaces a fragmented landscape of SFTP / CSV / self-service reports / REST APIs that was costing partners ~1.5 FTE per month in extraction overhead.
- sources/2026-04-20-databricks-mercedes-benz-cross-cloud-data-mesh — Mercedes-Benz uses Delta Sharing across three trust boundaries (cross-cloud AWS↔Azure, cross-region, external partners); zero-copy property is load-bearing for the cross-cloud case, paired with Deep Clone for egress-bounded consumers (patterns/cross-cloud-replica-cache).
Related¶
- systems/delta-sharing — canonical implementation on the wiki.
- systems/unity-catalog — governance / control plane that authorises and audits zero-copy reads.
- systems/delta-lake — underlying table format that provides the incremental + time-travel properties zero-copy sharing exploits.
- concepts/zero-copy-sharing — sibling concept at a different scale (in-process / on-host zero-copy via Arrow).
- concepts/data-lakehouse — the architectural style Delta Sharing is built for.
- patterns/open-protocol-over-proprietary-exchange — the openness property is often paired with zero-copy when the goal is cross-organisation sharing.
- patterns/cross-cloud-replica-cache — mitigation pattern for the cross-cloud egress cost of live zero-copy reads.
- patterns/recipient-per-partner-share-per-dataset-group — deployment primitive that exposes a zero-copy share to external recipients.