
VF Match¶

VF Match (vfmatch.org) is Virtue Foundation's production volunteer-matching marketplace: the substrate that connects medical professionals to volunteer opportunities in 72 low- and lower-middle-income countries. The platform's value depends on a comprehensive catalog of healthcare facilities and NGOs across those countries. No single public dataset provides that catalog, so VF Match builds and maintains it via the Foundational Data Refresh (FDR) pipeline on Databricks.

Stub page. First wiki disclosure 2026-05-20 via the Databricks co-marketing post.

Mission context¶

Virtue Foundation is a non-profit focused on global health delivery and on creating an efficient marketplace for global philanthropic healthcare. To date, it has delivered care to 50,000+ patients, with a special focus on Ghana and Mongolia. VF Match is the matching layer between supply (medical volunteers) and demand (under-resourced facilities in 72 countries).

Architecture overview (production system)¶

VF Match's data layer is the Foundational Data Refresh (FDR) — the LLM-extraction + entity-resolution pipeline that aggregates healthcare-facility and NGO records from web-scale sources into a catalog the matching system queries.

Data sources (two complementary)¶

Overture Maps: open geospatial dataset from the Overture Maps Foundation, whose founding members include Meta and Microsoft (alongside Amazon and TomTom). Provides authoritative locations for healthcare facilities. Deduplicated, cross-validated geometric ground truth; the "where" anchor.
Bright Data — industrial web-scraping infrastructure. Captures real-time information from facility / NGO web pages. The "what" + "now" anchor — current operating status, specialties offered, equipment.

Pipeline shape (FDR)¶

Ingestion — Overture Maps geospatial records + Bright Data scraped pages land as raw inputs.
Multi-step LLM extraction (concepts/multi-step-llm-extraction) — instead of one-shot extraction, the pipeline decomposes into targeted GPT calls: classify medical relevance, identify organisation type (facility vs NGO), then extract specialties / equipment / procedures. 25M+ web pages processed.
Status-based checkpointing — each record's processing state is tracked in a star schema, so re-runs resume from the failure point without paying the LLM cost of already-processed rows.
Configurable extraction registry — each extraction method is a structured object (system prompt + extraction schema), making extraction logic modular, reproducible, and extensible.
Entity resolution via Splink — the same facility shows up in multiple sources with name / address / contact variations; Splink's probabilistic record linkage produces a unified key per facility.
Orchestration via Lakeflow Jobs — 15+ interdependent tasks with conditional branching, parallel execution, intelligent retry policies.
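
The extraction registry, the three-step decomposition, and status-based checkpointing can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not the FDR implementation: the `ExtractionMethod` fields, the prompts, the status values, and the `fake_llm` stand-in are all hypothetical, and an in-memory list of dicts stands in for the star-schema status tracking the post describes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ExtractionMethod:
    """One registry entry: a system prompt plus the schema it should fill."""
    name: str
    system_prompt: str
    schema: dict


# Hypothetical registry; prompts and schema fields are illustrative only.
REGISTRY = {
    "relevance": ExtractionMethod(
        "relevance", "Is this page about healthcare delivery?",
        {"is_medical": "bool"}),
    "org_type": ExtractionMethod(
        "org_type", "Is the organisation a facility or an NGO?",
        {"org_type": "str"}),
    "details": ExtractionMethod(
        "details", "List specialties, equipment, and procedures.",
        {"specialties": "list", "equipment": "list", "procedures": "list"}),
}

# Assumed status values; the post does not name them.
PENDING, DONE, SKIPPED, FAILED = "pending", "done", "skipped", "failed"

LLM = Callable[[ExtractionMethod, str], dict]


def run_pipeline(records: list, llm: LLM) -> int:
    """Process rows in place and return the number of LLM calls attempted."""
    calls = 0
    for rec in records:
        if rec["status"] in (DONE, SKIPPED):
            continue  # checkpoint hit: settled rows incur no LLM cost
        try:
            calls += 1
            relevance = llm(REGISTRY["relevance"], rec["text"])
            if not relevance["is_medical"]:
                rec["status"] = SKIPPED  # cheap gate: stop after one call
                continue
            calls += 1
            rec["org_type"] = llm(REGISTRY["org_type"], rec["text"])["org_type"]
            calls += 1
            rec["details"] = llm(REGISTRY["details"], rec["text"])
            rec["status"] = DONE
        except Exception:
            rec["status"] = FAILED  # picked up again on the next run


    return calls


def fake_llm(method: ExtractionMethod, text: str) -> dict:
    """Keyword-based stand-in for a real model call, for demonstration."""
    if method.name == "relevance":
        return {"is_medical": "clinic" in text.lower()}
    if method.name == "org_type":
        return {"org_type": "facility"}
    return {"specialties": [], "equipment": [], "procedures": []}
```

In this sketch an irrelevant page costs one call rather than three, and a second `run_pipeline` over the same rows makes no calls at all for rows already marked done or skipped. Only failed or pending rows are paid for again, which is the property that makes re-runs cheap.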

Performance milestones¶

The Splink ER stage hit the canonical straggler-partition problem on Spark: the slowest partition took 30 minutes against a median of 52 seconds.
Enabling Photon (Databricks' vectorised query engine) cut the worst-case partition time to ~2 minutes, roughly a 15× improvement.
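
The ratios behind these figures can be checked directly. The snippet below uses only the three numbers quoted above; the variable names are illustrative, and the post-Photon figure is approximate in the source, so the speedup is approximate too.

```python
# Figures quoted from the post; the ~2 minute value is approximate.
worst_before_s = 30 * 60  # slowest partition before Photon, in seconds
median_s = 52             # median partition time before Photon, in seconds
worst_after_s = 2 * 60    # slowest partition with Photon, in seconds

# Straggler ratio: how much longer the slowest partition ran than the median.
skew_before = worst_before_s / median_s

# Improvement on the worst-case partition.
speedup = worst_before_s / worst_after_s

print(f"skew before Photon: {skew_before:.1f}x, worst-case speedup: {speedup:.0f}x")
```

A Spark stage finishes only when its last task finishes, so the slowest partition, not the median, bounds the stage's wall-clock time. The post does not give a post-Photon median, so the skew ratio after the change cannot be derived from these numbers.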

Forward-looking layer (prototype)¶

VF Agent is a multi-agent natural-language query system on top of FDR. It is built in LangGraph with four named agents (Medical Specialty Extractor, Multi-Agent Supervisor, Vector Search Agent, and Genie Agent) so that healthcare professionals can ask questions such as "find me orthopedic volunteer opportunities in Ghana with X-ray equipment available" and get matches without needing SQL skills.
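
The supervisor pattern can be illustrated in plain Python. This is a sketch only and deliberately does not use LangGraph: the post names the four agents but does not describe how the supervisor routes, so the routing rule here (aggregate-style questions go to the Genie agent, everything else to vector search), the specialty vocabulary, and every function name are assumptions.

```python
from typing import Optional

# Hypothetical vocabulary; the real extractor's behaviour is not disclosed.
SPECIALTIES = ("orthopedic", "cardiology", "pediatric")

# Assumed cue phrases for questions that look like aggregate queries.
AGGREGATE_HINTS = ("how many", "average", "total")


def extract_specialty(question: str) -> Optional[str]:
    """Stand-in for the Medical Specialty Extractor."""
    q = question.lower()
    return next((s for s in SPECIALTIES if s in q), None)


def vector_search_agent(question: str, specialty: Optional[str]) -> str:
    """Stand-in for semantic retrieval over the FDR catalog."""
    return f"vector_search:{specialty}"


def genie_agent(question: str, specialty: Optional[str]) -> str:
    """Stand-in for a natural-language-to-SQL agent."""
    return f"genie:{specialty}"


def supervisor(question: str) -> str:
    """Extract the specialty, then route the question to one downstream agent."""
    specialty = extract_specialty(question)
    q = question.lower()
    if any(hint in q for hint in AGGREGATE_HINTS):
        return genie_agent(question, specialty)
    return vector_search_agent(question, specialty)
```

Under these assumptions, the example question above is routed to the vector-search stand-in with `orthopedic` as the extracted specialty, while a question beginning "How many ..." goes to the Genie stand-in.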

Architectural lessons VF Match canonicalises¶

Multi-step LLM extraction beats one-shot at production scale. "This approach dramatically reduces token consumption while focusing each model invocation on a narrow, high-precision task." See patterns/multi-step-llm-extraction-pipeline.
Status-column checkpointing is the idempotency primitive for per-record-LLM-cost pipelines. The dollar cost of re-extracting rows is what makes resumability load-bearing — see concepts/status-based-llm-pipeline-checkpointing.
Open-source ER frameworks (Splink) make non-profit-scale ER feasible. Closed-source ER stacks would have created a per-record-licence cost structure incompatible with the funding model.
Photon vectorisation absorbs ER-shaped skew, cutting the worst-case partition by roughly 15×. The 30 min → ~2 min observation is the wiki's first quantified instance of systems/photon applied to ER pairwise-comparison skew, as distinct from earlier Photon mentions in OLAP-query contexts.
Two complementary data sources (geospatial authority + web evidence) resolve healthcare-facility ground truth. Either alone is insufficient: Overture lacks operational details; scraped pages lack authoritative geometry. Joining them via Splink ER produces records anchored on both axes.
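
Probabilistic record linkage of the kind Splink performs follows the Fellegi-Sunter model: each compared field contributes a match weight of log2(m/u) when it agrees and log2((1-m)/(1-u)) when it does not, where m and u are the probabilities of agreement among true matches and among non-matches. The standard-library sketch below illustrates that idea and the "unified key per facility" outcome. It does not use Splink's API, and the m/u values, prior odds, similarity cutoff, and threshold are illustrative assumptions rather than estimated parameters.

```python
import math
from difflib import SequenceMatcher

# field -> (m, u): P(agree | same entity), P(agree | different entities).
# Illustrative assumptions, not values estimated from data.
PARAMS = {"name": (0.9, 0.05), "city": (0.95, 0.2), "phone": (0.8, 0.001)}
PRIOR_ODDS = 1 / 1000  # assumed odds that a randomly chosen pair matches


def agrees(field: str, a: str, b: str) -> bool:
    """Fuzzy agreement for names, exact agreement for other fields."""
    if field == "name":
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= 0.8
    return a == b


def match_probability(r1: dict, r2: dict) -> float:
    """Sum Fellegi-Sunter match weights and convert log-odds to a probability."""
    log_odds = math.log2(PRIOR_ODDS)
    for field, (m, u) in PARAMS.items():
        if agrees(field, r1[field], r2[field]):
            log_odds += math.log2(m / u)
        else:
            log_odds += math.log2((1 - m) / (1 - u))
    odds = 2 ** log_odds
    return odds / (1 + odds)


def unified_keys(records: list, threshold: float = 0.9) -> list:
    """Link pairs above the threshold and return one cluster key per record."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All-pairs comparison for clarity; this sketch does no blocking.
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if match_probability(records[i], records[j]) >= threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(records))]
```

Two records that share a city and phone number and differ only by a name abbreviation receive the same key under these parameters, while a record with a different name and phone in the same city keeps its own. The quadratic pair loop also hints at why pairwise comparison is the expensive, skew-prone part of ER at scale.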

Caveats¶

Production-system claims are mostly qualitative. The post asserts the architecture works without disclosing end-to-end latency, accuracy, recall, or cost per facility ingested.
Coverage / completeness not measured. The post does not quantify what fraction of actual facilities in the 72 countries the FDR pipeline captures.
GPT model versions and prompt content for each of the three extraction steps are not disclosed.

Seen in¶

sources/2026-05-20-databricks-virtue-foundation-medical-volunteers-72-countries — Canonical wiki source. VF Match as the matching marketplace whose data layer is rebuilt as a production-grade FDR pipeline on Databricks. Iteration on a 2024 proof-of-concept (referenced as "Elevating Global Health: Databricks and Virtue Foundation").

systems/vf-agent — the natural-language-query layer on top of VF Match's FDR data.
systems/splink — the entity-resolution stage that produces unified keys per facility.
systems/photon — the vectorised engine that absorbs Splink's partition skew.
systems/lakeflow-jobs — orchestration substrate.
systems/apache-spark — distributed-execution substrate.
systems/overture-maps · systems/bright-data — the two complementary data sources feeding FDR.
companies/databricks — the host platform; VF Match is a Databricks-for-Good reference deployment.
concepts/entity-resolution · patterns/multi-step-llm-extraction-pipeline · concepts/multi-step-llm-extraction — central architectural shapes.