SYSTEM Cited by 1 source
VF Match
VF Match (vfmatch.org) is Virtue Foundation's production volunteer-matching marketplace: the platform that connects medical professionals to volunteer opportunities in 72 low- and lower-middle-income countries. Its value depends on a comprehensive catalog of healthcare facilities and NGOs across those countries. No such catalog exists as a single public dataset, so VF Match builds and maintains one via the Foundational Data Refresh (FDR) pipeline on Databricks.
Stub page. First wiki disclosure 2026-05-20 via the Databricks co-marketing post.
Mission context
Virtue Foundation is a non-profit focused on global health delivery and on creating an efficient marketplace for global philanthropic healthcare. To date it has delivered care to more than 50,000 patients, with a special focus on Ghana and Mongolia. VF Match is the matching layer between supply (medical volunteers) and demand (under-resourced facilities in 72 countries).
Architecture overview (production system)
VF Match's data layer is the Foundational Data Refresh (FDR) — the LLM-extraction + entity-resolution pipeline that aggregates healthcare-facility and NGO records from web-scale sources into a catalog the matching system queries.
Data sources (two complementary)
- Overture Maps — Meta + Microsoft open-source geospatial dataset. Provides authoritative locations for healthcare facilities. Deduplicated, cross-validated geometric ground truth; the "where" anchor.
- Bright Data — industrial web-scraping infrastructure. Captures real-time information from facility / NGO web pages. The "what" + "now" anchor — current operating status, specialties offered, equipment.
Pipeline shape (FDR)
- Ingestion — Overture Maps geospatial records + Bright Data scraped pages land as raw inputs.
- Multi-step LLM extraction (concepts/multi-step-llm-extraction) — instead of one-shot extraction, the pipeline decomposes into targeted GPT calls: classify medical relevance, identify organisation type (facility vs NGO), then extract specialties / equipment / procedures. 25M+ web pages processed.
- Status-based checkpointing — each record's processing state is tracked in a star schema, so re-runs resume from the failure point without paying the LLM cost of already-processed rows.
- Configurable extraction registry — each extraction method is a structured object (system prompt + extraction schema), making extraction logic modular, reproducible, and extensible.
- Entity resolution via Splink — the same facility shows up in multiple sources with name / address / contact variations; Splink's probabilistic record linkage produces a unified key per facility.
- Orchestration via Lakeflow Jobs — 15+ interdependent tasks with conditional branching, parallel execution, intelligent retry policies.
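
The registry, multi-step extraction, and status checkpointing described above can be sketched together in a few lines. This is a minimal illustration under stated assumptions, not the FDR implementation: `ExtractionMethod`, `Record`, `process`, the status names, the prompts, and the `fake_llm` stub are all hypothetical, and the source does not disclose the real prompts, schemas, or storage layout.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical registry entry: a system prompt plus the typed fields a step must return.
@dataclass(frozen=True)
class ExtractionMethod:
    name: str
    system_prompt: str
    schema: Dict[str, type]

# Three narrow steps mirroring the text: relevance, organisation type, details.
REGISTRY: List[ExtractionMethod] = [
    ExtractionMethod("relevance", "Is this page about healthcare delivery?", {"is_medical": bool}),
    ExtractionMethod("org_type", "Is this organisation a facility or an NGO?", {"org_type": str}),
    ExtractionMethod("details", "List specialties and equipment.", {"specialties": list, "equipment": list}),
]

@dataclass
class Record:
    page_id: str
    text: str
    status: str = "pending"  # pending -> <step>_done -> complete, or skipped
    fields: Dict[str, object] = field(default_factory=dict)

# Any callable taking (system_prompt, page_text) and returning a dict can act as the model.
LLM = Callable[[str, str], Dict[str, object]]

def validate(output: Dict[str, object], schema: Dict[str, type]) -> None:
    """Reject outputs that are missing a field or carry the wrong type."""
    for key, expected in schema.items():
        if not isinstance(output.get(key), expected):
            raise ValueError(f"field {key!r} missing or not {expected.__name__}")

def process(record: Record, llm: LLM) -> Record:
    """Run only the steps this record has not completed yet."""
    if record.status in ("complete", "skipped"):
        return record  # already paid for; no model calls
    names = [m.name for m in REGISTRY]
    start = 0
    if record.status.endswith("_done"):
        start = names.index(record.status[: -len("_done")]) + 1
    for method in REGISTRY[start:]:
        output = llm(method.system_prompt, record.text)
        validate(output, method.schema)
        record.fields.update(output)
        record.status = f"{method.name}_done"  # checkpoint after every step
        if method.name == "relevance" and not output["is_medical"]:
            record.status = "skipped"  # irrelevant pages never reach later steps
            return record
    record.status = "complete"
    return record

def fake_llm(system_prompt: str, text: str) -> Dict[str, object]:
    """Keyword stub so the sketch runs offline; a real model call would go here."""
    if "healthcare" in system_prompt:
        return {"is_medical": "clinic" in text.lower()}
    if "facility or an NGO" in system_prompt:
        return {"org_type": "facility"}
    return {"specialties": ["orthopedics"], "equipment": ["x-ray"]}
```

Calling `process(Record("p1", "Accra Orthopedic Clinic"), fake_llm)` walks all three steps, while calling it again on the returned record makes no further model calls. A page that fails the relevance step stops after one call, which is where the token savings of the multi-step shape come from in this sketch.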
Performance milestones
- The Splink ER stage hit the classic straggler-partition problem on Spark: the slowest partition took 30 minutes against a median of 52 seconds, roughly a 35× skew.
- Enabling Photon (Databricks' vectorised query engine) cut worst-case partitions to ~2 minutes — a 15× improvement.
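
The two ratios behind these milestones follow directly from the reported timings. The short calculation below uses only the figures quoted above and treats "~2 minutes" as exactly 120 seconds, which is an approximation.

```python
# Timings reported for the Splink ER stage, converted to seconds.
worst_before_s = 30 * 60   # slowest partition before Photon
median_s = 52              # median partition before Photon
worst_after_s = 2 * 60     # slowest partition after Photon ("~2 minutes", treated as exact)

# How much slower the straggler was than a typical partition.
skew_ratio = worst_before_s / median_s

# How much the straggler itself improved once Photon was enabled.
straggler_speedup = worst_before_s / worst_after_s

print(f"skew before Photon: {skew_ratio:.1f}x")
print(f"straggler speedup:  {straggler_speedup:.0f}x")
```

The source does not report the median partition time after Photon, so the remaining skew cannot be derived from these numbers.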
Forward-looking layer (prototype)
VF Agent is a prototype multi-agent natural-language query system on top of FDR. It is built in LangGraph with four named agents (Medical Specialty Extractor, Multi-Agent Supervisor, Vector Search Agent, Genie Agent), so healthcare professionals can ask questions such as "find me orthopedic volunteer opportunities in Ghana with X-ray equipment available" and get matches without needing to write SQL.
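
The supervisor pattern named above can be illustrated with a toy dispatcher in plain Python. This is a sketch of the general idea only: it does not use LangGraph, and the keyword vocabulary, the worker functions, and the routing rule (structured lookup when a specialty is recognised, semantic search otherwise) are all assumptions, since the source does not describe how VF Agent routes questions.

```python
from typing import Dict, List, Tuple

# Hypothetical keyword vocabulary; a real extractor would likely use a model.
SPECIALTY_KEYWORDS: Dict[str, str] = {
    "orthopedic": "orthopedics",
    "cardiac": "cardiology",
    "eye": "ophthalmology",
}

def extract_specialties(question: str) -> List[str]:
    """Stand-in for a specialty extractor: map keywords to canonical names."""
    q = question.lower()
    return sorted({canon for kw, canon in SPECIALTY_KEYWORDS.items() if kw in q})

def structured_worker(question: str, specialties: List[str]) -> str:
    """Stand-in for a worker that answers from structured tables."""
    return f"structured lookup: specialties={specialties}"

def semantic_worker(question: str, specialties: List[str]) -> str:
    """Stand-in for a worker that answers via similarity search."""
    return f"semantic search: {question}"

def supervisor(question: str) -> Tuple[str, str]:
    """Assumed routing rule: use structured lookup when a specialty is recognised."""
    specialties = extract_specialties(question)
    if specialties:
        return "structured", structured_worker(question, specialties)
    return "semantic", semantic_worker(question, specialties)
```

With this rule, the example question about orthopedic opportunities in Ghana is routed to the structured worker, while a question with no recognised specialty falls through to the semantic worker.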
Architectural lessons VF Match canonicalises
- Multi-step LLM extraction beats one-shot at production scale. "This approach dramatically reduces token consumption while focusing each model invocation on a narrow, high-precision task." See patterns/multi-step-llm-extraction-pipeline.
- Status-column checkpointing is the idempotency primitive for per-record-LLM-cost pipelines. The dollar cost of re-extracting rows is what makes resumability load-bearing — see concepts/status-based-llm-pipeline-checkpointing.
- Open-source ER frameworks (Splink) make non-profit-scale ER feasible. Closed-source ER stacks would have created a per-record-licence cost structure incompatible with the funding model.
- Photon vectorisation cut worst-case ER partition time by 15×. The 30 min → ~2 min observation is the wiki's first quantified instance of systems/photon applied to ER pairwise-comparison skew, as distinct from prior Photon mentions in OLAP-query contexts.
- Two complementary data sources (geospatial authority + web evidence) resolve healthcare-facility ground truth. Either alone is insufficient: Overture lacks operational details; scraped pages lack authoritative geometry. Joining them via Splink ER produces records anchored on both axes.
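
To make the probabilistic-linkage idea behind the Splink lessons above concrete, here is a toy scoring function in the Fellegi-Sunter style, the model Splink is built on. It is not Splink's API: the field names, the m/u probabilities, and the prior are made-up illustrative values, and a real deployment would estimate them from data.

```python
import math
from typing import Dict

# Hypothetical parameters per comparison field:
#   m = P(field agrees | records are the same facility)
#   u = P(field agrees | records are different facilities)
PARAMS: Dict[str, tuple] = {
    "name": (0.90, 0.01),
    "address": (0.80, 0.05),
    "phone": (0.95, 0.001),
}
PRIOR = 0.001  # assumed probability that a random candidate pair is a true match

def match_probability(agreements: Dict[str, bool]) -> float:
    """Combine per-field evidence into a posterior match probability."""
    log_odds = math.log2(PRIOR / (1 - PRIOR))
    for field_name, agrees in agreements.items():
        m, u = PARAMS[field_name]
        if agrees:
            log_odds += math.log2(m / u)              # agreement raises the odds
        else:
            log_odds += math.log2((1 - m) / (1 - u))  # disagreement lowers them
    odds = 2 ** log_odds
    return odds / (1 + odds)

# Two records for what may be the same facility, seen in two sources.
all_agree = match_probability({"name": True, "address": True, "phone": True})
name_only = match_probability({"name": True, "address": False, "phone": False})
```

Under these assumed parameters, a pair agreeing on all three fields scores far above a pair that agrees on name alone, which is how name, address, and contact variations across sources can still resolve to one unified key.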
Caveats
- Production-system claims are mostly qualitative. The post asserts the architecture works without disclosing end-to-end latency, accuracy, recall, or cost per facility ingested.
- Coverage / completeness not measured. The post does not quantify what fraction of actual facilities in the 72 countries the FDR pipeline captures.
- GPT model versions and prompt content for each of the three extraction steps are not disclosed.
Seen in
- sources/2026-05-20-databricks-virtue-foundation-medical-volunteers-72-countries — Canonical wiki source. VF Match as the matching marketplace whose data layer is rebuilt as a production-grade FDR pipeline on Databricks. Iteration on a 2024 proof-of-concept (referenced as "Elevating Global Health: Databricks and Virtue Foundation").
Related
- systems/vf-agent — the natural-language-query layer on top of VF Match's FDR data.
- systems/splink — the entity-resolution stage that produces unified keys per facility.
- systems/photon — the vectorised engine that absorbs Splink's partition skew.
- systems/lakeflow-jobs — orchestration substrate.
- systems/apache-spark — distributed-execution substrate.
- systems/overture-maps · systems/bright-data — the two complementary data sources feeding FDR.
- companies/databricks — the host platform; VF Match is a Databricks-for-Good reference deployment.
- concepts/entity-resolution · patterns/multi-step-llm-extraction-pipeline · concepts/multi-step-llm-extraction — central architectural shapes.