SYSTEM Cited by 1 source

VF Match

VF Match (vfmatch.org) is Virtue Foundation's production volunteer-matching marketplace: the platform that connects medical professionals to volunteer opportunities in 72 low- and lower-middle-income countries. The platform's value depends on a comprehensive catalog of healthcare facilities and NGOs across those countries, which does not exist as a single public dataset, so VF Match builds and maintains it via the Foundational Data Refresh (FDR) pipeline on Databricks.

Stub page. First wiki disclosure 2026-05-20 via the Databricks co-marketing post.

Mission context

Virtue Foundation is a non-profit focused on global health delivery and on creating an efficient marketplace for global philanthropic healthcare. To date, it has delivered care to 50,000+ patients, with a special focus on Ghana and Mongolia. VF Match is the matching layer between supply (medical volunteers) and demand (under-resourced facilities in 72 countries).

Architecture overview (production system)

VF Match's data layer is the Foundational Data Refresh (FDR) — the LLM-extraction + entity-resolution pipeline that aggregates healthcare-facility and NGO records from web-scale sources into a catalog the matching system queries.

Data sources (two complementary)

  • Overture Maps — Meta + Microsoft open-source geospatial dataset. Provides authoritative locations for healthcare facilities. Deduplicated, cross-validated geometric ground truth; the "where" anchor.
  • Bright Data — industrial web-scraping infrastructure. Captures real-time information from facility / NGO web pages. The "what" + "now" anchor — current operating status, specialties offered, equipment.

Pipeline shape (FDR)

  1. Ingestion — Overture Maps geospatial records + Bright Data scraped pages land as raw inputs.
  2. Multi-step LLM extraction (concepts/multi-step-llm-extraction) — instead of one-shot extraction, the pipeline decomposes into targeted GPT calls: classify medical relevance, identify organisation type (facility vs NGO), then extract specialties / equipment / procedures. 25M+ web pages processed.
  3. Status-based checkpointing — each record's processing state is tracked in a star schema, so re-runs resume from the failure point without paying the LLM cost of already-processed rows.
  4. Configurable extraction registry — each extraction method is a structured object (system prompt + extraction schema), making extraction logic modular, reproducible, and extensible.
  5. Entity resolution via Splink — the same facility shows up in multiple sources with name / address / contact variations; Splink's probabilistic record linkage produces a unified key per facility.
  6. Orchestration via Lakeflow Jobs — 15+ interdependent tasks with conditional branching, parallel execution, intelligent retry policies.
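
Steps 2 to 4 above can be sketched together in plain Python. This is a minimal illustration, not the FDR code: the `ExtractionMethod` and `Record` classes, the status names, and the prompts are all hypothetical, and `fake_llm` is a keyword-rule stand-in for the real GPT calls (whose models and prompts the post does not disclose). It shows a registry of (system prompt + schema) objects, a three-step decomposition, and a per-record status that lets a re-run skip work already paid for.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ExtractionMethod:
    """Hypothetical registry entry: a system prompt plus the schema a step must fill."""
    name: str
    system_prompt: str
    schema: dict  # output field name -> expected Python type


# Assumed three-step decomposition, mirroring the steps named in the list above.
REGISTRY = {
    "relevance": ExtractionMethod("relevance", "Is this page about healthcare?",
                                  {"is_medical": bool}),
    "org_type": ExtractionMethod("org_type", "Is this a facility or an NGO?",
                                 {"org_type": str}),
    "details": ExtractionMethod("details", "List specialties and equipment.",
                                {"specialties": list, "equipment": list}),
}
STEPS = ["relevance", "org_type", "details"]
# Status value -> index of the next step to run (illustrative status names).
RESUME_AT = {"pending": 0, "relevance_done": 1, "org_type_done": 2}


@dataclass
class Record:
    record_id: str
    page_text: str
    status: str = "pending"
    extracted: dict = field(default_factory=dict)


def fake_llm(method, page_text):
    """Stand-in for a real model call: keyword rules only, for illustration."""
    text = page_text.lower()
    if method.name == "relevance":
        return {"is_medical": "clinic" in text or "hospital" in text}
    if method.name == "org_type":
        return {"org_type": "ngo" if "ngo" in text else "facility"}
    return {"specialties": [s for s in ("orthopedics", "cardiology") if s in text],
            "equipment": [e for e in ("x-ray", "ultrasound") if e in text]}


def process(record, llm=fake_llm):
    """Run only the steps this record has not completed; return LLM calls made."""
    if record.status not in RESUME_AT:  # already finished or skipped
        return 0
    calls = 0
    for step in STEPS[RESUME_AT[record.status]:]:
        method = REGISTRY[step]
        result = llm(method, record.page_text)
        calls += 1
        for key, expected in method.schema.items():
            if not isinstance(result.get(key), expected):
                raise ValueError(f"{step}: field {key!r} missing or wrong type")
        record.extracted.update(result)
        record.status = f"{step}_done"  # checkpoint after every successful step
        if step == "relevance" and not result["is_medical"]:
            record.status = "skipped"   # irrelevant page: skip the later steps
            break
    return calls


records = [
    Record("a", "City Hospital offers orthopedics with X-ray"),
    Record("b", "Best pizza in town"),
    Record("c", "Rural clinic run by an NGO, ultrasound available",
           status="relevance_done"),  # simulates a run that failed after step one
]
first_run = [process(r) for r in records]
second_run = [process(r) for r in records]
print(first_run, second_run)
```

In this sketch the non-medical page costs one call instead of three, the partially processed record resumes at its second step, and the second run makes no calls at all. That is the cost argument behind both the multi-step decomposition and the status column.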

Performance milestones

  • The Splink ER stage hit the classic straggler-partition problem on Spark: the slowest partition took 30 minutes against a median of 52 seconds, roughly a 35× skew.
  • Enabling Photon (Databricks' vectorised query engine) cut worst-case partitions to ~2 minutes — a 15× improvement.
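
The ratios behind these two figures can be checked with a few lines of arithmetic. The post-Photon skew below assumes the median stayed at 52 seconds, which the post does not state, and treats "~2 minutes" as exactly 120 seconds.

```python
median_s = 52              # median partition time (from the post)
worst_before_s = 30 * 60   # worst-case partition before Photon
worst_after_s = 2 * 60     # worst-case partition after Photon ("~2 minutes")

skew_before = worst_before_s / median_s      # straggler vs median, before
improvement = worst_before_s / worst_after_s # worst-case speed-up from Photon
# Assumption: the median is unchanged after enabling Photon.
skew_after = worst_after_s / median_s

print(f"skew before: {skew_before:.1f}x")
print(f"worst-case improvement: {improvement:.0f}x")
print(f"skew after (assumed median): {skew_after:.1f}x")
```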

Forward-looking layer (prototype)

VF Agent is a multi-agent natural-language query system on top of FDR. It is built in LangGraph with four named agents (Medical Specialty Extractor, Multi-Agent Supervisor, Vector Search Agent, and Genie Agent) so that healthcare professionals can ask questions such as "find me orthopedic volunteer opportunities in Ghana with X-ray equipment available" and get matches without needing to write SQL.
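
The supervisor pattern can be illustrated in plain Python without any LangGraph API. Everything below is an assumption made for illustration: the post names the four agents but does not describe their logic, so the specialty vocabulary, the routing heuristic (aggregate-style questions go to the Genie agent, everything else to vector search), and the return shapes are hypothetical.

```python
# Hypothetical vocabulary; the real extractor is presumably model-based.
KNOWN_SPECIALTIES = ("orthopedic", "cardiology", "pediatric")
# Assumed cue words for questions better served by structured queries.
AGGREGATE_CUES = ("how many", "count", "average")


def medical_specialty_extractor(query):
    """Return the known specialty terms that appear in the query."""
    q = query.lower()
    return [s for s in KNOWN_SPECIALTIES if s in q]


def vector_search_agent(query, specialties):
    """Placeholder for semantic search over the facility catalog."""
    return {"agent": "vector_search", "specialties": specialties, "query": query}


def genie_agent(query, specialties):
    """Placeholder for a structured, SQL-style lookup."""
    return {"agent": "genie", "specialties": specialties, "query": query}


def supervisor(query):
    """Extract specialties, then hand the query to one downstream agent."""
    specialties = medical_specialty_extractor(query)
    is_aggregate = any(cue in query.lower() for cue in AGGREGATE_CUES)
    handler = genie_agent if is_aggregate else vector_search_agent
    return handler(query, specialties)


example = supervisor(
    "find me orthopedic volunteer opportunities in Ghana with X-ray equipment available"
)
print(example["agent"], example["specialties"])
```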

Architectural lessons VF Match canonicalises

  • Multi-step LLM extraction beats one-shot at production scale. "This approach dramatically reduces token consumption while focusing each model invocation on a narrow, high-precision task." See patterns/multi-step-llm-extraction-pipeline.
  • Status-column checkpointing is the idempotency primitive for per-record-LLM-cost pipelines. The dollar cost of re-extracting rows is what makes resumability load-bearing — see concepts/status-based-llm-pipeline-checkpointing.
  • Open-source ER frameworks (Splink) make non-profit-scale ER feasible. Closed-source ER stacks would have created a per-record-licence cost structure incompatible with the funding model.
  • Photon vectorisation cut the worst-case ER partition time by 15×. The 30 min → ~2 min observation is the wiki's first quantified instance of systems/photon applied to ER pairwise-comparison skew, as distinct from earlier Photon mentions in the context of OLAP queries.
  • Two complementary data sources (geospatial authority + web evidence) resolve healthcare-facility ground truth. Either alone is insufficient: Overture lacks operational details; scraped pages lack authoritative geometry. Joining them via Splink ER produces records anchored on both axes.
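
The probabilistic record linkage that joins the two sources can be sketched with the Fellegi-Sunter model, which Splink is based on. This is not Splink's API: the field list, the m and u probabilities, the prior, and the sample records (including the phone numbers) are made up for illustration, and comparisons here are exact-match only after light normalisation.

```python
import math

# Hypothetical parameters per field:
# m = P(field agrees | same facility), u = P(field agrees | different facilities)
PARAMS = {
    "name": {"m": 0.90, "u": 0.01},
    "city": {"m": 0.95, "u": 0.10},
    "phone": {"m": 0.80, "u": 0.001},
}


def normalise(value):
    """Lower-case, drop full stops, and collapse whitespace."""
    return " ".join(value.lower().replace(".", "").split())


def match_weight(a, b):
    """Sum of log2 Bayes factors across the compared fields."""
    total = 0.0
    for name, p in PARAMS.items():
        if normalise(a[name]) == normalise(b[name]):
            total += math.log2(p["m"] / p["u"])              # agreement evidence
        else:
            total += math.log2((1 - p["m"]) / (1 - p["u"]))  # disagreement evidence
    return total


def match_probability(weight, prior=0.001):
    """Posterior match probability from prior odds times the Bayes factor."""
    odds = (prior / (1 - prior)) * 2 ** weight
    return odds / (1 + odds)


# Made-up records: one geospatial entry and two scraped pages.
overture = {"name": "St. Mary Hospital", "city": "Accra", "phone": "+233 30 000 0000"}
scraped = {"name": "st mary hospital", "city": "Accra", "phone": "+233 30 000 0000"}
other = {"name": "Ridge Clinic", "city": "Accra", "phone": "+233 30 111 1111"}

p_same = match_probability(match_weight(overture, scraped))
p_other = match_probability(match_weight(overture, other))
print(f"same facility: {p_same:.4f}, different facility: {p_other:.6f}")
```

Pairs scoring above a chosen threshold would share a unified key, so the geospatial record and the scraped details end up on one facility row. Agreement on city alone is weak evidence because many facilities share a city, which is why its u value is set high here.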

Caveats

  • Production-system claims are mostly qualitative. The post asserts the architecture works without disclosing end-to-end latency, accuracy, recall, or cost per facility ingested.
  • Coverage / completeness not measured. The post does not quantify what fraction of actual facilities in the 72 countries the FDR pipeline captures.
  • GPT model versions and prompt content for each of the three extraction steps are not disclosed.

Seen in
