PATTERN Cited by 1 source
Daily S3 vectorstore update pipeline¶
Pattern shape¶
Keep a RAG vectorstore fresh on a daily cadence via a scheduled batch job that:
- Fetches the latest documents from an internal source-of-truth endpoint.
- Extracts the candidate text and metadata, normalising them into a CSV of (article_id, title, summary, header_text, ...) rows.
- Uploads the CSV to AWS S3 (or any object store).
- Each chatbot container, on startup, downloads the CSV, constructs the FAISS index, computes fresh embeddings, and loads the vectorstore into memory during the health-check process.
The pattern is batch-not-stream, with data-not-index as the durable artifact in S3.
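The batch half of the pattern can be sketched as below. This is a minimal illustration, not Yelp's implementation: the column list is taken from the example row shape above, and `fetch_articles`, `upload`, and the `vectorstore/latest.csv` key are hypothetical names introduced so the sketch runs without any cloud dependency.

```python
import csv
import io
from typing import Callable, Iterable, Mapping

# Hypothetical column set, mirroring the example row shape in the text.
COLUMNS = ["article_id", "title", "summary", "header_text"]


def build_csv(articles: Iterable[Mapping[str, str]]) -> bytes:
    """Serialise article records into a UTF-8 CSV with a fixed header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    for article in articles:
        # Missing fields become empty strings; unknown fields are dropped.
        writer.writerow({col: article.get(col, "") for col in COLUMNS})
    return buf.getvalue().encode("utf-8")


def run_daily_job(
    fetch_articles: Callable[[], Iterable[Mapping[str, str]]],
    upload: Callable[[str, bytes], None],
    key: str = "vectorstore/latest.csv",  # illustrative object key
) -> int:
    """Fetch, serialise, upload. Returns the number of bytes uploaded."""
    payload = build_csv(fetch_articles())
    upload(key, payload)
    return len(payload)
```

With boto3, `upload` could be a thin wrapper around `s3.put_object(Bucket=..., Key=key, Body=payload)`; the bucket name is deployment-specific and left out here.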
Canonical instance — Yelp CS Chatbot (2026-05-27)¶
Verbatim from the post:
"Since our Support Center articles are frequently updated, maintaining a current knowledge base is critical. We established an automated daily update pipeline using a scheduled batch job: 1. Fetching and Processing: The job fetches updated articles from an internal endpoint. It converts the articles to markdown format, extracts the necessary headers, and constructs a CSV file containing all the candidate text and metadata. 2. Storage: This generated CSV file is uploaded to AWS S3 daily. 3. Loading: When the chatbot's container starts, it downloads the latest CSV from S3. The vectorstore, including the construction of the index and calculation of fresh embeddings for the newly fetched articles and metadata, is then dynamically loaded into memory during the health check process, ensuring the system operates with the freshest information." (Source: sources/2026-05-27-yelp-beyond-the-menu-tree-how-yelp-built-a-smarter-customer-success-chatbot)
Three structural choices¶
- Batch-not-stream. Daily, not minute-level, not realtime. Acceptable when the corpus changes on the order of days (Support Center articles updated by humans, not by an event source).
- Data-not-index in S3. The durable artifact is the CSV of source data, not the FAISS index file. The FAISS index is rebuilt from the CSV on every container start. This avoids the index-format-versioning problem: because no index file is stored, a change in the FAISS index format cannot break a stored artifact.
- Load-at-health-check. Not lazy-on-first-request. The container does not advertise as healthy until the vectorstore is loaded; first user request hits a fully-warm vectorstore.
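One way to read the load-at-health-check choice is a readiness gate: the health check reports unhealthy until the vectorstore exists in memory. The sketch below assumes an HTTP-style status code and a hypothetical `loader` callable (download the CSV, embed, build the index); the source does not describe how Yelp wires this into its health check.

```python
import threading
from typing import Callable, Optional


class VectorstoreHolder:
    """Holds the in-memory vectorstore and reports readiness for health checks."""

    def __init__(self, loader: Callable[[], object]) -> None:
        self._loader = loader  # hypothetical: downloads CSV, embeds, builds index
        self._store: Optional[object] = None
        self._lock = threading.Lock()

    def load(self) -> None:
        """Build the vectorstore once; safe to call from repeated health checks."""
        with self._lock:
            if self._store is None:
                self._store = self._loader()

    def health_status(self) -> int:
        """HTTP-style status: 503 until the vectorstore is loaded, then 200."""
        return 200 if self._store is not None else 503

    def get(self) -> object:
        """Return the loaded vectorstore, or fail if the container is not ready."""
        if self._store is None:
            raise RuntimeError("vectorstore not loaded; container should not be serving")
        return self._store
```

Because the orchestrator only routes traffic to containers that report 200, no user request can reach a container whose vectorstore is still being built.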
Pipeline flow¶
                    ┌─────────────────────────┐
[Daily cron] ─────▶ │  Internal docs endpoint │
                    └────────────┬────────────┘
                                 │ Fetch updated articles
                                 ▼
                    ┌─────────────────────────┐
                    │  Extract metadata,      │
                    │  convert to markdown,   │
                    │  build CSV              │
                    └────────────┬────────────┘
                                 │ s3:PutObject
                                 ▼
                    ┌─────────────────────────┐
                    │  S3 bucket              │
                    │  /vectorstore/latest.csv│
                    └────────────┬────────────┘
                                 │ s3:GetObject
[Container start] ──────────────▶│
                                 ▼
                    ┌─────────────────────────┐
                    │  Download CSV           │
                    ├─────────────────────────┤
                    │  For each row:          │
                    │    ada-002 embed segment│
                    │    add to FAISS index   │
                    ├─────────────────────────┤
                    │  FAISS smart indexing + │
                    │  quantization           │
                    ├─────────────────────────┤
                    │  Vectorstore in memory  │
                    └────────────┬────────────┘
                                 │
                                 ▼
                    ┌─────────────────────────┐
                    │  Container advertises   │
                    │  as HEALTHY             │
                    └────────────┬────────────┘
                                 │
                                 ▼
                    ┌─────────────────────────┐
                    │  Serving requests       │
                    └─────────────────────────┘
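The container-start half of the flow (download, embed each row, add to an index) can be sketched as follows. To stay runnable with the standard library only, an exact brute-force cosine search stands in for FAISS, and `embed` is a hypothetical callable standing in for the ada-002 embedding call. Embedding the title and summary together is also an illustrative choice, not something the source specifies.

```python
import csv
import io
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def _normalise(v: Vector) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


class BruteForceIndex:
    """Exact cosine-similarity search; a stand-in for a FAISS index."""

    def __init__(self) -> None:
        self._ids: List[str] = []
        self._vectors: List[List[float]] = []

    def add(self, article_id: str, vector: Vector) -> None:
        self._ids.append(article_id)
        self._vectors.append(_normalise(vector))

    def search(self, query: Vector, k: int = 3) -> List[Tuple[str, float]]:
        """Return the k (article_id, cosine score) pairs with the highest scores."""
        q = _normalise(query)
        scored = [
            (aid, sum(a * b for a, b in zip(vec, q)))
            for aid, vec in zip(self._ids, self._vectors)
        ]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def build_index(csv_bytes: bytes, embed: Callable[[str], Vector]) -> BruteForceIndex:
    """Parse the downloaded CSV and embed one text segment per row."""
    index = BruteForceIndex()
    reader = csv.DictReader(io.StringIO(csv_bytes.decode("utf-8")))
    for row in reader:
        # Hypothetical choice: embed the title and summary columns together.
        index.add(row["article_id"], embed(f'{row["title"]} {row["summary"]}'))
    return index
```

Since the index is derived entirely from the CSV and the embedding function, nothing index-shaped ever needs to be stored or migrated.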
Why this beats streaming-pipeline updates¶
- Operational simplicity. A daily cron + S3 upload + CSV parse on container start is dramatically simpler than a streaming embedding-update pipeline (CDC → Kafka → embed worker → vector DB upsert).
- Cost. S3 storage is pennies/month for an 8 MB CSV. Daily batch + container-start embed is much cheaper than per-update streaming compute.
- Failure containment. A failed batch job means yesterday's CSV is still in S3; new containers boot from the most-recent successful upload. No partial-update inconsistency.
- Deploy semantics. Every container deploy effectively also deploys the latest vectorstore. No separate vector-DB schema migration step.
When to apply¶
Use when:
- Source corpus changes on daily-or-slower cadence.
- Corpus is small enough that re-embedding the entire CSV on each container start is fast (Yelp: ~370 × ~5 = ~1,850 embeds; with batched ada-002 API calls this takes on the order of seconds).
- Per-container in-memory vectorstore (patterns/in-memory-vectorstore-loaded-at-container-start) is the chosen retrieval architecture.
Don't use when:
- Source corpus changes sub-daily — staleness window too large.
- Corpus is large (millions of documents) — re-embedding cost per container start becomes prohibitive; need incremental update.
- Container starts are frequent (autoscaling, blue-green deploys, fast rollouts) — every restart pays the embed cost; better to persist the index alongside the CSV.
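The sizing argument can be checked with a few lines of arithmetic. The sketch reads the two factors in the Yelp figure as article count and segments per article, which is an assumption (the text gives only the numbers), and the batch size of 256 inputs per embedding request is a hypothetical value chosen for illustration.

```python
import math
from typing import Tuple


def embedding_requests(
    articles: int, segments_per_article: int, batch_size: int
) -> Tuple[int, int]:
    """Return (total segments to embed, number of batched API requests)."""
    total = articles * segments_per_article
    return total, math.ceil(total / batch_size)


# Yelp-scale corpus from the text: ~370 x ~5 segments.
small = embedding_requests(articles=370, segments_per_article=5, batch_size=256)

# A corpus in the "millions of documents" range, same hypothetical batch size.
large = embedding_requests(articles=1_000_000, segments_per_article=5, batch_size=256)

print(small)  # (1850, 8)
print(large)  # (5000000, 19532)
```

At eight requests per container start the rebuild is negligible; at tens of thousands it is paid on every restart, which is where persisting the index or updating incrementally starts to matter.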
Risks¶
- Stale-data window. Up to 24h between Support Center article update and vectorstore reflection. Yelp accepts this trade-off; products with tighter freshness needs require more frequent batches or streaming.
- Daily-batch failure → stale-CSV-fallback. If today's batch job fails, new containers boot from yesterday's CSV. Need: alerting on missed batch runs.
- Embedding-API rate limit at container-start. A burst of container starts (e.g. mass autoscale-out) drives a burst of ada-002 embedding API calls. Mitigate: a shared embedding cache that survives across containers (for example, stored in S3 alongside the CSV), or staggered container starts.
- CSV schema drift. Container code expecting a summary column when the CSV ships a description column → silent failure. Need: schema validation on CSV load.
- Index-build CPU on container start. Container's first N seconds of life are CPU-spent on FAISS index build. If container startup is on the critical path of an autoscale response, the index-build cost is added latency; acceptable for slow-scale chatbot workloads.
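Two of the mitigations named above, schema validation on load and detection of a missed batch run, are cheap to sketch. The required column set and the 36-hour staleness threshold below are hypothetical. In practice the columns should match whatever the batch job writes, and the threshold should sit somewhat above the 24-hour cadence.

```python
import csv
import io
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical required columns; align with the batch job's output.
REQUIRED_COLUMNS = {"article_id", "title", "summary", "header_text"}


class CsvSchemaError(ValueError):
    """Raised when the downloaded CSV does not match the expected schema."""


def validate_csv(csv_bytes: bytes) -> int:
    """Fail loudly on missing columns or an empty file; return the row count."""
    reader = csv.DictReader(io.StringIO(csv_bytes.decode("utf-8")))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise CsvSchemaError(f"missing columns: {sorted(missing)}")
    rows = sum(1 for _ in reader)
    if rows == 0:
        raise CsvSchemaError("CSV has a header but no rows")
    return rows


def is_stale(
    last_modified: datetime,
    now: Optional[datetime] = None,
    max_age: timedelta = timedelta(hours=36),  # illustrative threshold
) -> bool:
    """True when the CSV is older than max_age, suggesting a missed batch run."""
    now = now or datetime.now(timezone.utc)
    return now - last_modified > max_age
```

Running `validate_csv` before the index build turns a silent column mismatch into a failed health check, and `is_stale` can be fed the object's last-modified timestamp to drive an alert.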
Composes with¶
- patterns/in-memory-vectorstore-loaded-at-container-start — the in-memory residency choice is what enables container-start rebuild.
- patterns/whole-article-retrieval-via-metadata-segments — metadata-only-embedding is what makes the daily-batch CSV small (8 MB).
Seen in¶
- sources/2026-05-27-yelp-beyond-the-menu-tree-how-yelp-built-a-smarter-customer-success-chatbot — canonical: scheduled batch → CSV → AWS S3 → container-start download + index build at health-check time.
Related¶
- concepts/retrieval-augmented-generation — the parent shape.
- systems/aws-s3 — durable artifact tier.
- systems/faiss — in-process ANN library substrate.
- systems/yelp-cs-chatbot — canonical wiki instance.