PATTERN Cited by 1 source

Configuration-as-code feature pipeline

Pattern

Express sequence, enrichment, and event-type definitions as configuration-as-code in a general-purpose programming language (Pinterest uses Python) with a well-defined schema. Validate the configuration, compile it to a portable JSON format, store the compiled artefact in managed object storage, and have all runtimes (streaming indexer, batch indexer, online serving) consume that same compiled artefact (Source: sources/2026-05-21-pinterest-making-user-sequence-data-more-cost-efficient-faster-and-easier-to-use).

Shape

   Author config in Python (versioned in VCS)
                  ▼
   Validate against schema (typed; references resolved)
                  ▼
   Compile to portable JSON
                  ▼
   Store in managed object storage (immutable, versioned)
                  │
       ┌──────────┼───────────┐
       ▼          ▼           ▼
  Streaming    Batch       Online
   indexer    indexer      serving
   (engine    (engine      (engine
    reads      reads        reads
    JSON)      JSON)        JSON)
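
The fan-out at the bottom of the diagram can be sketched as follows. This is a minimal illustration, not Pinterest's implementation: `ArtifactStore` is a hypothetical in-memory stand-in for managed object storage, `Runtime` is a hypothetical placeholder for the three engines, and using the SHA-256 digest of the compiled JSON as the version key is an assumption made here to get immutability cheaply.

```python
import hashlib
import json


class ArtifactStore:
    """Hypothetical in-memory stand-in for managed object storage."""

    def __init__(self):
        self._objects = {}

    def put(self, compiled_json: str) -> str:
        # Assumption: content-addressed keys, so a stored artefact is never overwritten.
        key = hashlib.sha256(compiled_json.encode("utf-8")).hexdigest()
        self._objects.setdefault(key, compiled_json)
        return key

    def get(self, key: str) -> str:
        return self._objects[key]


class Runtime:
    """Hypothetical engine that only knows how to read the compiled JSON."""

    def __init__(self, name: str, store: ArtifactStore):
        self.name = name
        self.store = store
        self.config = None

    def load(self, version: str) -> None:
        self.config = json.loads(self.store.get(version))


store = ArtifactStore()
# sort_keys makes the serialised form deterministic for a given config.
compiled = json.dumps({"event_types": [{"name": "pin_click"}]}, sort_keys=True)
version = store.put(compiled)

runtimes = [
    Runtime(name, store)
    for name in ("streaming_indexer", "batch_indexer", "online_serving")
]
for runtime in runtimes:
    runtime.load(version)  # every runtime reads the same artefact
```

Because all three runtimes load by the same version key, any disagreement between them has to come from engine behaviour rather than from diverging definitions.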

Three distinct configuration types in Pinterest's instance:

  • Sequence-feature config — which sequence features exist, naming, owners, retention, lifecycle stage.
  • Event-type config — for each event type: applicable enrichments, filtering logic, source data origins.
  • Enrichment config — how to fetch / derive each signal (e.g. embeddings) and how to map it into the event schema.
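
A minimal sketch of how the three configuration types might be authored in Python and compiled to JSON. The source does not publish Pinterest's schema, so every class name, field name, and value below is a hypothetical chosen to mirror the bullets above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EnrichmentConfig:
    name: str
    source: str        # where the signal is fetched from (hypothetical)
    output_field: str  # event-schema field the signal is mapped into


@dataclass(frozen=True)
class EventTypeConfig:
    name: str
    sources: tuple[str, ...]       # source data origins
    enrichments: tuple[str, ...]   # names of EnrichmentConfig entries
    filters: tuple[str, ...] = ()  # names of filters to apply


@dataclass(frozen=True)
class SequenceFeatureConfig:
    name: str
    owner: str
    event_types: tuple[str, ...]
    retention_days: int
    lifecycle: str = "experimental"


def compile_to_json(features, event_types, enrichments) -> str:
    """Flatten the typed objects into one deterministic JSON document."""
    doc = {
        "sequence_features": [asdict(f) for f in features],
        "event_types": [asdict(e) for e in event_types],
        "enrichments": [asdict(e) for e in enrichments],
    }
    return json.dumps(doc, sort_keys=True, indent=2)


enrichments = [
    EnrichmentConfig("pin_embedding", source="embedding_service", output_field="embedding"),
]
event_types = [
    EventTypeConfig(
        "pin_click",
        sources=("events.pin_click",),
        enrichments=("pin_embedding",),
        filters=("drop_bot_traffic",),
    ),
]
features = [
    SequenceFeatureConfig(
        "user_click_sequence",
        owner="ml-platform",
        event_types=("pin_click",),
        retention_days=90,
    ),
]

compiled = compile_to_json(features, event_types, enrichments)
```

Frozen dataclasses keep authored objects immutable, and `sort_keys=True` keeps the compiled output stable across runs, which helps keep diffs of the artefact readable.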

Why configuration-as-code (not YAML, not a UI)

Pinterest's choice (Python with a defined schema, compiled to JSON and stored in object storage) combines four properties:

Property                  | Why it matters
Real programming language | Type-checked, IDE-supported, composable, supports helpers / shared modules
Defined schema            | Validation + invariants + cross-reference checks at compile time, before any runtime sees the config
JSON compile target       | Language-agnostic at runtime, embeddable in any execution engine, immutable once compiled
Object storage            | Versioned, audit-trailed, accessible from any runtime regardless of where it executes

YAML gives the JSON portability without the type-checking; a UI gives the validation without the diff-friendly version history. Configuration-as-code wins on both axes.
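
To make the "real programming language" and "defined schema" rows concrete, here is a hedged sketch of a shared helper plus a compile-time cross-reference check. The helper `engagement_event`, the `validate` function, and all config keys are hypothetical; the source only states that validation happens before any runtime sees the config.

```python
def validate(event_types: dict, enrichments: dict) -> list[str]:
    """Collect schema and cross-reference errors before anything is compiled."""
    errors = []
    for name, event_type in event_types.items():
        if not event_type.get("sources"):
            errors.append(f"{name}: at least one source is required")
        for ref in event_type.get("enrichments", []):
            if ref not in enrichments:
                errors.append(f"{name}: unknown enrichment '{ref}'")
    return errors


def engagement_event(name: str, extra_enrichments=()) -> dict:
    """Shared helper: one function instead of copy-pasted config blocks."""
    return {
        "sources": [f"events.{name}"],
        "enrichments": ["pin_embedding", *extra_enrichments],
    }


enrichments = {"pin_embedding": {"output_field": "embedding"}}
event_types = {
    "pin_click": engagement_event("pin_click"),
    # 'board_topic' is deliberately not defined, to show the check firing.
    "pin_save": engagement_event("pin_save", ["board_topic"]),
}

errors = validate(event_types, enrichments)
for error in errors:
    print(error)
```

In this sketch the dangling reference is reported at authoring time, so a build step could refuse to produce a JSON artefact while `errors` is non-empty.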

Named benefits (from Pinterest)

"New event types or enrichments can now be added primarily through configuration, plus small, isolated pieces of code where absolutely necessary, instead of via entirely new pipelines. That significantly reduces the concept-to-production time for new signals."

"Diffs are human-readable, code owners can review changes, rollbacks are straightforward, and version history lives in standard version control systems."

"ML and product teams focus on what they want (events, features, and filters) while platform teams focus on how to execute that configuration reliably and efficiently."

Three-axis payoff: velocity (config changes vs new pipelines), safety (review + rollback + audit), separation of concerns (what vs how).

What lives in config vs in code

Lives in config            | Lives in executor code
Which sources to read      | How to filter raw events
Which enrichments to apply | How to compute featurised attrs
Output schema + retention  | How to map raw → normalised
Owner + lifecycle metadata | How to dispatch to enrichment services

If your config starts containing imperative business logic (loops, branching that depends on event payload), the boundary has been violated — that logic belongs in an executor plugin. If your executor starts hardcoding source URLs or schema field names, those should be in config.
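
One way to keep that boundary honest is a plugin registry: config names a filter, and the executor owns its implementation. The sketch below is illustrative only; the registry, the decorator, and the filter names are hypothetical and not taken from the source.

```python
from typing import Callable, Dict, List

# Executor side: imperative logic lives in named, reviewable plugins.
FILTERS: Dict[str, Callable[[dict], bool]] = {}


def register_filter(name: str):
    """Decorator that records a predicate under a config-visible name."""
    def decorator(fn: Callable[[dict], bool]) -> Callable[[dict], bool]:
        FILTERS[name] = fn
        return fn
    return decorator


@register_filter("drop_bot_traffic")
def drop_bot_traffic(event: dict) -> bool:
    return not event.get("is_bot", False)


@register_filter("require_item_id")
def require_item_id(event: dict) -> bool:
    return event.get("item_id") is not None


def apply_filters(event_type_config: dict, events: List[dict]) -> List[dict]:
    """Keep events that pass every filter named in the config."""
    predicates = [FILTERS[name] for name in event_type_config["filters"]]
    return [e for e in events if all(p(e) for p in predicates)]


# Config side: purely declarative, no loops or payload-dependent branching.
config = {"name": "pin_click", "filters": ["drop_bot_traffic", "require_item_id"]}

events = [
    {"item_id": 1},
    {"item_id": 2, "is_bot": True},
    {"item_id": None},
]
kept = apply_filters(config, events)
```

A filter name that is missing from the registry raises `KeyError` here; a fuller design would likely reject it during validation instead.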

When this pattern fits

  • ML feature platforms with many signals, many event types, many enrichments.
  • Multi-tenant data substrates where many teams want to add signals safely.
  • Multi-runtime execution (streaming + batch + serving) where one definition must drive all runtimes — see one definition, many runtimes.

When it doesn't fit

  • Single-tenant prototypes where only one team owns one pipeline — the configuration overhead exceeds the velocity benefit.
  • Pipelines where most logic varies per event type and config can't reasonably abstract it — the executor surface dominates and config-as-code becomes a thin shim.
  • Stacks where the runtime can't ingest a portable config format (no shared engine, no JSON parser, no plugin framework).

Caveats

  • Schema evolution is hard. Once many tenants depend on the config schema, breaking changes need versioning + migration plans of their own. "Configuration as code" doesn't dodge this; it just makes the schema's authoring surface programmable.
  • Compilation is a critical path. A buggy compiler corrupts the substrate for every runtime. Pinterest doesn't disclose how they version the compiler relative to the configs; in practice the two likely need to be versioned and deployed together.
  • Object-storage immutability isn't free. Each compile creates a new artefact; rollbacks require swapping the runtime's pointer + invalidating any cached config in long-running streaming jobs.
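
The rollback and schema-evolution caveats can be sketched together. This is an assumption-laden illustration: `ConfigPointer`, `StreamingJob`, the `schema_version` field, and the version labels are all hypothetical, and the source does not describe Pinterest's rollback mechanism.

```python
import json

SUPPORTED_SCHEMA_VERSIONS = {1}  # what this hypothetical engine build understands


class ConfigPointer:
    """Mutable 'current version' pointer over immutable compiled artefacts."""

    def __init__(self, artifacts: dict):
        self.artifacts = artifacts  # version -> compiled JSON, never mutated
        self.history = []

    @property
    def current(self) -> str:
        return self.history[-1]

    def promote(self, version: str) -> None:
        if version not in self.artifacts:
            raise KeyError(version)
        self.history.append(version)

    def rollback(self) -> str:
        if len(self.history) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.history.pop()
        return self.current


class StreamingJob:
    """Long-running job that caches its config and must notice pointer swaps."""

    def __init__(self, pointer: ConfigPointer):
        self.pointer = pointer
        self.loaded_version = None
        self.config = None

    def refresh(self) -> bool:
        version = self.pointer.current
        if version == self.loaded_version:
            return False  # cached config is still current
        config = json.loads(self.pointer.artifacts[version])
        if config["schema_version"] not in SUPPORTED_SCHEMA_VERSIONS:
            raise ValueError(f"unsupported schema_version in {version}")
        self.config, self.loaded_version = config, version
        return True


artifacts = {
    "v1": json.dumps({"schema_version": 1, "retention_days": 30}),
    "v2": json.dumps({"schema_version": 1, "retention_days": 90}),
}
pointer = ConfigPointer(artifacts)
job = StreamingJob(pointer)

pointer.promote("v1")
job.refresh()
pointer.promote("v2")
job.refresh()

pointer.rollback()        # swap the pointer back to v1
reloaded = job.refresh()  # the job drops its cached v2 config
```

The artefacts themselves are never edited; a rollback only moves the pointer, and it takes effect in a long-running job only once that job calls `refresh()`.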
