
AST + LLM hybrid conversion

Pattern

Compose a deterministic AST codemod and a large language model into a single code-conversion pipeline, where:

  1. The AST pass resolves every case it can handle with rule-based transformations.
  2. For cases it can't fully resolve, the AST pass writes in-code annotation comments into the partially-converted source — pointers at the call site, suggested replacements, and links to relevant docs.
  3. The LLM receives the original file, the partially-converted file with AST-authored annotations, any runtime context relevant to the target framework (e.g. rendered DOM, sample data, recorded API traces), and a structured prompt.
  4. The LLM finishes the conversion — reading both the original intent and the AST's hints, and rendering the final file in a well-defined output format (typically wrapped in delimiters like <code></code>).
  5. Downstream, deterministic validators (does it parse? do tests pass? does the test count match the original?) bucket the LLM output by pass-rate for human triage.
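Steps 4 and 5 can be sketched in a few lines. This is a minimal illustration, not code from the source repo: `extractCode` pulls the delimited file out of the LLM response, and `bucketByPassRate` implements the triage buckets named in step 5 (function names and thresholds follow the pass-rate buckets described later in this page).

```typescript
// Extract the converted file from an LLM response wrapped in <code></code>.
function extractCode(llmOutput: string): string | null {
  const m = llmOutput.match(/<code>([\s\S]*?)<\/code>/);
  return m ? m[1].trim() : null;
}

// Bucket a converted test file by its pass rate for human triage.
type Bucket = "pass" | "50-99%" | "20-49%" | "<20%";

function bucketByPassRate(passed: number, total: number): Bucket {
  const rate = total === 0 ? 0 : passed / total;
  if (rate === 1) return "pass";
  if (rate >= 0.5) return "50-99%";
  if (rate >= 0.2) return "20-49%";
  return "<20%";
}
```

Keeping extraction and bucketing deterministic means the only non-deterministic stage in the pipeline is the LLM call itself.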

Forces

  • An LLM alone on a deterministic code-transformation task hallucinates — a 40-60% success rate at Slack's Enzyme→RTL scale with Anthropic Claude 2.1, with wide variance by task complexity (Source: sources/2024-06-19-slack-ai-powered-conversion-from-enzyme-to-react-testing-library).
  • AST alone has a ceiling wherever the correct transformation depends on runtime context the AST cannot see (e.g. rendered DOM for RTL queries, resolved schema for SQL migrations). Slack's AST-only pass topped out at ~45%.
  • Prompt engineering alone hits diminishing returns fast: "our attempts to refine prompts had limited success… possibly perplexing the AI model rather than aiding it." Structural scaffolding outperforms prompt micro-optimisation.
  • At migration scale (15,000+ tests / 10,000+ engineering hours for Slack), manual conversion is infeasible. Automation has to ship something better than 45%, ideally close to developer-quality.

Mechanism

The pattern composes three independently-valid primitives:

  1. AST pre-pass as both conversion layer and hallucination-control layer. Every case it resolves is one less case the LLM can get wrong; every annotation it writes is a structural constraint on the LLM's decoding path.
  2. Runtime-artifact context injection — capture whatever runtime information the target framework depends on (DOM, schema, recorded trace) and inject it into the prompt. This eliminates an entire class of hallucination (guessing about runtime state).
  3. Structured prompt template — three-part (context / tasks / self-evaluate) with explicit delimiters that make the output machine-extractable.
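The third primitive can be made concrete with a small prompt builder. This is an illustrative sketch — `buildPrompt` and its input shape are not from the source repo — but the delimiters (`<code>`, `<codemod>`, `<component>`, `<test_case_title>`, `<dom_tree>`) mirror the tags described in the Slack instantiation below:

```typescript
interface PromptInput {
  original: string;                              // original test file
  astPartial: string;                            // AST-partial with annotations
  domByTest: { title: string; dom: string }[];   // captured DOM per test case
}

// Assemble the three-part (context / tasks / self-evaluate) prompt
// with explicit delimiters so the output is machine-extractable.
function buildPrompt(p: PromptInput): string {
  const doms = p.domByTest
    .map(d => `<component><test_case_title>${d.title}</test_case_title>` +
              `<dom_tree>${d.dom}</dom_tree></component>`)
    .join("\n");
  return [
    "Context:",
    `<code>${p.original}</code>`,
    `<codemod>${p.astPartial}</codemod>`,
    doms,
    "Tasks: finish converting the Enzyme tests to React Testing Library,",
    "resolving every annotation comment left by the codemod.",
    "Self-evaluate: confirm the file parses and the test count matches the original.",
    "Return the final file wrapped in <code></code>.",
  ].join("\n");
}
```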

The result — per Slack — is a 20-30% quality lift over pure-LLM prompting: from 40-60% baseline to ~80% on evaluation files.

Production instantiation: Slack Enzyme-to-RTL

From systems/enzyme-to-rtl-codemod:

  • AST pass handles top-10 Enzyme methods (find, prop, simulate, text, update, instance, props, hostNodes, exists, first), custom Jest matchers, query-selector rewrites. For the remaining 55 methods and context-dependent cases, it writes in-code annotation comments with suggestions and doc links.
  • DOM collection instruments Enzyme's mount and shallow methods to capture per-test-case wrapper.html() keyed by expect.getState().currentTestName. Output is appended to a file consumed by the LLM prompt.
  • LLM request wraps the original test file in <code></code>, the AST-partial in <codemod></codemod>, each captured DOM in <component><test_case_title>...</test_case_title> and <dom_tree>...</dom_tree></component>, plus a three-part structured prompt.
  • LLM is Anthropic Claude 2.1 (2024 era). Output wrapped in <code></code> tags.
  • Downstream validator runs the converted tests, buckets output by pass-rate (fully / 50-99% / 20-49% / <20%) for triage.
  • Operational envelope: 2-5 min per file on-demand; CI-nightly over hundreds of files; ~64% adoption across Slack's RTL migration.
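The DOM-collection idea reduces to wrapping the framework's mount function so each call records the rendered HTML keyed by the running test's name. A minimal sketch, not the actual instrumentation from enzyme-to-rtl-codemod: `instrumentMount` and `getCurrentTestName` are illustrative, with the latter standing in for Jest's `expect.getState().currentTestName`.

```typescript
// Any mount-like function that returns a wrapper exposing html().
type Mount = (el: unknown) => { html(): string };

// Captured DOM per test case; in the real system this is appended
// to a file that the LLM prompt later consumes.
const domByTest = new Map<string, string>();

// Wrap mount so every render is recorded under the current test name.
function instrumentMount(mount: Mount, getCurrentTestName: () => string): Mount {
  return (el) => {
    const wrapper = mount(el);
    domByTest.set(getCurrentTestName(), wrapper.html());
    return wrapper;
  };
}
```

The same wrapper pattern applies to `shallow`, or to any other runtime artifact a different migration needs (resolved schema, recorded trace).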

Consequences

Positive:

  • Hybrid dominates either alone: AST-only ~45%, LLM-only 40-60%, hybrid ~80% on Slack's evaluation set.
  • Debuggable: when the output is wrong, you can tell which stage failed — AST annotations missing / wrong, DOM capture missing, or LLM hallucinated. Each layer is individually auditable.
  • Improvable incrementally: new AST rules reduce the surface the LLM touches, improving quality monotonically.
  • Generalises: test migration is one instantiation. Same pattern applies to API migrations, framework upgrades, language ports, SQL dialect translation, config migration.

Negative:

  • Three systems to build and maintain — AST codemod, DOM collector (or equivalent runtime instrumentor), LLM pipeline + prompt.
  • LLM still hallucinates at the residual ~20%. Human verification remains mandatory; Slack: "the generated code was manually verified by humans before merging into our main repository".
  • Quality ceiling is model-dependent — Slack's 80% was Claude-2.1-era; modern frontier models likely hit the ceiling higher, but the 80% number isn't portable as a universal benchmark.