AST + LLM hybrid conversion¶
Pattern¶
Compose a deterministic AST codemod and a large language model into a single code-conversion pipeline, where:
- The AST pass resolves every case it can handle with rule-based transformations.
- For cases it can't fully resolve, the AST pass writes in-code annotation comments into the partially-converted source — pointers at the call site, suggested replacements, links to relevant docs.
- The LLM receives the original file, the partially-converted file with AST-authored annotations, any runtime context relevant to the target framework (e.g. rendered DOM, sample data, recorded API traces), and a structured prompt.
- The LLM finishes the conversion — reading both the original intent and the AST's hints, rendering the final file in a well-defined output format (typically wrapped in delimiters like `<code></code>`).
- Downstream, deterministic validators (does it parse? do tests pass? does the test count match the original?) bucket the LLM output by pass-rate for human triage.
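The pipeline's control flow can be sketched as below. This is a minimal, hypothetical skeleton in TypeScript: the function names (`astPass`, `bucket`) are ours, the AST stage is stubbed, and only the pass-rate bands come from the Slack write-up.

```typescript
// Hypothetical sketch of the hybrid pipeline's stages; not a real API.

interface AstResult {
  partial: string;        // partially-converted source with annotation comments
  fullyResolved: boolean; // true when no LLM pass is needed
}

// Stage 1: deterministic AST pre-pass (stubbed here; real rule-based
// transforms would rewrite what they can and annotate the rest).
function astPass(source: string): AstResult {
  return { partial: source, fullyResolved: false };
}

// Stage 3: bucket converted output by test pass-rate for human triage,
// using the bands described in the pattern (fully / 50-99% / 20-49% / <20%).
function bucket(passRate: number): string {
  if (passRate === 1) return "fully-converted";
  if (passRate >= 0.5) return "50-99%";
  if (passRate >= 0.2) return "20-49%";
  return "<20%";
}
```

Stage 2 (the LLM call) sits between the two and is deliberately omitted: it is the only non-deterministic step, which is what makes the surrounding deterministic stages auditable.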
Forces¶
- LLM alone on a deterministic code-transformation task hallucinates — 40-60% success rate at Slack's Enzyme→RTL scale with Anthropic Claude 2.1, with wild variance by task complexity (Source: sources/2024-06-19-slack-ai-powered-conversion-from-enzyme-to-react-testing-library).
- AST alone has a ceiling wherever the correct transformation depends on runtime context the AST cannot see (e.g. rendered DOM for RTL queries, resolved schema for SQL migrations). Slack's AST-only pass topped out at ~45%.
- Prompt engineering alone hits diminishing returns fast: "our attempts to refine prompts had limited success… possibly perplexing the AI model rather than aiding it." Structural scaffolding outperforms prompt micro-optimisation.
- At migration scale (15,000+ tests / 10,000+ engineering hours for Slack), manual conversion is infeasible. Automation has to ship something better than 45%, ideally close to developer-quality.
Mechanism¶
The pattern composes three independently-valid primitives:
- AST pre-pass as both conversion layer and hallucination-control layer. Every case it resolves is one less case the LLM can get wrong; every annotation it writes is a structural constraint on the LLM's decoding path.
- Runtime-artifact context injection — capture whatever runtime information the target framework depends on (DOM, schema, recorded trace) and inject it into the prompt. This eliminates an entire class of hallucination (guessing about runtime state).
- Structured prompt template — three-part (context / tasks / self-evaluate) with explicit delimiters that make the output machine-extractable.
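The three-part template can be sketched as a prompt-assembly helper. The delimiter tag names (`<code>`, `<codemod>`, `<component>`, `<test_case_title>`, `<dom_tree>`) follow the Slack write-up; the `buildPrompt` helper and the part wording are illustrative assumptions.

```typescript
// Sketch of the three-part (context / tasks / self-evaluate) prompt.
// Tag names follow Slack's description; the helper itself is hypothetical.

interface DomCapture {
  testCaseTitle: string;
  domTree: string;
}

function buildPrompt(
  original: string,
  astPartial: string,
  doms: DomCapture[],
): string {
  const context = [
    "Part 1 - context:",
    `<code>${original}</code>`,
    `<codemod>${astPartial}</codemod>`,
    ...doms.map(
      (d) =>
        `<component><test_case_title>${d.testCaseTitle}</test_case_title>` +
        `<dom_tree>${d.domTree}</dom_tree></component>`,
    ),
  ].join("\n");
  const tasks =
    "Part 2 - tasks: finish the conversion, following the in-code annotations.";
  const selfEvaluate =
    "Part 3 - self-evaluate, then return the final file wrapped in <code></code>.";
  return [context, tasks, selfEvaluate].join("\n\n");
}
```

The explicit closing delimiter in part 3 is what makes the output machine-extractable downstream.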
The result — per Slack — is a 20-30% quality lift over pure-LLM prompting: from 40-60% baseline to ~80% on evaluation files.
Production instantiation: Slack Enzyme-to-RTL¶
From systems/enzyme-to-rtl-codemod:
- AST pass handles the top-10 Enzyme methods (`find`, `prop`, `simulate`, `text`, `update`, `instance`, `props`, `hostNodes`, `exists`, `first`), custom Jest matchers, and query-selector rewrites. For the remaining 55 methods and context-dependent cases, it writes in-code annotation comments with suggestions and doc links.
- DOM collection instruments Enzyme's `mount` and `shallow` methods to capture per-test-case `wrapper.html()` keyed by `expect.getState().currentTestName`. Output is appended to a file consumed by the LLM prompt.
- LLM request wraps the original test file in `<code></code>`, the AST-partial in `<codemod></codemod>`, each captured DOM in `<component><test_case_title>...</test_case_title>` and `<dom_tree>...</dom_tree></component>`, plus a three-part structured prompt.
- LLM is Anthropic Claude 2.1 (2024 era). Output is wrapped in `<code></code>` tags.
- Downstream validator runs the converted tests and buckets output by pass-rate (fully / 50-99% / 20-49% / <20%) for triage.
- Operational envelope: 2-5 min per file on-demand; CI-nightly over hundreds of files; ~64% adoption across Slack's RTL migration.
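The DOM-collection step can be approximated by wrapping `mount` so that every render also records the current test's HTML. This sketch collects captures in memory rather than appending to a file, and the `currentTestName` accessor stands in for `expect.getState().currentTestName` in a Jest environment; both simplifications are ours.

```typescript
// Sketch of instrumenting Enzyme's mount to capture per-test-case DOM.
// In Slack's system the records are appended to a file the LLM prompt
// consumes; here they go into an in-memory array for illustration.

interface DomRecord {
  testCaseTitle: string;
  domTree: string;
}

const captures: DomRecord[] = [];

function instrumentMount<T extends { html(): string }>(
  realMount: (node: unknown) => T,
  currentTestName: () => string, // e.g. () => expect.getState().currentTestName
): (node: unknown) => T {
  return (node) => {
    const wrapper = realMount(node);
    // Record the rendered HTML keyed by the running test's name.
    captures.push({
      testCaseTitle: currentTestName(),
      domTree: wrapper.html(),
    });
    return wrapper;
  };
}
```

Because the wrapper returns the original `wrapper` unchanged, the instrumented `mount` is a drop-in replacement: existing tests run exactly as before while the capture happens as a side effect.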
Consequences¶
Positive:
- Hybrid dominates either alone: AST-only ~45%, LLM-only 40-60%, hybrid ~80% on Slack's evaluation set.
- Debuggable: when the output is wrong, you can tell which stage failed — AST annotations missing / wrong, DOM capture missing, or LLM hallucinated. Each layer is individually auditable.
- Improvable incrementally: new AST rules reduce the surface the LLM touches, improving quality monotonically.
- Generalises: test migration is one instantiation. Same pattern applies to API migrations, framework upgrades, language ports, SQL dialect translation, config migration.
Negative:
- Three systems to build and maintain — AST codemod, DOM collector (or equivalent runtime instrumentor), LLM pipeline + prompt.
- The LLM still hallucinates on the residual ~20% of cases. Human verification remains mandatory; per Slack, "the generated code was manually verified by humans before merging into our main repository".
- Quality ceiling is model-dependent — Slack's 80% was Claude-2.1-era; modern frontier models likely hit the ceiling higher, but the 80% number isn't portable as a universal benchmark.
Related¶
- concepts/abstract-syntax-tree — the deterministic pre-pass primitive
- concepts/llm-hallucination — the failure mode this pattern mitigates
- concepts/llm-conversion-hallucination-control — the structural problem class
- concepts/dom-context-injection-for-llm — runtime-artifact context injection
- patterns/in-code-annotation-as-llm-guidance — the mechanism by which the AST pass shapes LLM output
- patterns/llm-plus-planner-validation — sibling pattern: LLM output fed into a deterministic validator (PostgreSQL AI index suggestions)
- patterns/llm-output-as-untrusted-input — downstream consequence: validate before merge
- systems/enzyme-to-rtl-codemod — canonical production instantiation