Slack — AI-Powered Conversion from Enzyme to React Testing Library at Slack¶
Summary¶
Slack's Frontend Test Frameworks team's retrospective on migrating 15,000+ frontend Enzyme tests to React Testing Library (RTL) as part of a React 18 upgrade — Enzyme has no React 18 adapter (author Wojciech Maj explicitly told the ecosystem "you should consider looking for Enzyme alternative right now"). The migration was framed as 10,000+ potential engineering hours at unchanged stack shape, so the team invested in automation. After sequentially hitting the ceilings of pure-AST transformation (~45% success) and pure-LLM prompting with Anthropic Claude 2.1 (40-60%, wildly inconsistent), they converged on a hybrid AST + LLM pipeline that fed the LLM four distinct context sources — the partially converted code from the AST codemod, the actual rendered-component DOM tree captured by instrumenting Enzyme's render methods, the original test file, and a heavily structured prompt — and lifted selected-file conversion quality to ~80%. The tool ran on-demand (2-5 minutes per file) plus CI-nightly over hundreds of files. The post discloses real adoption numbers (~64% of converted files used the tool) and a concrete productivity-savings datum (22% developer time saved; ~500 of 2,300 test cases auto-passing across 338 files).
The tool was later open-sourced as `@slack/enzyme-to-rtl-codemod`.
This is a borderline Tier-2 ingest: developer-productivity / test-migration rather than distributed-systems internals, but the post canonicalises a reusable architectural primitive — the AST-plus-LLM hybrid conversion pipeline — at real production scale with disclosed numbers, and articulates a structural mechanism (in-code annotations authored by the AST pass as guidance for the LLM) that generalises well beyond test migration to any large-scale code transformation task.
Key takeaways¶
- Scale motivated automation: 15,000+ Enzyme tests ≈ 10,000+ engineering hours. Verbatim: "Our initiative began with a monumental task of converting more than 15,000 Enzyme test cases, which translated to more than 10,000 potential engineering hours. At that scale with that many engineering hours required, it was almost obligatory to optimize and automate that process." Canonical data-point for when an organisation decides migration automation is worth building from scratch vs bulk-manual-conversion. (Source: sources/2024-06-19-slack-ai-powered-conversion-from-enzyme-to-react-testing-library)
- Method-frequency distribution concentrated at the top: 10 methods account for most of the call sites. Slack counted `find` (13,244 calls), `prop` (3,050), `simulate` (2,755), `text` (2,181), `update` (2,147), `instance` (1,549), `props` (1,522), `hostNodes` (1,477), `exists` (1,174), `first` (684) — followed by 55 more methods. The long-tail shape (10 methods dominant, 65 total) is what made rule-based-only conversion infeasible: "With each new transformation rule we tackled, the difficulty seemed to escalate."
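A distribution like this can be approximated with a crude scan. The sketch below is hypothetical (not Slack's tooling): it tallies `.method(` call sites per method name with a regex; a real count would walk the AST to avoid false positives from strings and comments.

```javascript
// Hypothetical sketch: tally Enzyme method call sites across test sources.
// A regex over `.method(` is only an approximation; an AST pass (e.g. via
// jscodeshift) would be needed for an exact count.
function countEnzymeCalls(sources, methods) {
  const counts = Object.fromEntries(methods.map((m) => [m, 0]));
  for (const src of sources) {
    for (const m of methods) {
      const matches = src.match(new RegExp(`\\.${m}\\(`, 'g'));
      counts[m] += matches ? matches.length : 0;
    }
  }
  return counts;
}

const sample = [
  "expect(wrapper.find('Button').exists()).toBe(true);",
  "wrapper.find('Input').simulate('change');",
];
console.log(countEnzymeCalls(sample, ['find', 'simulate', 'exists']));
// { find: 2, simulate: 1, exists: 1 }
```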
- Pure AST transformation hit a ceiling at ~45% auto-convertible. systems/enzyme and RTL use structurally different testing methodologies — Enzyme tests against the React component instance (`wrapper.find(...)`, `.instance()`, `.props()`); RTL tests against the rendered DOM (`getByRole`, `getByTestId`). The right RTL query depends on context the AST cannot see: the actual DOM of the rendered component ("the choice between `getByRole` and `getByTestId` depended on the accessibility roles or test IDs present in the rendered component. However, AST lacks the capability to incorporate such contextual information"). The AST pass was shipped as a useful but limited first-pass tool.
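A minimal illustration of the limitation (hypothetical, not Slack's code): only after rendering do we know whether the target element exposes an accessibility role or merely a test ID, so the query choice is a function of the DOM, not the source AST.

```javascript
// Toy query chooser: given the *rendered* HTML, prefer a role-based query,
// fall back to a test-ID query, and otherwise decline to choose.
// A pure AST pass never has `renderedHtml` in hand, which is the point.
function suggestRtlQuery(renderedHtml, testId) {
  if (/<button\b/.test(renderedHtml)) return "getByRole('button')";
  if (renderedHtml.includes(`data-testid="${testId}"`)) {
    return `getByTestId('${testId}')`;
  }
  return null; // no safe deterministic choice: leave for the LLM or a human
}

console.log(suggestRtlQuery('<button>Save</button>', 'save'));
// getByRole('button')
console.log(suggestRtlQuery('<div data-testid="save">Save</div>', 'save'));
// getByTestId('save')
```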
- Pure-LLM conversion with Claude 2.1 was inconsistent: 40-60% success, with high variance by task complexity. The team collaborated with Slack's DevXP AI team on the integration, but "our attempts to refine prompts had limited success… possibly perplexing the AI model rather than aiding it." Hallucination and erratic output are the named failure modes the hybrid design exists to control.
- The hybrid pipeline won by injecting four distinct context sources into the LLM request. The insight came from watching humans: "Humans benefit from valuable insights taken from various sources, including the rendered React component DOM, React component code (often authored by the same individuals), AST conversions, and extensive experience with frontend technologies." The team fed the LLM: (a) the original Enzyme test file, (b) the partial conversion from the AST codemod (including in-code annotation comments authored by the AST pass pointing at problems), (c) the actual rendered DOM tree per test case (captured by overriding Enzyme's `mount`/`shallow` methods at test time), (d) a three-part structured prompt (context / 10 mandatory tasks + 7 optional / evaluation-and-presentation instructions). Named result: ~80% quality on selected test files — "a notable 20-30% improvement beyond the capabilities of our LLM model out-of-the-box."
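Assembling the request can be pictured as simple concatenation of the four sources. This is our reconstruction; the section labels and demo strings below are illustrative, not Slack's exact prompt format.

```javascript
// Reconstruction sketch: combine the four context sources the post names
// into one LLM request. Labels are illustrative placeholders.
function buildConversionPrompt({ instructions, originalTest, astConverted, domTree }) {
  return [
    '## Conversion instructions',
    instructions,
    '## Original Enzyme test file',
    originalTest,
    '## Partially converted code (AST codemod output, with annotations)',
    astConverted,
    '## Rendered DOM tree per test case',
    domTree,
  ].join('\n\n');
}

const prompt = buildConversionPrompt({
  instructions: 'Convert this Enzyme test to React Testing Library.',
  originalTest: "wrapper.find('Button').simulate('click');",
  astConverted: '// TODO-CODEMOD: convert simulate() using the DOM below',
  domTree: '<button>Save</button>',
});
```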
- DOM collection was mechanised by overriding Enzyme's render methods. Verbatim pattern: use `beforeEach` to capture `expect.getState().currentTestName`, wrap the original `mount`/`shallow`, append the test name + `wrapper.html()` to a DOM-tree file keyed by environment variable, consume the file in the pipeline as the DOM context for that test case. The mechanism captures per-test-case DOM because "each test case might have different setups and properties passed to the component, resulting in varying DOM structures". Canonicalised on the wiki as concepts/dom-context-injection-for-llm.
- AST also acts as a hallucination-control mechanism — not just a conversion layer. The load-bearing architectural insight is the reverse: "Instead of solely relying on prompt engineering, we integrated the partially converted code and suggestions generated by our initial AST-based codemod. The inclusion of AST-converted code in our requests yielded remarkable results. By automating the conversion of simpler cases and providing annotations for all other instances through comments in the converted file, we successfully minimized hallucinations and nonsensical conversions from the LLM." The AST pre-processes the easy cases (taking them off the LLM's table entirely) and for hard cases writes in-code annotation comments that constrain the LLM's attention and output. Canonicalised as patterns/in-code-annotation-as-llm-guidance.
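A toy line-level illustration of the two roles (regex-based; the real codemod works on a proper AST, and the annotation wording here is invented): deterministically rewrite the simple call sites, and annotate everything else with an in-code comment that pins the LLM to the exact unconverted line.

```javascript
// Toy convert-or-annotate pass, not Slack's code.
function convertOrAnnotate(line) {
  const simple = line.match(/wrapper\.find\('([^']+)'\)\.exists\(\)/);
  if (simple) {
    // Easy case: handled by rules, taken off the LLM's table entirely.
    // (Mapping find(...).exists() to queryByTestId is an assumption here.)
    return line.replace(simple[0], `screen.queryByTestId('${simple[1]}') !== null`);
  }
  if (line.includes('wrapper.')) {
    // Hard case: do not guess; emit an annotation for the LLM instead.
    return `// TODO-CODEMOD: convert this Enzyme call using the rendered DOM\n${line}`;
  }
  return line; // already Enzyme-free: pass through untouched
}
```

Usage: `convertOrAnnotate("expect(wrapper.find('save-btn').exists()).toBe(true);")` yields a rewritten RTL assertion, while `convertOrAnnotate('wrapper.instance().reset();')` yields the original line preceded by a `TODO-CODEMOD` annotation comment.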
- Operational envelope was designed for both on-demand and CI-nightly usage. On-demand runs: 2-5 minutes per file (engineers iterate locally). CI nightly: hundreds of files per run, output bucketed by pass-rate (fully-converted, 50-99% passing, 20-49%, <20%) so developers could pick the low-hanging fruit first. The bucketing is load-bearing — "allowed developers to easily identify and use the most effectively converted files."
- Disclosed production outcomes: ~64% adoption (fraction of migrated files that passed through the codemod); selected-file quality benchmarked at 80% auto-converted / 20% manual across 9 representative files (3 easy + 3 medium + 3 complex), graded against human-written RTL output; an at-scale run over 338 files / ~2,300 test cases produced ~500 passing test cases, which the team framed as ~22% developer time saved — the 22% being "only the documented cases where the test case passed", so the true saving is presumably higher. All automated output was human-verified before merge.
- The tool was open-sourced. In October 2024 Slack released `@slack/enzyme-to-rtl-codemod` on npm, in response to external-developer demand. The tool is now a reusable artifact for any team migrating off Enzyme.
Systems / concepts / patterns extracted¶
Systems: systems/enzyme-to-rtl-codemod (Slack's tool, open-sourced), systems/enzyme, systems/react-testing-library, systems/claude-2-1 (the LLM in use at the time), systems/jest (the underlying test runner).
Concepts: concepts/abstract-syntax-tree (extended — canonical role as both conversion primitive AND hallucination-control primitive for LLM pipelines), concepts/llm-hallucination (extended — named in a production code-conversion context), concepts/llm-conversion-hallucination-control (new — the structural problem class: LLMs emit plausible-but-wrong code and therefore need mechanism-level mitigations when used for deterministic-correctness tasks like test conversion), concepts/dom-context-injection-for-llm (new — instrumenting component rendering at test time to capture per-case DOM and feed it to the LLM as disambiguating context).
Patterns: patterns/ast-plus-llm-hybrid-conversion (new — the architectural pattern: AST pass handles deterministic cases and emits annotations; LLM handles the rest constrained by those annotations + DOM + original code + prompt; up to 20-30% quality lift over naive LLM prompting), patterns/in-code-annotation-as-llm-guidance (new — using a deterministic pre-pass to write in-code comments that shape LLM attention, rather than stuffing instructions into the prompt).
Operational numbers¶
| Metric | Value | Notes |
|---|---|---|
| Total Enzyme test cases at Slack | 15,000+ | migration scope |
| Estimated manual engineering cost | 10,000+ hours | motivated build-vs-buy |
| `find` method call sites | 13,244 | top method in codebase |
| Total Enzyme methods in codebase | 65 | long-tail |
| Pure-AST conversion quality | ~45% | ceiling; shipped as first-pass tool |
| Pure-LLM conversion quality | 40-60% | variance by task complexity |
| Hybrid-pipeline conversion quality | ~80% | on selected evaluation files |
| LLM used | Anthropic Claude 2.1 | integrated by Slack DevXP AI |
| On-demand run time | 2-5 min / file | |
| Codemod adoption rate | ~64% | fraction of migrated files |
| CI-nightly run scope | ~338 files / ~2,300 test cases | |
| Auto-passing test cases | ~500 (~22%) | of the 2,300 examined |
| Developer time saved | ~22% | lower-bound (passing-only) |
Caveats¶
- Claude 2.1 is an older model. Numbers would likely be different today with Claude 3.5 / 3.7 / 4 or GPT-4-class models; the architectural lesson (hybrid AST+LLM, context injection, in-code annotations as guidance) generalises but the 80% quality ceiling is model-dependent.
- No p50/p99 latency disclosure for on-demand runs beyond the 2-5 minute range; no cost-per-conversion disclosed; no LLM-token-usage budget disclosed.
- Evaluation methodology is narrow: 9 files chosen across three complexity buckets for quality grading; 338 files / 2,300 cases for the pass-rate number, but which 338 files (hand-selected? representative?) isn't specified.
- "80% quality on selected files" is not "80% fully-converted pull-ready files." Quality here means per-line accuracy against a human-written rubric (imports, rendering methods, JS/TS logic, Jest assertions); the remaining 20% still needs manual review. The CI pass-rate buckets (fully / 50-99% / 20-49% / <20%) suggest most files land in partial-conversion territory.
- Human verification before merge was mandatory. The post is explicit: "the generated code was manually verified by humans before merging into our main repository" — the tool is a productivity multiplier, not a closed-loop auto-committer.
- "Only an LLM was capable" of consuming the heterogeneous context pile (prompt + DOM + test code + React code + test logs + linter logs + AST-converted code) — this framing is true in 2024 but worth re-examining as new tools emerge.
- Scope-adjacent, not distributed-systems proper. This wiki's bias is toward distributed-systems internals; this post is about a developer-productivity tool built with an LLM. Ingested on borderline-case grounds because the hybrid architectural pattern is reusable and the production numbers are real.
Source¶
- Original: https://slack.engineering/balancing-old-tricks-with-new-feats-ai-powered-conversion-from-enzyme-to-react-testing-library-at-slack/
- Raw markdown: raw/slack/2024-06-19-ai-powered-conversion-from-enzyme-to-react-testing-library-b32d8d29.md
- HN discussion: news.ycombinator.com/item?id=40726648
- Open-source tool: `@slack/enzyme-to-rtl-codemod`