
PATTERN Cited by 1 source

Agent-driven headless browser

Shape

Give a coding agent a full browser as a first-class tool, usually a headless Chrome driven via CDP or Playwright, and let the agent verify its own front-end changes by operating the page directly rather than by taking and comparing screenshots: inspecting the DOM, reading JavaScript state, triggering events, reading console output, and watching network traffic.

Why it exists

Screenshot-iterating agents work for simple pages but fail on:

  • Dynamic state invisible to the rasteriser (validation state, disabled buttons, open menus).
  • JavaScript errors logged to console but not visually surfaced.
  • Network state (WebSocket / XHR) irrelevant to pixel output but critical for real-time apps.
  • Live-reload-style frameworks (Phoenix LiveView, Phoenix Channels) where the pixel output is a consequence of correct server-push state.

Giving the agent the same programmatic surface a developer would use in Chrome DevTools closes those gaps.
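
As a rough illustration of the signals listed above, the sketch below collects console errors, uncaught page errors, WebSocket URLs, and a button's disabled state through a Playwright-style page object. The function names collect_signals and run_with_playwright, and the choice of which signals to return, are assumptions made for this example; nothing here is taken from Phoenix.new or any other cited implementation.

```python
from typing import Any, Dict, List


def collect_signals(page: Any, url: str, button_selector: str) -> Dict[str, Any]:
    """Gather non-visual signals from a Playwright-style `page` object."""
    console_errors: List[str] = []
    page_errors: List[str] = []
    websocket_urls: List[str] = []

    def on_console(msg: Any) -> None:
        # Keep only error-level console messages.
        if msg.type == "error":
            console_errors.append(msg.text)

    # Register listeners before navigating so early events are not missed.
    page.on("console", on_console)
    page.on("pageerror", lambda err: page_errors.append(str(err)))
    page.on("websocket", lambda ws: websocket_urls.append(ws.url))

    page.goto(url)

    return {
        "button_disabled": page.locator(button_selector).is_disabled(),
        "ready_state": page.evaluate("() => document.readyState"),
        "console_errors": console_errors,
        "page_errors": page_errors,
        "websocket_urls": websocket_urls,
    }


def run_with_playwright(url: str, button_selector: str) -> Dict[str, Any]:
    """Run the probe against a real headless Chromium (requires Playwright)."""
    # Imported lazily so collect_signals stays usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            return collect_signals(browser.new_page(), url, button_selector)
        finally:
            browser.close()
```

None of the returned fields can be recovered from a screenshot alone, which is the gap this pattern is meant to close. An agent would typically call something like run_with_playwright against its own dev server after each change and read the resulting dictionary.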

Canonical instances

Colocated (browser lives in the agent's VM)

Phoenix.new (Fly.io, 2025-06-20) — every session VM ships a full Chrome the agent drives. From the post: "The Phoenix.new agent uses that browser 'headlessly' to check its own front-end changes and interact with the app. Because it's a full browser, instead of trying to iterate on screenshots, the agent sees real page content and JavaScript state – with or without a human present." The UI simultaneously exposes the browser as a live preview for the human to watch.

Proxied (browser lives in a platform endpoint)

Cloudflare MoltWorker / Browser Rendering (2026-01-29) via patterns/cdp-proxy-for-headless-browser — the browser endpoint is a tenant-scoped service the agent hits over CDP-over-network. Same signal surface, different deployment shape: multi-tenant at the platform level; session-local at the tenant level.
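
To make "CDP-over-network" concrete: CDP commands are JSON messages carrying an id, a method, and params, and replies echo the id. The sketch below builds such frames and pairs replies with their requests. The CdpSession class is a hypothetical helper written for this example; the WebSocket transport and the platform's endpoint URL are deliberately left out, since they depend on the deployment.

```python
import itertools
import json
from typing import Any, Dict, Optional


class CdpSession:
    """Builds CDP command frames and pairs replies with requests by id."""

    def __init__(self) -> None:
        self._ids = itertools.count(1)
        self._pending: Dict[int, str] = {}

    def command(self, method: str, params: Optional[Dict[str, Any]] = None) -> str:
        """Return the JSON frame to send over the (omitted) WebSocket."""
        msg_id = next(self._ids)
        self._pending[msg_id] = method
        return json.dumps({"id": msg_id, "method": method, "params": params or {}})

    def resolve(self, raw: str) -> Optional[Dict[str, Any]]:
        """Match a received frame to its request; events and unknown ids yield None."""
        msg = json.loads(raw)
        if "id" not in msg:
            return None  # an event such as Network.requestWillBeSent, not a reply
        method = self._pending.pop(msg["id"], None)
        if method is None:
            return None
        return {"method": method, "result": msg.get("result"), "error": msg.get("error")}
```

For example, session.command("Runtime.evaluate", {"expression": "document.title", "returnByValue": True}) produces a frame that asks the browser to evaluate an expression and return the value inline. The same frames work whether the browser is colocated or behind a proxy; only the transport changes.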

MCP-wrapped (browser exposed through an MCP tool server)

Playwright MCP / browser-mcp variants — same CDP substrate surfaced as a narrower MCP tool interface. Agent calls high-level actions ("click selector X", "fill input Y") rather than dropping into raw CDP.
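
The narrower interface can be pictured as an allowlist that maps a few named actions onto page methods and rejects everything else. The sketch below is a hypothetical dispatcher, not the tool schema of Playwright MCP or any other real server; the action names and the result shape are assumptions. The page methods it calls (click, fill, inner_text) follow Playwright's Python page API.

```python
from typing import Any, Dict, Tuple

# Hypothetical action names mapped to (page method, required argument names).
ALLOWED_ACTIONS: Dict[str, Tuple[str, Tuple[str, ...]]] = {
    "click": ("click", ("selector",)),
    "fill": ("fill", ("selector", "value")),
    "read_text": ("inner_text", ("selector",)),
}


def dispatch(page: Any, action: str, args: Dict[str, Any]) -> Dict[str, Any]:
    """Run one allowlisted action against a Playwright-style page object."""
    spec = ALLOWED_ACTIONS.get(action)
    if spec is None:
        # Anything outside the allowlist is refused rather than forwarded.
        return {"ok": False, "error": f"action not allowed: {action}"}

    method_name, required = spec
    missing = [name for name in required if name not in args]
    if missing:
        return {"ok": False, "error": "missing arguments: " + ", ".join(missing)}

    result = getattr(page, method_name)(*(args[name] for name in required))
    return {"ok": True, "result": result}
```

The design choice here is that the agent never sees raw CDP: a request such as dispatch(page, "evaluate", {...}) is refused because "evaluate" is not in the table, which is what "narrower" means in practice.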

Trade-offs

Axis                  | Colocated browser         | Proxied browser                 | MCP-wrapped
Latency to first byte | Intra-VM (microseconds)   | Network (~10s of ms)            | Network + MCP roundtrip
Tenant isolation      | Per-session VM            | Platform handles                | Per-client session
Resource cost         | Chrome per session VM     | Shared Chrome fleet             | Varies
Interface width       | Full CDP                  | Full CDP                        | Narrower (allowlisted actions)
Best fit              | Coding agent in cloud IDE | Ephemeral scraping / extraction | Cross-vendor agent frameworks

Context-window cost considerations

A full DOM dump can cost far more context than the few facts the agent actually needs. On a rich LiveView or React app, a full page DOM is easily tens of kilobytes of markup, which translates to thousands of tokens. Real agent implementations sample ("just the attribute of this element" or "just the matches for this selector") rather than dumping entire documents. The 2025-06-20 post doesn't disclose Phoenix.new's specific sampling strategy.
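
One way to picture selective extraction, as a minimal sketch only: pull the attributes of the elements that match a tag and optional id, and hand the agent that small result instead of the document. The example uses Python's standard-library HTML parser on a static string as a stand-in for a live page; AttributeSampler and sample_attributes are names invented for this illustration, and this is not Phoenix.new's strategy, which the post does not describe.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class AttributeSampler(HTMLParser):
    """Collects the attributes of elements matching a tag and optional id."""

    def __init__(self, tag: str, element_id: Optional[str] = None) -> None:
        super().__init__()
        self.tag = tag
        self.element_id = element_id
        self.matches: List[Dict[str, Optional[str]]] = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        attr_map = dict(attrs)  # boolean attributes such as `disabled` map to None
        if self.element_id is not None and attr_map.get("id") != self.element_id:
            return
        self.matches.append(attr_map)


def sample_attributes(
    html: str, tag: str, element_id: Optional[str] = None
) -> List[Dict[str, Optional[str]]]:
    """Return only the attribute maps of matching elements, not the whole DOM."""
    sampler = AttributeSampler(tag, element_id)
    sampler.feed(html)
    sampler.close()
    return sampler.matches


# A small form padded with filler markup to imitate a large page.
PAGE = (
    '<form id="signup"><input id="email" name="email" aria-invalid="true">'
    '<button id="submit" type="submit" disabled>Save</button></form>'
    + "<p>filler</p>" * 500
)
```

Calling sample_attributes(PAGE, "button", "submit") returns a single small dictionary describing the button, while PAGE itself is several kilobytes. Against a live browser the same idea would more likely be expressed as a targeted locator or evaluate call than as a parse of dumped HTML.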

Implementation ingredients

  • Browser runtime — usually Chromium or Chrome in headless mode, though Phoenix.new's UI exposes the browser window as a visible preview too.
  • CDP client — raw CDP, Playwright, or a narrower MCP wrapper.
  • Agent tooling layer — the prompt-chain that teaches the agent which CDP operations are worth using and when.
  • Context-window-aware sampling — selective DOM / state extraction rather than full-page dumps.

Caveats

  • Not a substitute for visual-regression testing. Pixel-level visual diffs (chart rendering, typography) still want screenshots or dedicated image-diff tools.
  • Context cost is non-trivial. Bad sampling can blow context windows on a single page.
  • Network and auth posture inherited from the browser's host. A colocated browser has the VM's reach; a proxied browser has the platform endpoint's reach. Credentials in the loop must match the deployment shape.
