Goals, not selectors: building a browser runtime for AI agents

The thing that pushed me to build this was watching an agent burn a thousand tokens guessing whether the button was #submit-btn or #submit-btn-v2, get it wrong because the site shipped a redesign, and have no way to tell me what it had actually clicked. That is three separate failures (interface, robustness, audit), and all three trace back to the same root: the tool was designed for a human reading a screen, and an agent is not that.

Speak goals, return meaning

The first inversion is the API. Instead of click("#submit-btn-v2"), the agent says what it wants:

await session.act({ goal: 'submit the payment form' });

The runtime resolves that goal itself, trying DOM heuristics first and falling back to a vision model only when the DOM is ambiguous, so the agent never sees a selector. The reciprocal change is on the way back: instead of an 8,000-token HTML dump, the page is returned as a compact structured summary of what’s actionable, on the order of tens of tokens. The agent reasons over meaning, not markup.
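To make the "DOM first, vision only when ambiguous" order concrete, here is a minimal sketch. Every name in it (Candidate, resolveFromDom, resolveGoal, summarize) is hypothetical, the word-overlap scoring is a stand-in for whatever heuristics the runtime really uses, and the vision model is reduced to a callback.

```typescript
// Hypothetical types: none of these names come from the runtime itself.
interface Candidate {
  ref: string;   // opaque handle kept by the runtime, never shown to the agent
  role: string;  // e.g. "button", "link", "textbox"
  label: string; // accessible name or visible text
}

interface Resolution {
  ref: string;
  via: 'dom' | 'vision';
}

// Stand-in for a vision model call; returns the ref it picked.
type VisionFallback = (goal: string, candidates: Candidate[]) => Promise<string>;

// Count how many of the goal's longer words appear in the candidate's label.
function score(goal: string, candidate: Candidate): number {
  const words = goal.toLowerCase().split(/\W+/).filter((w) => w.length > 3);
  const label = candidate.label.toLowerCase();
  return words.filter((w) => label.includes(w)).length;
}

// DOM pass: succeed only when exactly one candidate has the top non-zero score.
function resolveFromDom(goal: string, candidates: Candidate[]): Candidate | null {
  const ranked = candidates
    .map((candidate) => ({ candidate, points: score(goal, candidate) }))
    .sort((a, b) => b.points - a.points);
  if (ranked.length === 0 || ranked[0].points === 0) return null;
  // A tie for first place counts as ambiguous.
  if (ranked.length > 1 && ranked[1].points === ranked[0].points) return null;
  return ranked[0].candidate;
}

// Full resolution: DOM first, vision only when the DOM pass gives up.
async function resolveGoal(
  goal: string,
  candidates: Candidate[],
  askVision: VisionFallback,
): Promise<Resolution> {
  const hit = resolveFromDom(goal, candidates);
  if (hit) return { ref: hit.ref, via: 'dom' };
  return { ref: await askVision(goal, candidates), via: 'vision' };
}

// Compact summary handed to the agent in place of raw HTML.
function summarize(candidates: Candidate[]): string {
  return candidates.map((c) => `${c.role}:${c.label}`).join(' | ');
}
```

The point of the split is that the cheap pass is allowed to say "I don't know": a tie or a zero score returns null rather than a guess, and only then is the expensive call made.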

A cursor that actually moves

The cursor is real and visible, and every input goes through raw CDP mouse and keyboard events, with the pointer following a Bézier trajectory, rather than synthetic DOM click() calls. I went down this road for two reasons. The honest one is that it makes the agent’s behaviour legible: you watch the pointer move and you know exactly what it’s about to do. The practical one is that a lot of the modern web distinguishes synthetic events from real ones, and an automation that never produces human-shaped input gets a different (worse) experience than a person does.
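For a sense of what a Bézier pointer path involves, here is a small sketch that only computes the points. The function names, the step count, and the way the control points are bent sideways are all assumptions for illustration, not the runtime's actual trajectory model.

```typescript
interface Point {
  x: number;
  y: number;
}

// Evaluate a cubic Bézier curve at t in [0, 1] using the Bernstein form.
function cubicBezier(p0: Point, p1: Point, p2: Point, p3: Point, t: number): Point {
  const u = 1 - t;
  const a = u * u * u;
  const b = 3 * u * u * t;
  const c = 3 * u * t * t;
  const d = t * t * t;
  return {
    x: a * p0.x + b * p1.x + c * p2.x + d * p3.x,
    y: a * p0.y + b * p1.y + c * p2.y + d * p3.y,
  };
}

// Build a curved pointer path from `from` to `to`, sampled at steps + 1 points.
// `bend` (hypothetical parameter) pushes the control points off the straight line.
function pointerPath(from: Point, to: Point, steps = 20, bend = 0.2): Point[] {
  const dx = to.x - from.x;
  const dy = to.y - from.y;
  // Control points sit at 1/3 and 2/3 of the way, offset along the perpendicular (-dy, dx).
  const p1: Point = { x: from.x + dx / 3 - dy * bend, y: from.y + dy / 3 + dx * bend };
  const p2: Point = { x: from.x + (2 * dx) / 3 - dy * bend, y: from.y + (2 * dy) / 3 + dx * bend };
  const path: Point[] = [];
  for (let i = 0; i <= steps; i++) {
    path.push(cubicBezier(from, p1, p2, to, i / steps));
  }
  return path;
}
```

Each point would then be sent as one raw mouse-move event, for example through Playwright's page.mouse.move(x, y) or a CDP Input.dispatchMouseEvent of type mouseMoved. Timing between points, which matters as much as the shape, is left out here.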

Memory is what makes it fast

The change I underestimated was action memory. The first time the agent solves a flow on a site, the resolution (which element satisfied the goal and how it was reached) is persisted. On the next visit the runtime replays the remembered action and skips the model call entirely. Repeat runs that used to re-plan every step collapse to a fraction of the time and tokens once the path is known.

[Figure: the action loop. From the second visit on, the runtime replays from memory and skips the model.]

That memory is also shared across domains, so a pattern learned on one checkout helps on the next. It’s the difference between an agent that’s expensive every single run and one that gets cheaper the more it works.
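The replay-before-plan loop described in this section can be sketched in a few lines. This is an assumption-heavy illustration: the names (ActionMemory, RememberedAction, act) are hypothetical, the store is an in-memory Map rather than anything persisted, the key is simply origin plus goal, and cross-domain sharing is not modelled at all.

```typescript
// Hypothetical shape; the post does not specify what exactly is persisted.
interface RememberedAction {
  ref: string;    // which element satisfied the goal
  path: string[]; // how it was reached, e.g. ["scroll", "click"]
}

// Stand-in for the expensive planning step (the model call).
// Synchronous only to keep the sketch short; a real planner would be async.
type Planner = (origin: string, goal: string) => RememberedAction;

class ActionMemory {
  private readonly store = new Map<string, RememberedAction>();

  // Normalise the goal so trivial differences in spacing or case still hit.
  private key(origin: string, goal: string): string {
    return `${origin}::${goal.trim().toLowerCase()}`;
  }

  recall(origin: string, goal: string): RememberedAction | undefined {
    return this.store.get(this.key(origin, goal));
  }

  remember(origin: string, goal: string, action: RememberedAction): void {
    this.store.set(this.key(origin, goal), action);
  }
}

// Replay from memory when possible; otherwise plan once and remember the result.
function act(
  memory: ActionMemory,
  plan: Planner,
  origin: string,
  goal: string,
): { action: RememberedAction; source: 'memory' | 'model' } {
  const known = memory.recall(origin, goal);
  if (known) return { action: known, source: 'memory' };
  const action = plan(origin, goal);
  memory.remember(origin, goal, action);
  return { action, source: 'model' };
}
```

On a first call for a given origin and goal, act reports source 'model'; every later call with the same pair reports 'memory' and never touches the planner, which is where the time and token savings come from.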

What I’d tell someone starting this

It’s TypeScript over real Chromium via Playwright. It is provider-agnostic across roughly seventeen LLM backends (Claude, GPT, Gemini, local Ollama, vLLM, and the rest), selected by one environment variable, because locking an agent runtime to one vendor is a decision you regret in a quarter. It exposes itself as a library, an MCP server, and HTTP/WS/SSE, since “how does the agent connect” is not a question I wanted to answer once and freeze.
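One way to get "one environment variable" switching is a small registry of provider factories behind a common interface. The sketch below is an assumption, not the project's code: the variable name AGENT_LLM_PROVIDER, the default of ollama, and the stub providers are all made up for illustration.

```typescript
// The single interface every backend has to satisfy (hypothetical).
interface Provider {
  name: string;
  complete(prompt: string): Promise<string>;
}

// Stub factory so the sketch runs without any network access or API keys.
function stub(name: string): () => Provider {
  return () => ({
    name,
    complete: async (prompt: string) => `[${name}] ${prompt}`,
  });
}

// Registry of known backends; a real one would construct actual clients here.
const registry = new Map<string, () => Provider>([
  ['claude', stub('claude')],
  ['gpt', stub('gpt')],
  ['gemini', stub('gemini')],
  ['ollama', stub('ollama')],
  ['vllm', stub('vllm')],
]);

// Pick a provider from an environment-like object.
// AGENT_LLM_PROVIDER is a hypothetical variable name.
function selectProvider(env: Record<string, string | undefined>): Provider {
  const name = (env.AGENT_LLM_PROVIDER ?? 'ollama').toLowerCase();
  const factory = registry.get(name);
  if (!factory) {
    const known = Array.from(registry.keys()).join(', ');
    throw new Error(`Unknown provider "${name}"; known providers: ${known}`);
  }
  return factory();
}
```

In a Node process this would be called as selectProvider(process.env). Taking the environment as a parameter keeps the function easy to test and keeps the vendor choice out of the rest of the code.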

The lesson that outlived the code: when you build infrastructure for agents, stop porting human tools. Ask what the agent is actually trying to do, design the smallest interface that expresses that, and spend the saved complexity on verification and memory, the two things a human gets for free and an agent doesn’t.