Goals, not selectors: building a browser runtime for AI agents
Browser automation was built for humans and retrofitted for agents. Building AgentBrowser meant inverting that: an API that speaks goals, a cursor that actually moves, and per-site memory that skips the model on repeat visits.
The thing that pushed me to build this was watching an agent burn a thousand
tokens guessing whether the button was #submit-btn or #submit-btn-v2, get
it wrong because the site shipped a redesign, and have no way to tell me what it
had actually clicked. That is three separate failures (interface, robustness, audit),
and all three trace back to the same root: the tool was designed for a human
reading a screen, and an agent is not that.
Speak goals, return meaning
The first inversion is the API. Instead of click("#submit-btn-v2"), the agent
says what it wants:
await session.act({ goal: 'submit the payment form' });
The runtime resolves that goal (DOM heuristics first, a vision model only when the DOM is ambiguous), and the agent never sees a selector. The reciprocal change is on the way back: instead of an 8,000-token HTML dump, the page comes back as a compact structured summary of what’s actionable, on the order of tens of tokens. The agent reasons over meaning, not markup.
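The resolution order described above can be sketched roughly as follows. This is a minimal illustration, not AgentBrowser's actual code: the `Candidate` shape, the word-overlap scoring, and the `VisionFallback` callback are all assumptions made for the example. It shows the two ideas only: try a cheap DOM-side match first and call the vision model only on a tie or a miss, and hand the agent a short summary instead of markup.

```typescript
// Hypothetical types; none of these names come from AgentBrowser itself.
interface Candidate {
  ref: string;   // opaque handle kept by the runtime, never shown to the agent
  role: string;  // e.g. "button", "link", "textbox"
  label: string; // accessible name or visible text
}

// Stand-in for the vision model: only called when the DOM match is ambiguous.
type VisionFallback = (goal: string, candidates: Candidate[]) => Candidate | undefined;

// Crude DOM heuristic (an assumption): count goal words that appear in the label.
function score(goal: string, c: Candidate): number {
  const words = goal.toLowerCase().split(/\W+/).filter((w) => w.length > 3);
  const label = c.label.toLowerCase();
  return words.filter((w) => label.includes(w)).length;
}

function resolveGoal(
  goal: string,
  candidates: Candidate[],
  vision: VisionFallback,
): Candidate | undefined {
  const ranked = candidates
    .map((c) => ({ c, s: score(goal, c) }))
    .filter((x) => x.s > 0)
    .sort((a, b) => b.s - a.s);

  const [best, second] = ranked;
  // Unambiguous: one candidate strictly beats every other.
  if (best && (!second || best.s > second.s)) {
    return best.c;
  }
  // Tie or no match at all: defer to the (expensive) vision model.
  return vision(goal, candidates);
}

// Compact summary of what is actionable, in place of an HTML dump.
function summarize(candidates: Candidate[]): string {
  return candidates.map((c) => `${c.role}:${c.label}`).join(" | ");
}
```

A real resolver would score on much more than label text (role, visibility, position in a form), but the control flow is the point: the model is the fallback, not the default.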
A cursor that actually moves
The cursor is real and visible, and every input goes through raw CDP mouse and
keyboard events, with pointer movement following a Bézier trajectory, rather
than synthetic DOM click() calls. I went down this road for two reasons. The
honest one is that it makes the agent’s behaviour legible: you watch the pointer
move and you know exactly what it’s about to do. The practical one is that a lot
of the modern web distinguishes synthetic events from real ones, and an
automation that never produces human-shaped input gets a different (worse)
experience than a person does.
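To make the trajectory idea concrete, here is a small sketch of how a curved pointer path might be generated. The function names, the `bend` parameter, and the way control points are placed are assumptions for illustration; the post does not say how AgentBrowser chooses its curves. The code only computes the points and dispatches nothing.

```typescript
interface Point {
  x: number;
  y: number;
}

// Evaluate a cubic Bézier curve at parameter t in [0, 1].
function cubicBezier(p0: Point, p1: Point, p2: Point, p3: Point, t: number): Point {
  const u = 1 - t;
  const a = u * u * u;
  const b = 3 * u * u * t;
  const c = 3 * u * t * t;
  const d = t * t * t;
  return {
    x: a * p0.x + b * p1.x + c * p2.x + d * p3.x,
    y: a * p0.y + b * p1.y + c * p2.y + d * p3.y,
  };
}

// Build a pointer path from `from` to `to` with steps + 1 points.
// `bend` (hypothetical knob) pushes both control points sideways,
// perpendicular to the direction of travel; bend = 0 gives a straight line.
function cursorPath(from: Point, to: Point, steps = 20, bend = 0.2): Point[] {
  const dx = to.x - from.x;
  const dy = to.y - from.y;
  // (-dy, dx) is perpendicular to (dx, dy).
  const nx = -dy * bend;
  const ny = dx * bend;
  const c1: Point = { x: from.x + dx / 3 + nx, y: from.y + dy / 3 + ny };
  const c2: Point = { x: from.x + (2 * dx) / 3 + nx, y: from.y + (2 * dy) / 3 + ny };

  const path: Point[] = [];
  for (let i = 0; i <= steps; i++) {
    path.push(cubicBezier(from, c1, c2, to, i / steps));
  }
  return path;
}
```

Each point would then go out as a raw input event. With Playwright over Chromium, one plausible route is a CDP session and `Input.dispatchMouseEvent` with `type: "mouseMoved"` per point, followed by `mousePressed` and `mouseReleased` at the target. Timing between points and any jitter are left out here.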
Memory is what makes it fast
The change I underestimated was action memory. The first time the agent solves a flow on a site, the resolution (which element satisfied the goal and how it was reached) is persisted. On the next visit the runtime replays the remembered action and skips the model call entirely. Runs that used to re-plan every step collapse to a fraction of the time and tokens once the path is known.
That memory is also shared across domains, so a pattern learned on one checkout helps on the next. It’s the difference between an agent that’s expensive every single run and one that gets cheaper the more it works.
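The replay path can be sketched as a cache in front of the planner. Everything here is hypothetical: the `Resolution` shape, the `Planner` callback standing in for the model call, the key format, and the `forget` method for dropping a stale entry. The sketch covers only the per-site case; it leaves out cross-domain sharing because the mechanism isn't described here, and it keeps memory in a `Map` rather than persisting it.

```typescript
// What gets remembered: which element satisfied the goal and how it was reached.
interface Resolution {
  element: string; // hypothetical stable description of the element
  path: string[];  // hypothetical list of steps taken to reach it
}

// Stand-in for the expensive model-backed planning step.
type Planner = (site: string, goal: string) => Promise<Resolution>;

class ActionMemory {
  private readonly store = new Map<string, Resolution>();
  private readonly planner: Planner;

  constructor(planner: Planner) {
    this.planner = planner;
  }

  // Normalise the goal so trivial wording differences hit the same entry.
  private key(site: string, goal: string): string {
    return `${site}\u0000${goal.trim().toLowerCase()}`;
  }

  async resolve(
    site: string,
    goal: string,
  ): Promise<{ resolution: Resolution; fromMemory: boolean }> {
    const k = this.key(site, goal);
    const remembered = this.store.get(k);
    if (remembered) {
      // Known path: replay it and skip the model entirely.
      return { resolution: remembered, fromMemory: true };
    }
    // First visit: pay for the model once, then remember the answer.
    const fresh = await this.planner(site, goal);
    this.store.set(k, fresh);
    return { resolution: fresh, fromMemory: false };
  }

  // Drop an entry, e.g. after a replay fails because the site changed.
  forget(site: string, goal: string): void {
    this.store.delete(this.key(site, goal));
  }
}
```

The part that matters in practice is what happens when a replay fails: a remembered action has to be verified after it runs, and a miss should fall back to planning rather than surface as an error.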
What I’d tell someone starting this
It’s TypeScript over real Chromium via Playwright. It is provider-agnostic across roughly seventeen LLM backends behind one environment variable (Claude, GPT, Gemini, local Ollama, vLLM, and the rest), because locking an agent runtime to one vendor is a decision you regret in a quarter. It exposes itself as a library, an MCP server, and HTTP/WS/SSE, since “how does the agent connect” is not a question I wanted to answer once and freeze.
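One way to get "one environment variable" switching is a small provider registry. The sketch below is an assumption about the general pattern, not AgentBrowser's configuration: the variable name `AGENT_LLM_PROVIDER`, the `LlmProvider` interface, and the stub providers are all invented for the example. The environment is passed in as a plain record so the function stays testable.

```typescript
// Hypothetical minimal interface every backend adapter would implement.
interface LlmProvider {
  name: string;
  complete(prompt: string): Promise<string>;
}

type ProviderFactory = () => LlmProvider;

const registry = new Map<string, ProviderFactory>();

function registerProvider(name: string, factory: ProviderFactory): void {
  registry.set(name.toLowerCase(), factory);
}

// Pick a backend from a single (hypothetical) environment variable.
function selectProvider(env: Record<string, string | undefined>): LlmProvider {
  const name = (env["AGENT_LLM_PROVIDER"] ?? "").toLowerCase();
  const factory = registry.get(name);
  if (!factory) {
    const known = Array.from(registry.keys()).join(", ");
    throw new Error(`Unknown provider "${name}"; known providers: ${known}`);
  }
  return factory();
}

// Stub adapters standing in for real clients; they only echo the prompt.
registerProvider("ollama", () => ({
  name: "ollama",
  complete: async (prompt) => `[ollama stub] ${prompt}`,
}));
registerProvider("claude", () => ({
  name: "claude",
  complete: async (prompt) => `[claude stub] ${prompt}`,
}));
```

In a Node process the call site would be `selectProvider(process.env)`; the rest of the runtime only ever sees the `LlmProvider` interface.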
The lesson that outlived the code: when you build infrastructure for agents, stop porting human tools. Ask what the agent is actually trying to do, design the smallest interface that expresses that, and spend the saved complexity on verification and memory: the two things a human gets for free and an agent doesn’t.