AgentBrowser

Problem

Every browser-automation tool was built for humans first and bolted onto agents later. These tools speak in DOM operations, dump thousands of tokens of HTML, relearn each site on every run, and leave no audit trail. An agent does not read a screen the way a person does, so the interface is the wrong shape for the job.

Architecture

The API takes goals (act({ goal: 'submit the payment form' })) rather than selectors. The runtime resolves each goal with DOM heuristics first and falls back to a vision model only when the DOM is ambiguous. It returns a compact structured summary of what is actionable instead of an 8,000-token HTML dump. Input is sent as raw mouse and keyboard events over the Chrome DevTools Protocol (CDP), with the pointer following a Bézier trajectory, so the cursor is real and visible. Every action is verified with a DOM diff and recorded to a JSONL trace you can replay. A per-site action memory persists how a goal was satisfied and skips the model entirely on repeat visits. It runs over real Chromium, is provider-agnostic across roughly seventeen LLM backends selected by one environment variable, and is exposed as a library, an MCP server, and HTTP/WS/SSE.
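To make the Bézier trajectory concrete, here is a minimal sketch of how a cursor path could be sampled from a cubic Bézier curve and turned into mouse-move events. The names cursorPath, toMouseMoves, and the bend parameter are hypothetical and are not AgentBrowser's actual API; the only real-world detail assumed is that CDP's Input.dispatchMouseEvent accepts a type of 'mouseMoved' with x and y coordinates.

```typescript
// Illustrative sketch only: names and parameters are assumptions, not AgentBrowser's API.
interface Point { x: number; y: number }

// Evaluate a cubic Bezier curve with control points p0..p3 at parameter t in [0, 1].
function cubicBezier(p0: Point, p1: Point, p2: Point, p3: Point, t: number): Point {
  const u = 1 - t;
  const a = u * u * u;
  const b = 3 * u * u * t;
  const c = 3 * u * t * t;
  const d = t * t * t;
  return {
    x: a * p0.x + b * p1.x + c * p2.x + d * p3.x,
    y: a * p0.y + b * p1.y + c * p2.y + d * p3.y,
  };
}

// Sample steps + 1 points from `from` to `to`. The two inner control points sit at
// one third and two thirds of the way along the straight line, pushed sideways by
// `bend` (a fraction of the travel distance) so the path curves instead of jumping.
function cursorPath(from: Point, to: Point, steps: number, bend: number = 0.2): Point[] {
  if (steps < 1) throw new Error('steps must be at least 1');
  const dx = to.x - from.x;
  const dy = to.y - from.y;
  // Vector perpendicular to the direction of travel, scaled by bend.
  const nx = -dy * bend;
  const ny = dx * bend;
  const c1: Point = { x: from.x + dx / 3 + nx, y: from.y + dy / 3 + ny };
  const c2: Point = { x: from.x + (2 * dx) / 3 + nx, y: from.y + (2 * dy) / 3 + ny };
  const path: Point[] = [];
  for (let i = 0; i <= steps; i++) {
    path.push(cubicBezier(from, c1, c2, to, i / steps));
  }
  return path;
}

// Subset of the parameters for CDP's Input.dispatchMouseEvent used by this sketch.
interface MouseMoveParams { type: 'mouseMoved'; x: number; y: number }

// Convert sampled points into one mouse-move event payload per point.
function toMouseMoves(path: Point[]): MouseMoveParams[] {
  return path.map((p) => ({ type: 'mouseMoved' as const, x: p.x, y: p.y }));
}
```

In a real runtime each payload would be sent through a CDP session with a short delay between events; the transport and timing are left out here because the source does not describe them.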

Figure: The action loop with per-site memory.
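The action loop with per-site memory can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not the project's implementation: Runtime, ActionMemory, runGoal, and every field name are hypothetical, the calls are synchronous for brevity where a real runtime would be asynchronous, and dropping a remembered action when it fails verification is an assumed policy that the source does not spell out.

```typescript
// Illustrative sketch only: every name here is an assumption, not AgentBrowser's API.
type Source = 'memory' | 'dom' | 'vision';
interface Action { kind: 'click' | 'type'; target: string; text?: string }
interface TraceEntry { site: string; goal: string; source: Source; action: Action; verified: boolean }

// The pieces the loop depends on, so they can be faked in a test.
interface Runtime {
  resolveFromDom(goal: string): Action | null; // null when the DOM is ambiguous
  resolveWithVision(goal: string): Action;     // model fallback
  performAndVerify(action: Action): boolean;   // true when the DOM diff confirms the effect
}

// Remembers which action satisfied a goal on a given site.
class ActionMemory {
  private readonly store = new Map<string, Action>();
  private key(site: string, goal: string): string { return JSON.stringify([site, goal]); }
  get(site: string, goal: string): Action | undefined { return this.store.get(this.key(site, goal)); }
  remember(site: string, goal: string, action: Action): void { this.store.set(this.key(site, goal), action); }
  forget(site: string, goal: string): void { this.store.delete(this.key(site, goal)); }
}

// Run one goal: try memory, then DOM heuristics, then vision. Each attempt appends
// one JSON line to `trace`, which stands in for the JSONL file.
function runGoal(site: string, goal: string, memory: ActionMemory, runtime: Runtime, trace: string[]): TraceEntry {
  const record = (source: Source, action: Action, verified: boolean): TraceEntry => {
    const entry: TraceEntry = { site, goal, source, action, verified };
    trace.push(JSON.stringify(entry));
    return entry;
  };

  const remembered = memory.get(site, goal);
  if (remembered) {
    // Repeat visit: replay the remembered action without calling any resolver.
    if (runtime.performAndVerify(remembered)) return record('memory', remembered, true);
    // Assumed policy: a remembered action that no longer verifies is dropped.
    record('memory', remembered, false);
    memory.forget(site, goal);
  }

  const fromDom = runtime.resolveFromDom(goal);
  const action = fromDom ?? runtime.resolveWithVision(goal);
  const verified = runtime.performAndVerify(action);
  if (verified) memory.remember(site, goal, action);
  return record(fromDom ? 'dom' : 'vision', action, verified);
}
```

Calling runGoal twice with the same site and goal should resolve through the DOM or vision path the first time and through memory the second, which is the effect the per-site memory is meant to produce.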

Outcome

Once a site’s path is remembered, repeat runs collapse to a fraction of the tokens and wall-clock time, and the trace makes “the agent did something weird” reproducible. Full writeup: Goals, not selectors.