Otto — Benchmark Results

Token efficiency

The gap is structural.

Playwright MCP snapshots the entire page on every tool call. Otto scouts selectively and extracts only what's needed.

Task 1: Wikipedia article extraction

Playwright MCP: ~124,000 tokens

Otto: 1,264 tokens

Result: 98× fewer tokens with Otto

Task 2: Form fill + response verification

Playwright MCP: ~124,000+ tokens

Otto: 3,799 tokens

Result: 33× fewer tokens with Otto
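The two headline ratios follow directly from the token counts reported above. A minimal sketch of the arithmetic, assuming the approximate 124,000-token Playwright MCP baseline applies to both tasks (the variable and function names are illustrative, not part of Otto):

```python
# Token counts as reported above; the Playwright MCP figure is an approximate baseline.
PLAYWRIGHT_TOKENS = 124_000

OTTO_TOKENS = {
    "Task 1 (Wikipedia extraction)": 1_264,
    "Task 2 (form fill)": 3_799,
}


def reduction_factor(baseline: int, measured: int) -> float:
    """Return how many times smaller `measured` is than `baseline`."""
    return baseline / measured


ratios = {
    task: reduction_factor(PLAYWRIGHT_TOKENS, tokens)
    for task, tokens in OTTO_TOKENS.items()
}

for task, ratio in ratios.items():
    print(f"{task}: {ratio:.1f}x fewer tokens (rounds to {round(ratio)}x)")
```

Because the Task 2 baseline is given as "124,000+", the 33× figure is best read as a lower bound.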

Run data

Consistent across every run.

Three runs per task, zero retries. Tool call sequences were identical in every run.

Task 1: Fact Lookup

en.wikipedia.org/wiki/Python_(programming_language)

Metric	Run 1	Run 2	Run 3	Mean
Wall-clock	2.5 s	2.0 s	2.0 s	2.2 s
Tool calls	4	4	4	4
Success	Yes	Yes	Yes	3/3
Retries	0	0	0	0
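The Mean column can be reproduced from the three per-run values. A short sketch for the Task 1 wall-clock row (the helper name `summarize` is illustrative):

```python
from statistics import mean

# Task 1 wall-clock seconds for runs 1-3, copied from the table above.
task1_wall_clock = [2.5, 2.0, 2.0]


def summarize(runs: list[float]) -> tuple[float, float]:
    """Return (mean rounded to one decimal, spread between slowest and fastest run)."""
    return round(mean(runs), 1), max(runs) - min(runs)


task1_mean, task1_spread = summarize(task1_wall_clock)
print(f"mean: {task1_mean} s, spread: {task1_spread} s")
```

With only three runs per task, the spread is a more honest indicator of variability than a standard deviation would be.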

Tool call breakdown

1. launch_session — open browser, navigate to URL

2. scout_page_tool — structural overview (summary mode)

3. execute_javascript — extract first paragraph text

4. close_session — release browser

Task 2: Form Fill

httpbin.org/forms/post

Metric	Run 1	Run 2	Run 3	Mean
Wall-clock	8.1 s	8.2 s	8.35 s	8.2 s
Tool calls	11	11	11	11
Success	Yes	Yes	Yes	3/3
Retries	0	0	0	0

Tool call breakdown

1. launch_session — open browser, navigate

2. scout_page_tool — structural overview

3–8. execute_action_tool — type name, tel, email; click size, topping; type comments

9. execute_javascript — set delivery time + submit form

10. execute_javascript — extract echo response

11. close_session — release browser
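The breakdown above can be written out as plain data to confirm that it accounts for all 11 tool calls. This is only a bookkeeping sketch: the tool names come from the list above, the notes paraphrase it, and nothing here calls Otto itself.

```python
from collections import Counter

# (tool name, what the call does). Steps 3-8 are expanded into six
# execute_action_tool entries, one per field or click listed above.
TASK2_CALLS = [
    ("launch_session", "open browser, navigate"),
    ("scout_page_tool", "structural overview"),
    ("execute_action_tool", "type name"),
    ("execute_action_tool", "type tel"),
    ("execute_action_tool", "type email"),
    ("execute_action_tool", "click size"),
    ("execute_action_tool", "click topping"),
    ("execute_action_tool", "type comments"),
    ("execute_javascript", "set delivery time + submit form"),
    ("execute_javascript", "extract echo response"),
    ("close_session", "release browser"),
]

counts = Counter(name for name, _ in TASK2_CALLS)
total_calls = len(TASK2_CALLS)

for step, (name, note) in enumerate(TASK2_CALLS, start=1):
    print(f"{step:>2}. {name}: {note}")
print(f"total: {total_calls} calls, {counts['execute_action_tool']} of them UI actions")
```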

Model comparison

Same tools, different speed.

Token efficiency is model-independent — identical tool responses regardless of model. Wall-clock time is where models differ.

Opus 4.6

Fact lookup: 10.4 s

Form fill: 38.0 s

Sonnet 4.6 (Recommended)

Fact lookup: 2.2 s

Form fill: 8.2 s

Sonnet 4.6 is ~4.7× faster than Opus 4.6.

Sonnet is recommended for automation workflows where speed matters.
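The speedup can be checked per task from the wall-clock times reported in this section. A small sketch (the dictionary names are illustrative):

```python
# Wall-clock seconds per task, as reported in the model comparison above.
OPUS = {"fact_lookup": 10.4, "form_fill": 38.0}
SONNET = {"fact_lookup": 2.2, "form_fill": 8.2}

# Speedup = Opus time divided by Sonnet time for the same task.
speedup = {task: OPUS[task] / SONNET[task] for task in OPUS}

for task, factor in speedup.items():
    print(f"{task}: Sonnet 4.6 is {factor:.1f}x faster")
```

From these rounded inputs, the fact lookup comes out at about 4.7× and the form fill at about 4.6×, so the headline 4.7× figure is an approximation across the two tasks.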

Methodology

How we measured.

Token measurement: Per-turn JSONL analysis of cache_creation_input_tokens deltas in parent session logs. Otto tool response sizes are model-independent — the same payloads are returned regardless of which Claude model processes them.
Wall-clock measurement: session_duration_seconds from close_session MCP tool responses. Measures browser session duration only, excluding model reasoning time and orchestration overhead.
Playwright MCP baseline: ~124,000 tokens per tool call based on Playwright MCP published baselines. Playwright snapshots the full accessibility tree on every interaction.
Model-independent vs model-specific: Token efficiency ratios (98× and 33×) are model-independent — confirmed identical across Opus 4.6 and Sonnet 4.6. Wall-clock times are model-specific: Sonnet completes the same tasks ~4.7× faster.
Benchmark version: v0.2 (February 2026). 3 runs per task, subagent orchestration, zero retries. v0.1 steady-state token data used for efficiency ratios.
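The token measurement described above could be approximated along the following lines. This is a sketch under stated assumptions, not the benchmark's actual script: it assumes each JSONL line is a JSON object with usage data under `message.usage`, and it reads "deltas" as the per-turn values of `cache_creation_input_tokens`. The sample log is made up purely for illustration.

```python
import json
from typing import Iterable


def cache_creation_per_turn(lines: Iterable[str]) -> list[int]:
    """Collect cache_creation_input_tokens from every record that carries usage data.

    Assumption: the usage block sits at record["message"]["usage"];
    adjust the lookup to match the real session log layout.
    """
    per_turn = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        message = record.get("message")
        usage = message.get("usage") if isinstance(message, dict) else None
        if usage is None:
            continue  # record has no usage data, e.g. a user turn
        per_turn.append(int(usage.get("cache_creation_input_tokens", 0)))
    return per_turn


# Hypothetical log: three turns with usage data and one record without.
sample_log = [
    '{"message": {"usage": {"cache_creation_input_tokens": 500}}}',
    '{"type": "user"}',
    '{"message": {"usage": {"cache_creation_input_tokens": 320}}}',
    '{"message": {"usage": {"cache_creation_input_tokens": 180}}}',
]

per_turn = cache_creation_per_turn(sample_log)
total_new_tokens = sum(per_turn)
print(per_turn, total_new_tokens)
```

Attributing each turn's value to the tool response that preceded it would give a per-tool-call token cost, which is the quantity the efficiency ratios compare.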