Every token counts.

Measured on Claude Sonnet 4.6 · February 2026 · Per-turn JSONL token analysis with wall-clock from session duration

98× fewer tokens (Task 1)
33× fewer tokens (Task 2)
2.2 s wall-clock (fact lookup, mean)
100% success rate

The gap is structural.

Playwright MCP snapshots the entire page on every tool call. Otto scouts selectively and extracts only what's needed.

Task 1: Wikipedia article extraction
Playwright: ~124,000 tokens
Otto: 1,264 tokens (98× fewer)

Task 2: Form fill + response verification
Playwright: ~124,000+ tokens
Otto: 3,799 tokens (33× fewer)
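
The reduction factors follow directly from the token totals quoted above. A minimal sketch of the arithmetic, assuming the approximate Playwright figure of 124,000 tokens is used as the baseline for both tasks (the function and variable names are illustrative, not part of Otto):

```python
# Token totals as reported in this section; the Playwright figure is approximate.
PLAYWRIGHT_TOKENS = 124_000

otto_tokens = {
    "Task 1 (Wikipedia article extraction)": 1_264,
    "Task 2 (form fill + response verification)": 3_799,
}


def reduction_factor(baseline: int, measured: int) -> float:
    """Return how many times smaller `measured` is than `baseline`."""
    return baseline / measured


ratios = {
    task: reduction_factor(PLAYWRIGHT_TOKENS, tokens)
    for task, tokens in otto_tokens.items()
}

for task, ratio in ratios.items():
    print(f"{task}: {ratio:.0f}x fewer tokens")
```

Because the Task 2 baseline is given as "124,000+", the 33× figure is best read as a lower bound.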

Consistent across every run.

Three runs per task, zero retries. Tool call sequences were identical in every run.

Task 1: Fact Lookup
en.wikipedia.org/wiki/Python_(programming_language)
              Run 1    Run 2    Run 3    Mean
Wall-clock    2.5 s    2.0 s    2.0 s    2.2 s
Tool calls    4        4        4        4
Success       Yes      Yes      Yes      3/3
Retries       0        0        0        0
Tool call breakdown
1. launch_session — open browser, navigate to URL
2. scout_page_tool — structural overview (summary mode)
3. execute_javascript — extract first paragraph text
4. close_session — release browser
Task 2: Form Fill
httpbin.org/forms/post
              Run 1    Run 2    Run 3    Mean
Wall-clock    8.1 s    8.2 s    8.35 s   8.2 s
Tool calls    11       11       11       11
Success       Yes      Yes      Yes      3/3
Retries       0        0        0        0
Tool call breakdown
1. launch_session — open browser, navigate
2. scout_page_tool — structural overview
3–8. execute_action_tool — type name, tel, email; click size, topping; type comments
9. execute_javascript — set delivery time + submit form
10. execute_javascript — extract echo response
11. close_session — release browser
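
The reported means can be checked against the per-run wall-clock figures above. A short sketch, assuming a plain arithmetic mean rounded to one decimal place (the page does not state the rounding rule):

```python
from statistics import mean

# Per-run wall-clock times in seconds, copied from the two task summaries.
wall_clock_seconds = {
    "Task 1: Fact Lookup": [2.5, 2.0, 2.0],
    "Task 2: Form Fill": [8.1, 8.2, 8.35],
}

# Arithmetic mean per task.
means = {task: mean(runs) for task, runs in wall_clock_seconds.items()}

for task, value in means.items():
    runs = len(wall_clock_seconds[task])
    print(f"{task}: mean {value:.1f} s over {runs} runs")
```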

Same tools, different speed.

Token efficiency is model-independent: Otto returns identical tool responses regardless of model. Wall-clock time is where the models differ.

                            Fact lookup    Form fill
Opus 4.6                    10.4 s         38.0 s
Sonnet 4.6 (recommended)    2.2 s          8.2 s

Sonnet 4.6 is about 4.7× faster than Opus 4.6.

Sonnet is recommended for automation workflows where speed matters.
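
The speed comparison can be reproduced from the four wall-clock figures in this section. A rough sketch, assuming the headline factor is the per-task ratio of Opus time to Sonnet time (the dictionary keys are illustrative):

```python
# Mean wall-clock times in seconds, as reported in this section.
opus = {"fact_lookup": 10.4, "form_fill": 38.0}
sonnet = {"fact_lookup": 2.2, "form_fill": 8.2}

# Speedup of Sonnet over Opus for each task.
speedup = {task: opus[task] / sonnet[task] for task in opus}

for task, factor in speedup.items():
    print(f"{task}: Sonnet is {factor:.1f}x faster")
```

The per-task factors come out at about 4.7 and 4.6, in line with the rounded ~4.7× figure.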

How we measured.

Token measurement
Per-turn JSONL analysis of cache_creation_input_tokens deltas in parent session logs. Otto tool response sizes are model-independent — the same payloads are returned regardless of which Claude model processes them.
Wall-clock measurement
session_duration_seconds from close_session MCP tool responses. Measures browser session duration only, excluding model reasoning time and orchestration overhead.
Playwright MCP baseline
~124,000 tokens per tool call based on Playwright MCP published baselines. Playwright snapshots the full accessibility tree on every interaction.
Model-independent vs model-specific
Token efficiency ratios (98× and 33×) are model-independent — confirmed identical across Opus 4.6 and Sonnet 4.6. Wall-clock times are model-specific: Sonnet completes the same tasks ~4.7× faster.
Benchmark version
v0.2 (February 2026). 3 runs per task, subagent orchestration, zero retries. v0.1 steady-state token data used for efficiency ratios.
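
The token measurement described above could be scripted roughly as follows. This is a sketch under stated assumptions, not the benchmark's actual tooling: it assumes each JSONL line is an object with a `usage` object holding `cache_creation_input_tokens`, and that a per-turn delta is the difference between consecutive values. The record layout, function names, and sample numbers are all hypothetical; if the logs already store per-turn values, summing them directly would be the equivalent step.

```python
import json
from typing import Iterable

FIELD = "cache_creation_input_tokens"


def read_counter(lines: Iterable[str]) -> list[int]:
    """Extract the token counter from each JSONL record that carries one."""
    values = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        usage = record.get("usage", {})  # hypothetical record layout
        if FIELD in usage:
            values.append(int(usage[FIELD]))
    return values


def per_turn_deltas(values: list[int]) -> list[int]:
    """Difference between each turn's counter and the previous turn's."""
    return [curr - prev for prev, curr in zip(values, values[1:])]


# Illustrative records only; these numbers are not benchmark data.
sample_log = [
    '{"turn": 1, "usage": {"cache_creation_input_tokens": 1000}}',
    '{"turn": 2, "usage": {"cache_creation_input_tokens": 1400}}',
    "",
    '{"turn": 3, "usage": {"cache_creation_input_tokens": 1650}}',
]

deltas = per_turn_deltas(read_counter(sample_log))
print(deltas, sum(deltas))
```

For a real session log, `sample_log` would be replaced by the lines of the opened JSONL file.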