Token efficiency
The gap is structural.
Playwright MCP snapshots the entire page on every tool call. Otto scouts selectively and extracts only what's needed.
Task 1 — Wikipedia article extraction
Task 2 — Form fill + response verification
Run data
Consistent across every run.
Three runs per task, zero retries. Tool call sequences were identical in every run.
Task 1: Fact Lookup
en.wikipedia.org/wiki/Python_(programming_language)
| Metric | Run 1 | Run 2 | Run 3 | Mean |
| Wall-clock | 2.5 s | 2.0 s | 2.0 s | 2.2 s |
| Tool calls | 4 | 4 | 4 | 4 |
| Success | Yes | Yes | Yes | 3/3 |
| Retries | 0 | 0 | 0 | 0 |
Tool call breakdown
1. launch_session — open browser, navigate to URL
2. scout_page_tool — structural overview (summary mode)
3. execute_javascript — extract first paragraph text
4. close_session — release browser
Task 2: Form Fill
httpbin.org/forms/post
| Metric | Run 1 | Run 2 | Run 3 | Mean |
| Wall-clock | 8.1 s | 8.2 s | 8.35 s | 8.2 s |
| Tool calls | 11 | 11 | 11 | 11 |
| Success | Yes | Yes | Yes | 3/3 |
| Retries | 0 | 0 | 0 | 0 |
Tool call breakdown
1. launch_session — open browser, navigate
2. scout_page_tool — structural overview
3–8. execute_action_tool — type name, tel, email; click size, topping; type comments
9. execute_javascript — set delivery time + submit form
10. execute_javascript — extract echo response
11. close_session — release browser
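The Mean column and the consistency claim can be checked directly against the per-run figures in the two tables. The following is a minimal Python sketch; the dictionary layout and the `summarize` helper are illustrative names chosen here, not part of Otto or of the benchmark harness.

```python
from statistics import mean

# Per-run figures copied from the two tables above.
# The dictionary layout is an assumption made for this sketch.
runs = {
    "Task 1: Fact Lookup": {
        "wall_clock_s": [2.5, 2.0, 2.0],
        "tool_calls": [4, 4, 4],
        "success": [True, True, True],
        "retries": [0, 0, 0],
    },
    "Task 2: Form Fill": {
        "wall_clock_s": [8.1, 8.2, 8.35],
        "tool_calls": [11, 11, 11],
        "success": [True, True, True],
        "retries": [0, 0, 0],
    },
}


def summarize(task):
    """Reduce one task's runs to the values shown in the Mean column."""
    return {
        # Rounded to one decimal place, as in the tables.
        "mean_wall_clock_s": round(mean(task["wall_clock_s"]), 1),
        # True when every run used the same number of tool calls.
        "tool_calls_consistent": len(set(task["tool_calls"])) == 1,
        "success": f"{sum(task['success'])}/{len(task['success'])}",
        "retries": sum(task["retries"]),
    }


summary = {name: summarize(task) for name, task in runs.items()}

for name, stats in summary.items():
    print(name, stats)
```

Run as written, the sketch reproduces the 2.2 s and 8.2 s means and the 3/3 success figures reported above.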
Model comparison
Same tools, different speed.
Token efficiency is model-independent — identical tool responses regardless of model. Wall-clock time is where models differ.
Sonnet is recommended for automation workflows where speed matters.
Methodology
How we measured.
- Token measurement
- Per-turn JSONL analysis of cache_creation_input_tokens deltas in parent session logs. Otto tool response sizes are model-independent — the same payloads are returned regardless of which Claude model processes them.
- Wall-clock measurement
- session_duration_seconds from close_session MCP tool responses. Measures browser session duration only, excluding model reasoning time and orchestration overhead.
- Playwright MCP baseline
- ~124,000 tokens per tool call based on Playwright MCP published baselines. Playwright snapshots the full accessibility tree on every interaction.
- Model-independent vs model-specific
- Token efficiency ratios (98× and 33×) are model-independent — confirmed identical across Opus 4.6 and Sonnet 4.6. Wall-clock times are model-specific: Sonnet completes the same tasks ~4.7× faster.
- Benchmark version
- v0.2 (February 2026). 3 runs per task, subagent orchestration, zero retries. v0.1 steady-state token data used for efficiency ratios.
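The per-turn token analysis described under "Token measurement" could be sketched roughly as follows. This is one plausible reading, not the benchmark's actual script: only the `cache_creation_input_tokens` field name comes from the methodology above. The `message.usage` nesting, the helper names, and the sample values are all assumptions made for illustration.

```python
import json
from typing import Iterable, List


def read_counts(lines: Iterable[str],
                field: str = "cache_creation_input_tokens") -> List[int]:
    """Collect the token counter from each JSONL record that carries it."""
    counts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed layout: usage counters nested under message.usage.
        usage = record.get("message", {}).get("usage", {})
        if field in usage:
            counts.append(int(usage[field]))
    return counts


def per_turn_deltas(counts: List[int]) -> List[int]:
    """Difference between each turn's counter and the previous turn's."""
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


# Made-up sample records, not benchmark data.
sample_log = [
    '{"message": {"usage": {"cache_creation_input_tokens": 1000}}}',
    '{"type": "user"}',  # records without usage data are skipped
    '{"message": {"usage": {"cache_creation_input_tokens": 1450}}}',
    '{"message": {"usage": {"cache_creation_input_tokens": 2100}}}',
]

counts = read_counts(sample_log)
deltas = per_turn_deltas(counts)
print(deltas)
```

With a real session log, `sample_log` would be replaced by the lines of the JSONL file, and the nesting in `read_counts` adjusted to match whatever structure the log actually uses.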