The Vision Mode — Screenshot-Based Interaction

Vision mode lets Claude see the page — actual rendered pixels, not the accessibility tree. It's the right tool for layout questions, image content, canvas-based UIs, and visual regression checks. It's the wrong tool for everything else, because it's slower, more expensive, and less reliable than snapshot mode at the things snapshot mode is good at. This lesson covers when to flip the switch, how the screenshot tool flow works, and the hybrid pattern that most real sessions end up using.

The shorthand: snapshot is what users hear; vision is what users see. If your prompt is about either, pick the matching mode. If it's about both, run hybrid.

Two ways to use vision

Vision arrives in two flavours, and the difference matters.

Default-on vision — every interaction includes a screenshot. Set with the --vision flag:

{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest", "--vision"]
    }
  }
}

Best for sessions whose primary task is visual reasoning — comparing pages, reviewing layouts, walking through a canvas-based tool. Every turn pays the image-token cost, so don't leave it on by default.

Opportunistic screenshots — leave the server in snapshot mode and let Claude call the screenshot tool only when needed:

Open https://example.com. Then take a screenshot and tell me whether the cart icon
is visible above the fold.

The assistant uses the snapshot for navigation and form work, calls browser_take_screenshot for the one moment that needs vision, and continues. This is the pattern most teams settle on once they've used both — and it's the pattern this course recommends.

How the screenshot path actually works

The assistant calls browser_take_screenshot (or the equivalent browser_screen_capture-style tool depending on server version).
Playwright captures the page as a PNG — full page or just the viewport, depending on arguments.
The image is returned to the host and attached to the model's context as a multimodal input.
The model can now describe the image, reason about layout, identify visible elements, and — when running with the --vision flag — issue clicks against pixel coordinates rather than accessibility refs.

The tool-call panel shows a thumbnail of the captured image, which is invaluable when debugging "the model says it sees X but I see Y."

Where vision earns its slot

Visual layout assertions. "Is the CTA above the fold on a 1366×768 viewport?" — answerable from a screenshot, opaque from a tree.
Image content. "What products are shown on this page?" — when the page leans on imagery instead of text labels (common in fashion, design, art).
Canvas / WebGL UIs. Figma, Lucidchart, charting libraries that render into <canvas>. The accessibility tree is empty for these regions; pixels are the only signal.
Visual regression. "Does the staging homepage match the production homepage? List visible differences." — a one-shot review job that's faster than standing up a full visual-diff platform.
Design-mockup comparison. "Here is the Figma export. Walk through the live page and flag where the layout deviates." — uploads plus screenshots is a surprisingly effective pairing.

Where vision is overkill

Form filling and navigation. Snapshot mode does this faster, cheaper, and more reliably. Vision adds latency and noise without adding signal.
Reading textual content. The accessibility tree already contains the text, named and structured. Sending pixels for the model to OCR is wasteful.
Anything that needs precise targeting. A ref from the snapshot points to exactly this DOM node. Pixel coordinates can drift across viewports, font sizes, and zoom levels.

Snapshot vs vision at a glance

When to use which mode

Snapshot mode (default)

Compact text payload — cheap on tokens
Stable refs survive re-snapshots
Best for forms, navigation, text reading
Doubles as a free accessibility audit

Vision mode

Full screenshot — expensive on image tokens
Coordinates can drift across viewports
Best for layout, images, canvas UIs
Use opportunistically via screenshot tool

The hybrid pattern most teams land on

You are running a checkout flow. Use the accessibility snapshot for all navigation and
form interactions. Take a screenshot only at the end, on the order confirmation page,
and verify visually that the order number, total, and "thank you" message are visible.

The session uses cheap snapshots throughout the busy parts of the flow, then takes a single screenshot at the moment where pixels actually matter. Cost stays predictable, accuracy stays high. Be explicit about when to switch modes — left to its own devices, the assistant either uses too few screenshots (missing layout issues) or too many (burning context).

Visual regression in one prompt

A genuinely useful one-shot:

Take a screenshot of https://staging.example.com/. Then take a screenshot of
https://www.example.com/. Compare the two and list any visible differences in
layout, content, or styling.

The model returns a structured list — "the staging hero shows a different headline," "the CTA has moved 20px lower," "the trust-badges row is missing on production." It is not as precise as Applitools or Percy, and the comparisons are interpretive rather than pixel-exact. But for ad-hoc review, it's hours faster than a manual walkthrough and dollars cheaper than a dedicated visual-diff platform.

When you find something the AI keeps missing — small font tweaks, anti-aliasing artefacts — that's the boundary where dedicated visual tools earn their keep.

Cost and latency reality check

Image tokens are roughly 5–15× more expensive than text tokens, depending on the model tier. A vision-heavy 30-step session can cost more than ten snapshot-only sessions. Two practical levers:

Crop or screenshot specific elements rather than full pages, when the question is local. "Take a screenshot of just the header" targets the cost.
Cache the comparison. If you're running the same staging-vs-prod check daily, the prior screenshot is reusable; you only re-capture the side that changed.

Treat vision as a precision instrument, not a default mode.

⚠️ Common mistakes

Leaving --vision on for everything. Every single turn pays the image-token cost, including ones that just type into a form. Costs balloon, latency suffers, and the snapshot-based reasoning that actually drives most flows is suppressed. Use opportunistic screenshots unless the entire session is genuinely visual.
Trusting visual diffs to catch logic bugs. Vision mode is great for "this looks different" and bad for "this is functionally wrong." A page that looks like it accepted the form might have rejected it silently. Pair vision checks with snapshot or network assertions for behavioural certainty.
Letting the model click on pixel coordinates when a ref exists. When both modes are available, prefer the ref-based click. Coordinates are fragile; refs survive resizes, font changes, and minor reflow. Be explicit in the prompt: "prefer accessibility refs over coordinates wherever possible."

🎯 Practice task

Run a hybrid session and a visual-diff session. 25 minutes.

Hybrid: prompt the assistant to log into your staging app via the snapshot mode, then take a screenshot at the post-login dashboard and verify visually that the user's avatar, name, and unread-message count are visible. Note in the tool panel which calls were browser_snapshot and which were browser_take_screenshot.
Visual diff: pick two URLs you can legitimately compare — staging vs production, or two branches deployed to preview environments. Ask the assistant to take a screenshot of each and list visible differences. Confirm at least one finding manually before trusting the rest.
Edit your claude_desktop_config.json (or Claude Code config) to add --vision and restart. Re-run the hybrid prompt. Notice the difference: every turn now includes a screenshot, the session feels slower, and the model occasionally clicks via coordinates. Remove the flag, restart, and confirm the cheaper behaviour returns.
Stretch: ask the assistant to "take a screenshot of just the page header" on a complex page. Compare the cost and clarity to a full-page screenshot. This kind of targeted capture is the cheapest way to use vision mode in production workflows.

You now understand both perception modes. The next lesson covers the most-used core tools — navigate, click, type, wait — that drive the bulk of every real session.