Screenshot Loops as Security

I thought Claude's computer use (the new Chrome extension)  was slow (watching it work felt like waiting for a careful driver at a four-way stop). Screenshot, analyze, take action, screenshot again. Repeat until done.

Then Google launched Gemini's computer use last week. I checked the docs, curious if they had some different way to do this. Nope. Same screenshot loop. Take a picture, think, act, take another picture.

So either they both missed an obvious optimization, or the screenshot approach isn't actually the limitation it appears to be.


Perplexity's Comet browser went the other direction. It has direct DOM access: the AI can see the actual structure of the webpage, not just pixels. This should be faster. And according to users, it is. No screenshot delay, no processing lag. Just direct interaction with the page elements.

It's also a security nightmare.

Brave Security found that Comet feeds webpage content directly to its LLM without distinguishing between user instructions and untrusted content from the webpage. An attacker can embed malicious instructions in a Reddit comment or a blog post, and when Comet reads that page, it treats those hidden instructions as legitimate commands. Researchers demonstrated attacks where a single malicious URL could hijack the browser to exfiltrate emails, calendar data, and other sensitive information.


The screenshot loop isn't a bug. It's a security boundary.

When Claude or Gemini take a screenshot, they're creating deliberate distance between the AI and the raw web. They see rendered pixels, not DOM nodes. They can't accidentally read invisible text or execute instructions buried in HTML comments. The "slow" approach is actually a trust boundary: treating everything from the web as potentially hostile until proven otherwise.

This is a pattern we've seen before. ORMs can sometimes feel "too slow" compared to raw SQL, until you think about SQL injection. Type-systems can feel like “writing extra code”, until you waste time chasing down type-related bugs.

It’s the same speed vs safety lesson: the thing that feels slow often turns out to be doing important work you didn't notice.


Most likely, Comet's approach will probably get faster adoption among users who “just want” snappy interactions. Claude and Gemini's approach will probably dominate in enterprise settings (or with still-somewhat-paranoid folks, like me) where one prompt injection attack could cost millions.

We might even get a “standard protocol” to make that screenshot loop faster – that’s often how the web standards have evolved (i.e. formalizing the safety patterns we stumble into).

Anyway, different tradeoffs for different threat models. As it should be.


(but ... do try out some of these alternatives – just don’t judge it too much for taking a screenshot and think "this could be faster," leave it alone, switch tabs, do something else and remind yourself that sometimes slow is a feature, not a bug)