Google fuses perception and action in Gemini 3.5 Flash: the agent that watches the screen is no longer a separate model

🕒 Published on Zendoric: June 26, 2026 · 09:00

Google DeepMind natively integrates 'computer use' into Gemini 3.5 Flash, according to its June 24, 2026 announcement. A single model sees, reasons and acts on interfaces. The change looks technical, but it redefines the cost and architecture of enterprise automation.

Some announcements read as product improvements; others, on closer inspection, rearrange the board. Google DeepMind's decision to integrate the 'computer use' capability directly into Gemini 3.5 Flash (signed by its Product Manager Mateo Quiros on June 24, 2026) belongs to the second category. Until now, controlling a graphical interface required a separate model, Gemini 2.5 computer use. With this update, seeing the screen, interpreting it and executing actions is no longer a standalone service; it becomes a native tool of the same model already used at scale for function calling and for grounding with Search and Maps.

The distinction matters. Merging reasoning and visual perception into a single fast, low-cost model like Flash removes the friction of orchestrating two distinct systems, which means less latency, lower cost per task and a simpler technical stack for any team building agents. In practice, this is a model that can click, type, navigate between tabs and fill in forms as a human operator would, but aimed at 'long-horizon' workflows: multi-step, chained processes that today consume hours of work. Google cites two concrete areas: continuous software testing and knowledge work in professional applications.
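To make the "single model sees, reasons and acts" idea concrete, here is a minimal sketch of an observe-decide-act loop in which one model call per step both reads the screen and chooses the next action. Everything in it is an assumption for illustration: the `Action` schema, the `Model` signature and the scripted stand-in model are hypothetical and are not the Gemini API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One UI action proposed by the model (hypothetical schema)."""
    kind: str         # e.g. "click", "type", "navigate", "done"
    target: str = ""  # element description or URL
    text: str = ""    # text to type, if any


# Hypothetical signature: (goal, screenshot, history) -> next Action.
# A single callable stands in for the one model that perceives and decides.
Model = Callable[[str, str, List[Action]], Action]


def run_agent(goal: str, model: Model, observe: Callable[[], str],
              execute: Callable[[Action], None],
              max_steps: int = 20) -> List[Action]:
    """Loop: capture the screen, ask the model for an action, execute it."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = observe()                     # perception input
        action = model(goal, screenshot, history)  # reasoning + decision
        if action.kind == "done":
            break
        execute(action)                            # act on the interface
        history.append(action)
    return history


def scripted_model(goal: str, screenshot: str, history: List[Action]) -> Action:
    """Stand-in for a real model: replays a fixed three-step plan."""
    plan = [
        Action("click", "Name field"),
        Action("type", "Name field", "Ada"),
        Action("click", "Submit"),
    ]
    return plan[len(history)] if len(history) < len(plan) else Action("done")


executed: List[Action] = []
steps = run_agent("Fill in the signup form", scripted_model,
                  lambda: "fake-screenshot", executed.append)
print([a.kind for a in steps])  # ['click', 'type', 'click']
```

In a two-model setup, the `model` step would be split into a perception call and a separate reasoning call, with orchestration code between them; the point of the sketch is that the loop needs only one call per step when both live in the same model. The `max_steps` cap is a design choice for long-horizon tasks, so a confused agent cannot run indefinitely.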

The enterprise angle deserves the most emphasis. A large share of organizations still run critical processes on desktop applications or legacy web systems that never had a modern API. For those environments, screen-vision automation is not an elegant option: it is the only realistic path. The fact that this capability now lives in a cheap, fast model makes Gemini 3.5 Flash a serious candidate to be the core of agents operating over legacy infrastructure without rewriting it.

The security side should not be overlooked, however, and Google addresses it with healthy realism. An agent that browses the real web is exposed to indirect 'prompt injection': malicious instructions hidden in a page or an email that attempt to hijack its behavior. The company says it has applied specific adversarial training for this capability and offers two optional safeguards for enterprise deployments (human confirmation before sensitive actions and automatic stopping when an injection attempt is detected), framed within a 'defense in depth' philosophy. It is the right answer: no training eliminates the risk entirely, and combining it with sandboxing, access control and human oversight is precisely what separates a flashy demo from a deployable system.
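The two safeguards described above can be pictured as a thin gate placed between the model's proposed action and its execution. The sketch below is only an illustration under stated assumptions: the action kinds, the `confirm` callback and especially the keyword-based `looks_like_injection` check are hypothetical placeholders, not Google's detector or API. A real deployment would rely on the provider's own detection signal rather than a phrase list.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical set of action kinds that require a human in the loop.
SENSITIVE_KINDS = {"submit_payment", "delete", "send_email"}

# Toy heuristic only: a stand-in for a real injection detector.
SUSPICIOUS_PHRASES = (
    "ignore previous instructions",
    "disregard your instructions",
)


class InjectionDetected(Exception):
    """Raised to halt the agent when page content looks like an injection."""


@dataclass
class ProposedAction:
    kind: str
    target: str


def looks_like_injection(page_text: str) -> bool:
    """Return True if the page contains one of the suspicious phrases."""
    lowered = page_text.lower()
    return any(phrase in lowered for phrase in SUSPICIOUS_PHRASES)


def guarded_execute(action: ProposedAction, page_text: str,
                    execute: Callable[[ProposedAction], None],
                    confirm: Callable[[ProposedAction], bool]) -> bool:
    """Run `action` only if both safeguards allow it.

    Returns True if the action was executed, False if a human declined it.
    Raises InjectionDetected to stop the whole run on a suspected injection.
    """
    # Safeguard 1: automatic stop on a detected injection attempt.
    if looks_like_injection(page_text):
        raise InjectionDetected(f"suspicious content before {action.kind!r}")
    # Safeguard 2: human confirmation before sensitive actions.
    if action.kind in SENSITIVE_KINDS and not confirm(action):
        return False
    execute(action)
    return True
```

Raising an exception for the injection case, instead of returning a flag, makes the stop hard to ignore: the calling loop has to handle it explicitly. This gate is one layer only; in a defense-in-depth design it would sit alongside sandboxing and access control, not replace them.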

On the competitive front, the move comes after Anthropic introduced computer use in Claude in late 2024 and OpenAI developed analogous capabilities. The difference Google claims is native integration rather than a separate model. If that promise holds up in production, the real leap will be less about what an agent can do and more about the cost and reliability with which it can do it at scale. That, more than any demo, is what will decide real adoption.

Sources & references

blog.google — Google fuses perception and action in Gemini 3.5 Flash: the agent that watches the screen is no longer a separate model