
Imagine telling an AI: “Go find the best flight from Manila to Tokyo tomorrow, book it, enter my details, and email me the confirmation.” It sounds futuristic — but Google’s latest release edges the AI world closer to this reality. Their new Gemini 2.5 Computer Use model can interact with web pages almost exactly as a human does — clicking, typing, selecting, scrolling, filling forms — and doing so visually and contextually.
Released in preview via Google’s blog and developer channels, the Gemini 2.5 Computer Use model is a variant of Gemini 2.5 designed to let AI agents directly interact with user interfaces (UIs). [blog.google]
Here’s how it works: the model runs in an iterative loop. The client sends the user’s goal, a screenshot of the current page, and the recent action history; the model responds with a UI action such as a click, a keystroke sequence, or a scroll; the client executes that action in the browser, captures a fresh screenshot, and sends it back, repeating until the task is done.
In short: instead of relying on back-end APIs or structured data endpoints, Gemini 2.5 can “see and act” on websites much as a human user does.
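Here is a minimal sketch of what that perceive-and-act loop might look like on the client side, assuming Playwright for browser control. The UIAction shape and the request_next_action helper are hypothetical placeholders for the actual Gemini 2.5 Computer Use API call and its response format, which this sketch does not reproduce.

```python
# Minimal client-side "see and act" loop (sketch).
# Assumptions: Playwright drives the browser; request_next_action() is a
# hypothetical stand-in for the real Gemini 2.5 Computer Use API call.
from dataclasses import dataclass
from playwright.sync_api import sync_playwright


@dataclass
class UIAction:
    kind: str        # e.g. "click", "type", "scroll", "done" (illustrative names)
    x: int = 0       # viewport coordinates for click/scroll
    y: int = 0
    text: str = ""   # text to type, if any


def request_next_action(goal: str, screenshot: bytes, history: list[UIAction]) -> UIAction:
    """Hypothetical helper: send the goal, screenshot, and action history to the
    model and parse its reply into a UIAction. The real API surface differs."""
    raise NotImplementedError("wire this to the Gemini Computer Use preview API")


def run_agent(goal: str, start_url: str, max_steps: int = 20) -> None:
    history: list[UIAction] = []
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        page.goto(start_url)
        for _ in range(max_steps):
            shot = page.screenshot()                 # what the model "sees"
            action = request_next_action(goal, shot, history)
            if action.kind == "done":                # model reports completion
                break
            if action.kind == "click":
                page.mouse.click(action.x, action.y)
            elif action.kind == "type":
                page.keyboard.type(action.text)
            elif action.kind == "scroll":
                page.mouse.wheel(0, action.y)
            page.wait_for_load_state("networkidle")  # let the page settle
            history.append(action)
```

The design mirrors how a person works a page: look, act, look again. Everything the model knows about the site comes from screenshots, not from a structured API.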
Many digital tasks—especially on third-party sites, legacy tools, or consumer services—lack open APIs. To automate such tasks, systems typically resort to screen scrapers, brittle automation scripts, or custom integrations. Gemini 2.5’s UI-level approach helps bridge that gap by making the UI itself the “API.”
To act well on a page, the model must understand visual layout, labeling, affordances, and context. For example: is that button “Next” or “Submit,” or is that link actually a menu? The model needs to interpret visual cues, text context, and prior state. That’s nontrivial.
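One way a client could make that ambiguity concrete is to cross-check a proposed click against the page’s accessibility tree before executing it: does the target point actually land on a visible button with the expected accessible name? The helper below is a hypothetical illustration using Playwright locators; it is not part of Gemini’s API.

```python
# Hypothetical cross-check: does the model's proposed click point actually fall
# on the button it claims to target (e.g. "Next" vs. "Submit")? Uses Playwright.
from playwright.sync_api import Page


def click_matches_label(page: Page, x: int, y: int, expected_label: str) -> bool:
    """Return True if (x, y) lies inside a visible button whose accessible
    name matches expected_label."""
    button = page.get_by_role("button", name=expected_label)
    if button.count() == 0:
        return False                      # no such button on the page
    box = button.first.bounding_box()     # {"x", "y", "width", "height"} or None
    if box is None:
        return False                      # element exists but is not rendered
    return (box["x"] <= x <= box["x"] + box["width"]
            and box["y"] <= y <= box["y"] + box["height"])
```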
One recent benchmark, WebGames, tests AI agents on interactive web tasks; even top systems achieved only ~43% success versus ~95% for humans. That underscores persistent gaps in handling ambiguity, dynamic elements, asynchronous loading, hidden elements, and complex workflows.
So while Gemini 2.5 is a big leap, it’s not omnipotent yet.
Illustrative, practical domains where this capability could shine include travel booking and form filling (like the flight example above), customer support, e-commerce workflows, lead generation, and accessibility assistance. For businesses, that opens up new automation across support, sales, operations, and much more.
Empowering an AI to drive web pages also introduces risks. The key ones: malicious page content can try to hijack the agent through prompt injection, the agent can take unintended or irreversible actions such as purchases or deletions, and it necessarily handles sensitive personal data like logins and payment details.
Google has noted that safety guardrails are in place for the model. [The Verge] But as capability grows, the need for audits, red teaming, and monitoring becomes critical.
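A common guardrail pattern for this class of agents, regardless of vendor, is a human-in-the-loop confirmation policy: routine actions run automatically, while high-stakes steps (purchases, final submissions, deletions) wait for explicit approval. The sketch below illustrates that pattern only; it is not how Google’s built-in safety checks are implemented.

```python
# Hypothetical client-side guardrail: pause for human confirmation before
# executing high-stakes actions. Keyword list and policy are illustrative only.
HIGH_RISK_KEYWORDS = ("buy", "purchase", "pay", "place order", "delete", "confirm booking")


def needs_confirmation(action_kind: str, target_label: str) -> bool:
    """Flag clicks whose target label suggests an irreversible or costly step."""
    label = target_label.lower()
    return action_kind == "click" and any(k in label for k in HIGH_RISK_KEYWORDS)


def execute_with_guardrail(action_kind: str, target_label: str, do_action) -> bool:
    """Run do_action() only if the step is low-risk or a human approves it."""
    if needs_confirmation(action_kind, target_label):
        answer = input(f"Agent wants to click '{target_label}'. Allow? [y/N] ")
        if answer.strip().lower() != "y":
            return False                  # blocked by the human in the loop
    do_action()
    return True
```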
A relevant academic study, “Big Help or Big Brother?”, explores how generative AI browser assistants can track and profile users and personalize responses based on their data (sometimes excessively) with little oversight. That analysis is a good companion read for weighing the benefits against the risks.
While AI agents have been a trending theme (OpenAI, Anthropic, Google’s earlier agent experiments), Gemini 2.5’s capability differs in that it doesn’t just orchestrate APIs—it operates interfaces. Other systems tend to focus on higher-level orchestration or macro-level automation.
Comparable technologies include Anthropic’s computer-use capability for Claude, OpenAI’s agent tooling, traditional RPA platforms, and browser-automation frameworks such as Selenium and Playwright. But none (yet) match the fidelity of visual UI interaction that Gemini claims.
Also, Google’s prior work embedding Gemini in user-facing contexts (Search, Workspace, Chrome) is a foundation. For instance, Google recently expanded its AI-powered Search to more languages.
Google’s Gemini 2.5 isn’t just another AI upgrade—it’s a shift in how machines interact with the digital world. By enabling an AI to browse, click, and complete tasks like a human, we’re stepping into an era where interfaces are no longer just for users—they’re for intelligent agents too.
For businesses, developers, and tech leaders, this opens up new frontiers in automation, accessibility, and user experience design. For the rest of us? It means a future where your AI doesn’t just suggest what to do—it does it for you.
As we embrace these browser-savvy bots, we also need to keep one eye on ethics, security, and control. But one thing’s clear: the age of AI that can “use” the internet is here—and it’s already browsing.