Gemini Powers New Voice‑Controlled Browser Agent Built with Three Advanced Models
An older user may have a smartphone and a broadband connection and still be unable to book a train ticket online. A Gemini‑driven, voice‑controlled browser agent now lets such users navigate sites hands‑free, turning navigation barriers into spoken requests that the agent carries out click by click.
Key Facts
- Key company: Gemini
The prototype, dubbed SAHAY, demonstrates how three distinct Gemini models can be chained together to turn spoken intent into fully‑automated web interactions. According to the project’s author, Sherin Joseph Roy, the system “listens to the user in their language… opens a real Chromium browser, finds the correct website, navigates through the pages, fills forms, clicks buttons, and speaks the results back” (Roy, Mar 16). By supporting Hindi, Malayalam, Tamil, Telugu, English and 24 other languages, SAHAY sidesteps the visual‑only barriers that keep many older users from completing tasks such as booking train tickets or downloading government documents. The agent’s multilingual output is also spoken back to the user, ensuring the entire flow remains hands‑free.
The architecture hinges on a “Planner” agent that runs on Gemini 2.5 Flash and is grounded in live Google Search via the GenAI SDK. Roy explains that the Planner “searches the internet in real time to find the correct website… does not use hard‑coded URLs” (Roy, Mar 16). This design choice addresses a real‑world pain point: Indian government portals frequently change URLs and navigation structures, as seen when the UIDAI Aadhaar download page moved twice in the past year. By re‑searching the target site on each request, the system avoids the brittleness that plagues static scrapers.
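The "search on every request" idea can be illustrated with a minimal Python sketch. It is not Roy's code and does not call Gemini: `search_fn` is a hypothetical stand-in for whatever grounded search call the Planner makes, and `SearchHit`, `resolve_site`, and the example URLs are names invented here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SearchHit:
    """One search result (hypothetical shape, for illustration only)."""
    title: str
    url: str


def resolve_site(task: str, search_fn: Callable[[str], List[SearchHit]]) -> str:
    """Find the target URL at request time instead of reading a stored one.

    search_fn is assumed to return results ordered by relevance; the real
    system would obtain them from a search-grounded model call.
    """
    hits = search_fn(task)  # runs again on every request, so nothing is cached
    if not hits:
        raise LookupError(f"no site found for task: {task!r}")
    return hits[0].url
```

Because the URL is never stored in the code, a portal that moves from one address to another is picked up on the next request as soon as the search results change, with no change to the agent itself.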
Once the Planner has identified the destination, it produces a structured execution plan that includes step‑by‑step instructions, visual cues for what to look for on each page, and flags for any steps that involve sensitive data. A second Gemini agent, running on a separate model optimized for tool use, executes the plan within the Chromium instance, performing clicks, typing, and form submissions. The third agent handles speech synthesis and user confirmations: it pauses before any login, payment, or data‑entry step to ask for explicit permission. As Roy notes, “The user confirms by voice or by clicking a button” (Roy, Mar 16). For password fields or CAPTCHAs, the system hands control back to the user, who completes that step manually.
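The plan structure and the confirmation gate described above might look roughly like the following Python sketch. The field names, the `run_plan` function, and the three callbacks (`act`, `confirm`, `hand_off`) are assumptions made for illustration; the article does not publish SAHAY's actual schema. The sketch also assumes that the run resumes after a manual step and stops when the user declines a sensitive one.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PlanStep:
    """One step of an execution plan (hypothetical schema)."""
    instruction: str         # what to do on this page
    visual_cue: str          # what to look for on the page
    sensitive: bool = False  # login, payment, or data entry: ask first
    manual: bool = False     # password or CAPTCHA: the user does it


def run_plan(
    steps: List[PlanStep],
    act: Callable[[PlanStep], None],       # performs the browser action
    confirm: Callable[[PlanStep], bool],   # asks the user, by voice or button
    hand_off: Callable[[PlanStep], None],  # gives control back to the user
) -> List[str]:
    """Execute steps in order and return a log of what happened."""
    log: List[str] = []
    for step in steps:
        if step.manual:
            hand_off(step)  # the agent never types passwords or solves CAPTCHAs
            log.append(f"handed off: {step.instruction}")
            continue
        if step.sensitive and not confirm(step):
            log.append(f"stopped: {step.instruction}")
            break  # assumption: a declined step ends the run
        act(step)
        log.append(f"done: {step.instruction}")
    return log
```

Passing the three behaviours in as callbacks keeps the gate itself independent of the browser driver and the speech layer, so the same loop can be exercised with simple stubs.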
The broader implications of this three‑agent approach have attracted attention beyond the hackathon. The Register reports that Google is exploring ways to give Gemini direct access to a user’s browser, a capability that would make the kind of real‑time web navigation demonstrated by SAHAY a native feature of the model (The Register). Ars Technica, while noting that Gemini remains “a bad assistant” compared with dedicated tools, acknowledges that the model’s growing competence in multi‑step reasoning is a prerequisite for agents like SAHAY (Ars Technica). Similarly, 9to5Google highlights Google’s recent push to embed AI agents into the Pixel 4’s new Assistant, suggesting that the company sees browser‑level automation as a natural extension of its conversational products (9to5Google).
If the prototype’s promise scales, it could serve the “85% of India’s elderly population” who cannot independently use digital services, a figure cited in Roy’s challenge brief (Roy, Mar 16). Globally, more than 900 million people face comparable barriers, according to the same source. By removing the graphical interface from the equation entirely, SAHAY offers a template for inclusive design that leverages large‑language models not as chatbots but as autonomous agents capable of navigating the ever‑changing web on behalf of users who lack the visual or technical fluency to do so themselves.
Sources
- Dev.to AI Tag