Gemini Powers Voyance, an AI Agent That Researches the Web by ‘Seeing’ It
Photo by JC Gellidon (unsplash.com/@jcgellidon) on Unsplash
What used to require hand‑coded selectors now happens with a spoken prompt, thanks to Gemini‑powered Voyance, an AI agent that “sees” web pages from screenshots and delivers competitive intelligence in minutes.
Key Facts
- Key company: Gemini
Voyance’s architecture is built around an “ADK‑style” orchestration loop that mirrors Google’s Agent Development Kit but is trimmed to the essentials: planning, navigation, extraction, verification, and reporting. According to the project’s creator, Muhammad Ibtisam Afzal, the loop runs in a single asynchronous process and streams status updates over WebSockets, so the front‑end stays responsive while the agent crawls the web (Afzal, “How We Built Voyance”). The pipeline begins with a natural‑language query, which is handed to Perplexity and Gemini to generate a list of target URLs and the specific data points to collect, such as pricing tiers, feature lists, and target segments. If Perplexity returns no results, Gemini falls back to generating a research plan of its own, covering intent, target sites, and search queries, so the agent never stalls on an empty URL list.
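The planning stage with its fallback can be sketched as follows. This is a minimal illustration, not the project’s actual code: the function and class names are hypothetical, and the external Perplexity and Gemini calls are replaced with stubs (the discovery stub deliberately returns nothing to exercise the fallback path).

```python
from dataclasses import dataclass, field

@dataclass
class ResearchPlan:
    """Hypothetical shape of the plan: intent, target sites, search queries."""
    intent: str
    target_urls: list = field(default_factory=list)
    search_queries: list = field(default_factory=list)

def discover_urls(query: str) -> list:
    """Stand-in for the Perplexity URL-discovery call."""
    return []  # simulate Perplexity returning no results

def generate_fallback_plan(query: str) -> ResearchPlan:
    """Stand-in for Gemini generating a research plan when discovery fails."""
    return ResearchPlan(
        intent=f"competitive research: {query}",
        target_urls=["https://example.com/pricing"],
        search_queries=[f"{query} pricing tiers"],
    )

def plan(query: str) -> ResearchPlan:
    urls = discover_urls(query)
    if urls:
        return ResearchPlan(intent=query, target_urls=urls)
    # Fallback: never stall on an empty URL list
    return generate_fallback_plan(query)

plan_result = plan("Acme Analytics competitors")
```

The key design point is that the planner always returns a usable plan object; downstream stages never need to handle an empty URL list.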
Navigation is performed with Playwright, a browser‑automation framework that drives headless Chromium to load each page and capture a full‑page screenshot. Crucially, Voyance never accesses the DOM or writes site‑specific selectors; the screenshot is the sole input for downstream extraction (Afzal). This “pixel‑only” approach lets the agent handle a wide variety of web architectures—including single‑page applications and sites behind paywalls—without bespoke code.
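The capture step might look like the sketch below, using Playwright’s synchronous Python API. The function name is illustrative, and the import is done lazily so the sketch loads even where Playwright and its browsers are not installed.

```python
def capture_screenshot(url: str, out_path: str) -> str:
    """Load a page in headless Chromium and save a full-page screenshot.

    Playwright is imported lazily so this module can be loaded without
    the dependency installed; the call itself requires `playwright install`.
    """
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for dynamic content
        page.screenshot(path=out_path, full_page=True)
        browser.close()
    return out_path
```

Because only the resulting image leaves this function, nothing downstream ever touches the DOM.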
Extraction follows a two‑tier strategy. First, the system attempts to scrape structured data using Firecrawl, a service that returns JSON fields such as company name, pricing tiers, and key features. When Firecrawl succeeds, its output is used directly because of its speed. If Firecrawl fails—due to dynamic content, rate limits, or other barriers—the pipeline falls back to Gemini 2.0 Flash, which processes the screenshot with a vision prompt that asks for the same fields in JSON format (Afzal). This fallback eliminates any dependency on HTML parsing and demonstrates Gemini’s multimodal capability to “see” and interpret web pages from raw images.
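The two‑tier fallback reduces to a simple try/except around the fast path. In this sketch both services are stubbed (the Firecrawl stub deliberately fails to exercise the vision fallback), and all names and field lists are assumptions for illustration, not the project’s real API.

```python
FIELDS = ["company_name", "pricing_tiers", "key_features"]

class ScrapeError(Exception):
    """Raised when structured scraping is blocked (dynamic content, rate limits)."""

def firecrawl_scrape(url: str) -> dict:
    """Stand-in for the Firecrawl structured-scrape call."""
    raise ScrapeError("dynamic content blocked structured scraping")

def gemini_vision_extract(screenshot: bytes) -> dict:
    """Stand-in for Gemini 2.0 Flash reading the same fields from pixels."""
    return {
        "company_name": "Acme",
        "pricing_tiers": ["$49/mo"],
        "key_features": ["dashboards"],
    }

def extract(url: str, screenshot: bytes) -> dict:
    try:
        return firecrawl_scrape(url)  # fast path: structured JSON
    except ScrapeError:
        return gemini_vision_extract(screenshot)  # pixel-only fallback

record = extract("https://example.com/pricing", b"<png bytes>")
```

Because both tiers return the same JSON shape, the rest of the pipeline is indifferent to which path produced a record.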
To guard against hallucinations, Voyance cross‑checks every extracted claim with Perplexity. For each data point, a verification prompt such as “Company X pricing is $49/month” is sent to Perplexity with a low‑temperature, fact‑checker system prompt. The response is parsed for a simple “yes/accurate” flag, which is then displayed in the UI as a confidence badge (verified, unconfirmed, or low) (Afzal). This verification step is essential for competitive‑intelligence use cases where erroneous numbers could mislead business decisions.
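The badge logic can be sketched as a small reducer over the fact‑checker’s reply. The Perplexity call is stubbed, and the keyword heuristics below are an assumption about how a free‑text verdict might map onto the three badges the article describes.

```python
def perplexity_fact_check(claim: str) -> str:
    """Stand-in for the low-temperature Perplexity fact-checker call."""
    return "Yes, this is accurate as of the latest pricing page."

def badge_for(verdict: str) -> str:
    """Reduce a free-text verdict to one of the UI's confidence badges."""
    text = verdict.lower()
    if "yes" in text or "accurate" in text:
        return "verified"
    if "unclear" in text or "cannot confirm" in text:
        return "unconfirmed"
    return "low"

claim = "Company X pricing is $49/month"
badge = badge_for(perplexity_fact_check(claim))
```

Keeping the verdict parsing separate from the API call makes the mapping easy to test against canned responses.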
The final reporting stage aggregates the verified records and hands them to Gemini for narrative synthesis. Gemini generates a concise briefing tailored to the original query, which is then rendered as spoken audio by ElevenLabs under the “Vera” persona, completing the end‑to‑end voice‑first experience (Afzal). Users can also export the results as sortable tables, CSV, or HTML, providing both immediate auditory insight and data that can be fed into downstream analysis tools. The combination of Gemini’s vision, Perplexity’s verification, and ElevenLabs’ text‑to‑speech creates a seamless pipeline that transforms a spoken prompt into a multi‑modal intelligence report in minutes—something that previously required hand‑coded selectors, custom scrapers, and manual data wrangling.
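The aggregation and export step is plain data plumbing once the records are verified. This sketch stubs the Gemini synthesis call and shows the CSV export with the standard library; the record schema and function names are illustrative assumptions.

```python
import csv
import io

def synthesize_briefing(records: list) -> str:
    """Stand-in for Gemini's narrative synthesis over verified records."""
    names = ", ".join(r["company"] for r in records)
    return f"Briefing: compared {len(records)} vendors ({names})."

def to_csv(records: list) -> str:
    """Export verified records as CSV for downstream analysis tools."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["company", "price", "status"])
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

records = [
    {"company": "Acme", "price": "$49/mo", "status": "verified"},
    {"company": "Beta", "price": "$99/mo", "status": "unconfirmed"},
]
briefing = synthesize_briefing(records)
csv_text = to_csv(records)
```

The briefing string would then be handed to the text‑to‑speech layer, while the CSV serves the export path.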
Sources
No primary source found (coverage-based)
- Dev.to AI Tag
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.