Gemini Tackles Hallucinated Patent Numbers, Overhauls FTS5‑LLM Analysis Pipeline
Photo by Alexandre Debiève on Unsplash
The pipeline was supposed to pair SQLite’s precise FTS5 search with Gemini’s analytical flair, but reports indicate every hypothesis‑driven search returned zero hits, prompting a redesign that now curbs hallucinated patent numbers.
Key Facts
- Key company: Gemini
The redesign stems from a three‑stage pipeline that originally paired Gemini’s hypothesis‑generation engine with SQLite’s FTS5 full‑text search, then fed the retrieved records back to Gemini for analysis. According to the author’s March 21 post on media.patentllm.org, the first stage worked as intended—Gemini could spin out research hypotheses and suggest keyword strings—but the second stage consistently returned zero hits from the 3.5 million‑record patent corpus. The root cause, the author explains, was Gemini’s naïve construction of FTS5 queries: it wrapped entire Boolean expressions in quotes, turning “retrieval AND augmented AND generation” into a literal phrase search that no patent contains, and it produced overly specific multi‑word phrases such as “patent portfolio comparison similarity analysis” that never appear verbatim.
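The quoting bug is easy to reproduce against a toy FTS5 table. The sketch below (table name and sample row are illustrative, not from the author’s corpus) shows how wrapping the whole Boolean expression in double quotes turns it into a phrase search for the literal tokens, which no record contains:

```python
import sqlite3

# Minimal reproduction of the failure mode, using an in-memory FTS5
# table as a stand-in for the 3.5 million-record patent corpus.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE patents USING fts5(title, abstract)")
conn.execute(
    "INSERT INTO patents VALUES (?, ?)",
    ("Retrieval augmented generation system",
     "A method for retrieval augmented generation over patent text."),
)

def count_hits(query: str) -> int:
    # FTS5 query syntax goes on the right-hand side of MATCH.
    return conn.execute(
        "SELECT count(*) FROM patents WHERE patents MATCH ?", (query,)
    ).fetchone()[0]

# Quoted, the expression is a phrase search for the literal token
# sequence 'retrieval and augmented and generation' -> zero hits.
print(count_hits('"retrieval AND augmented AND generation"'))  # 0

# Unquoted, AND is a Boolean operator and all three terms are found.
print(count_hits('retrieval AND augmented AND generation'))    # 1
```

The same mechanism explains why verbatim multi‑word phrases like “patent portfolio comparison similarity analysis” came back empty: a phrase query requires the exact token sequence, not merely the presence of each word.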
When the database returned no matches, Gemini did not flag the empty result set. Instead, it fabricated plausible‑looking patent identifiers, labeling them with an “[Inference]” tag. Sample outputs included “US‑2021/0234567 – ‘System for Retrieval‑Augmented Generation…’” and “US‑2022/0891234 – ‘Neural Network‑Based Patent Analysis…’”. While the tag technically signaled speculation, downstream consumers could easily mistake the numbers for real filings, highlighting a classic hallucination problem that the author says “concrete[ly] demonstrates why LLM‑generated content needs ground‑truth validation.”
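The ground‑truth validation the author calls for can be sketched as a simple membership check: extract anything shaped like a patent identifier from the model’s output and flag it if it is absent from the corpus. The function name, regex, and identifier set below are hypothetical, not the author’s implementation:

```python
import re

# Matches identifiers of the form shown in the sample outputs,
# e.g. US-2021/0234567 (pattern is an assumption for illustration).
PATENT_ID = re.compile(r"US-\d{4}/\d{7}")

def flag_unverified(llm_output: str, known_ids: set) -> list:
    """Return identifiers in the LLM output that are not in the corpus."""
    return [pid for pid in PATENT_ID.findall(llm_output)
            if pid not in known_ids]

output = ('US-2021/0234567 - "System for Retrieval-Augmented Generation" '
          '[Inference]')
print(flag_unverified(output, known_ids={"US-2019/0123456"}))
# -> ['US-2021/0234567']
```

A check like this catches the fabricated numbers even when the “[Inference]” tag is stripped or overlooked downstream.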
To stop the hallucinations, the author implemented three fixes. The first replaces Gemini‑generated keywords with a manual override when the hypothesis payload contains an explicit “fts_keywords” field, ensuring that only syntactically correct FTS5 strings—e.g., “retrieval augmented” OR “RAG” OR “retrieval‑augmented”—are sent to SQLite. The second fix addresses the subtle case‑sensitivity of the OR operator in FTS5: the author notes that “OR must be uppercase and used as an infix operator,” and provides a corrected example that groups terms properly—( “patent portfolio” ) AND ( comparison OR similarity ). The third adjustment adds explicit handling for zero‑result queries, prompting Gemini to acknowledge the lack of matches instead of inventing citations.
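Taken together, the three fixes amount to a thin validation layer between the model and the database. The following sketch wires them up with hypothetical field and function names (only the "fts_keywords" field name comes from the post):

```python
import sqlite3

def search_patents(conn: sqlite3.Connection, hypothesis: dict) -> dict:
    # Fix 1: a manually vetted override beats Gemini-generated keywords.
    query = hypothesis.get("fts_keywords") or hypothesis["generated_keywords"]
    rows = conn.execute(
        "SELECT title FROM patents WHERE patents MATCH ?", (query,)
    ).fetchall()
    if not rows:
        # Fix 3: report the empty result set as a first-class outcome,
        # so the model is prompted to acknowledge it rather than invent.
        return {"status": "no_matches", "query": query, "records": []}
    return {"status": "ok", "query": query, "records": [r[0] for r in rows]}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE patents USING fts5(title)")
conn.execute("INSERT INTO patents VALUES ('Patent portfolio similarity tool')")

# Fix 2: grouped terms with uppercase infix OR, per the corrected example.
hypothesis = {"fts_keywords": '("patent portfolio") AND (comparison OR similarity)'}
print(search_patents(conn, hypothesis)["status"])  # ok
```

The structured `no_matches` result is the key design choice: the empty case becomes an explicit signal in the prompt rather than a silence the model is free to fill.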
The overhaul arrives at a moment when AI‑augmented search tools are gaining commercial traction. The Information recently reported that Perplexity AI enjoys a 60 % gross‑profit margin, underscoring the market appetite for efficient, AI‑driven retrieval systems. Meanwhile, Ars Technica’s coverage of developer burnout around AI coding agents reflects broader industry concerns about the reliability and ergonomics of LLM‑powered pipelines. By tightening the interface between Gemini and SQLite, the author aims to align the precision of traditional database search with the analytical depth of large language models, offering a more trustworthy workflow for patent analysts and R&D teams.
While the fixes are technically straightforward, they illustrate a larger lesson for AI integration projects: the need for domain‑specific validation layers. As the author’s experience shows, even a powerful LLM can produce syntactically invalid queries that derail an entire pipeline, and without safeguards it will fill the gaps with hallucinated data. The revised Gemini‑FTS5 system now reports genuine zero‑result cases and relies on manually vetted keyword strings, a modest but essential step toward reliable AI‑assisted patent research.
Sources
No primary source found (coverage-based)
- Dev.to AI Tag
Reporting based on verified sources and public filings. Sector HQ editorial standards require multi-source attribution.