Salesforce finds AI chatbots grow unreliable as conversations lengthen
A joint Microsoft Research–Salesforce study of more than 200,000 AI conversations finds that chatbot answers grow increasingly unreliable the longer a dialogue lasts, Torbenkopp reports.
Quick Summary
- A joint Microsoft Research–Salesforce study of more than 200,000 AI conversations finds chatbot answers grow increasingly unreliable the longer a dialogue lasts.
- Key company: Salesforce
- Also mentioned: Microsoft Research
The study, conducted jointly by Microsoft Research and Salesforce, examined more than 200,000 interactions across the latest generation of large‑language models, including OpenAI’s GPT‑4.1 and o3, Google’s Gemini 2.5 Pro, Anthropic’s Claude 3.7 Sonnet, DeepSeek R1, and Meta’s Llama 4. By splitting the data into single‑turn queries versus multi‑turn dialogues, the researchers quantified two divergent trends: while overall task‑completion accuracy fell modestly (about 15 percent) as conversations lengthened, the rate of outright unreliability surged by roughly 112 percent. In practical terms, a model that could answer a straightforward request correctly in isolation became increasingly prone to hallucinations, contradictions, or overly verbose replies after just a handful of exchanges.
A key mechanism identified by the authors is what they term “premature generation.” In longer sessions the models often emit a response before the user has finished articulating the full context, effectively committing to a conclusion on incomplete information. That initial answer then becomes the de facto anchor for subsequent turns, even if it was flawed. The study shows that the system rarely revisits or corrects the original premise; instead, it compounds the error, leading to a cascade of misinformation that the model treats as established fact within the ongoing dialogue.
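One workaround consistent with the premature‑generation finding is to buffer partial requirements on the application side and issue a single, fully specified prompt instead of letting the model answer turn by turn. The sketch below is illustrative only: the `ask` callable and the bullet‑list prompt format are assumptions, not the study’s protocol.

```python
class DeferredPrompt:
    """Buffer requirement fragments; call the model once, on the full spec.

    A minimal sketch assuming `ask` is some model-call function
    (str -> str) supplied by the caller -- a hypothetical stand-in,
    not a real client API.
    """

    def __init__(self, ask):
        self.ask = ask
        self.shards = []

    def add(self, fragment):
        # Accumulate partial context instead of sending it immediately,
        # so the model never commits to an answer on incomplete information.
        self.shards.append(fragment)

    def finalize(self):
        # One fully specified request replaces a drip-fed multi-turn dialogue.
        prompt = "Complete request:\n" + "\n".join(f"- {s}" for s in self.shards)
        return self.ask(prompt)
```

The trade-off is interactivity: the user loses incremental feedback, but the model sees the whole task at once rather than anchoring on its first partial answer.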
The phenomenon of “answer bloat” compounds the reliability problem. Across the multi‑step conversations, the average response length grew by 20 to 300 percent compared with single‑turn answers. The researchers found that the extra verbiage was not synonymous with richer content; rather, longer replies contained a higher density of assumptions and fabricated details. These hallucinations were subsequently incorporated into the model’s internal representation of the conversation, reinforcing a feedback loop in which invented facts are treated as verified knowledge for the remainder of the session.
Even the most advanced models equipped with “thinking tokens”—such as the o3 variant from OpenAI and DeepSeek R1—were not immune. While these tokens are intended to give the model a structured internal reasoning phase before producing output, the study observed that the same patterns of premature generation and answer bloat persisted. The implication is that augmentations aimed at improving chain‑of‑thought reasoning do not, on their own, resolve the degradation of answer quality over extended interactions.
The findings arrive at a moment when Salesforce’s stock has risen 5 percent on optimism surrounding its AI‑driven revenue outlook, as reported by Bloomberg. However, the Microsoft‑Salesforce analysis warns that enterprises relying on conversational AI for complex, multi‑step workflows may need to redesign their interaction patterns—perhaps by limiting turn counts, inserting explicit context checkpoints, or employing external verification layers—to mitigate the steep rise in unreliability documented in the study.
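The redesign patterns mentioned above (turn limits and explicit context checkpoints) can be sketched as a thin wrapper around a chat loop. This is a minimal illustration under stated assumptions: the `ask` callable, the message format, and the consolidation strategy are all hypothetical, not anything prescribed by the study.

```python
class CheckpointedChat:
    """Cap the turn count and periodically consolidate history into one
    fresh prompt ("context checkpoint") to limit multi-turn drift.

    `ask` is an assumed model-call function (list-of-message-dicts -> str),
    not a real client API.
    """

    def __init__(self, ask, max_turns=6, checkpoint_every=3):
        self.ask = ask
        self.max_turns = max_turns
        self.checkpoint_every = checkpoint_every
        self.history = []
        self.turns = 0

    def send(self, user_msg):
        if self.turns >= self.max_turns:
            # Turn limit: force a fresh session rather than let quality degrade.
            raise RuntimeError("turn limit reached; start a fresh session")
        self.history.append({"role": "user", "content": user_msg})
        reply = self.ask(self.history)
        self.history.append({"role": "assistant", "content": reply})
        self.turns += 1
        if self.turns % self.checkpoint_every == 0:
            # Context checkpoint: collapse the dialogue so far into a single
            # consolidated message, so later turns restate rather than
            # silently inherit earlier (possibly flawed) assumptions.
            summary = "\n".join(
                f"{m['role']}: {m['content']}" for m in self.history
            )
            self.history = [{"role": "user", "content": "Context so far:\n" + summary}]
        return reply
```

An external verification layer would slot in between `ask` and the history update, checking each reply against a trusted source before it becomes part of the conversation record.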
Sources
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.