Salesforce warns: AI chatbots become unreliable as conversation length increases
While users expect AI chatbots to stay sharp throughout a dialogue, a joint Microsoft Research‑Salesforce study finds their answers grow increasingly erratic the longer the conversation lasts, Torbenkopp reports.
Quick Summary
- While users expect AI chatbots to stay sharp throughout a dialogue, a joint Microsoft Research‑Salesforce study finds their answers grow increasingly erratic the longer the conversation lasts, Torbenkopp reports.
- Key company: Salesforce
- Also mentioned: Microsoft Research
The joint Microsoft Research‑Salesforce analysis examined more than 200,000 dialogues across the current generation of large language models—including OpenAI’s GPT‑4.1 and o3, Google’s Gemini 2.5 Pro, Anthropic’s Claude 3.7 Sonnet, DeepSeek R1 and Meta’s Llama 4—to quantify how conversational depth affects performance. While overall task‑solving ability dipped only about 15 percent as exchanges grew longer, the incidence of outright unreliability surged by roughly 112 percent when users broke problems into natural, multi‑turn sequences (Torbenkopp). This divergence points to a systemic brittleness: the models retain raw capability but increasingly fail to apply it consistently as the context expands.
A key failure mode identified by the researchers is “premature generation,” where the model emits a response before the user has fully articulated the query. In practice, the system extrapolates from an incomplete prompt, commits to a conclusion, and then treats that initial answer as a factual anchor for all subsequent turns. Because the model does not revisit or correct the original premise, early errors propagate and amplify throughout the conversation (Torbenkopp). This behavior contrasts sharply with human dialogue, where speakers routinely seek clarification before forming a definitive reply.
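One way teams could guard against premature generation is to gate the model call behind an explicit end-of-query signal. The sketch below is illustrative, not a method from the study: `model_fn`, `gated_reply`, and the `##done##` marker are all hypothetical names, and `model_fn` stands in for any chat-completion call.

```python
# Hypothetical completion marker; any string users would not type naturally works.
DONE_MARKER = "##done##"

def gated_reply(model_fn, turns):
    """Hold back generation until the user marks the query complete.

    `model_fn` is a placeholder for a real chat-completion call; `turns`
    is the list of user messages so far. Until the final turn equals the
    marker, the wrapper returns a prompt for more detail instead of
    letting the model commit to an answer from an incomplete query.
    """
    prompt = "\n".join(t for t in turns if t != DONE_MARKER)
    if turns and turns[-1] == DONE_MARKER:
        return model_fn(prompt)
    return "Got it - tell me more, or send '##done##' when the question is complete."
```

The trade-off is extra friction per query, which may only be worth it for workflows where an early wrong anchor is expensive to unwind.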
Compounding the propagation problem is the phenomenon the authors label “answer bloat.” In multi‑turn interactions, model outputs grew 20 to 300 percent longer than in single‑shot prompts, yet the added length did not translate into higher informational value. Instead, the expanded text contained a higher density of assumptions and hallucinations—fabricated facts presented as true. Crucially, once a hallucinated detail entered the dialogue, the model incorporated it into its internal context, treating it as verified knowledge for later reasoning steps (Torbenkopp). This feedback loop creates a self‑reinforcing illusion of coherence that erodes factual reliability.
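The "answer bloat" effect is straightforward to monitor in production. The helper below is a minimal sketch of one way to do so (the function name and interface are my own, not the study's): it expresses the average multi-turn response length as percent growth over a single-turn baseline, the same framing as the 20-to-300-percent range reported above.

```python
def bloat_ratio(single_turn_len, multi_turn_lens):
    """Percent growth of the average multi-turn response length over a
    single-turn baseline.

    single_turn_len: length (e.g. in tokens or characters) of the answer
                     to the same task posed as one shot.
    multi_turn_lens: lengths of the answers produced across a multi-turn
                     version of the task.
    """
    avg_multi = sum(multi_turn_lens) / len(multi_turn_lens)
    return (avg_multi / single_turn_len - 1.0) * 100.0
```

For example, a 100-token single-shot answer against multi-turn answers of 120 and 180 tokens yields a bloat ratio of 50 percent. Tracking this per conversation could flag dialogues drifting into the high-assumption regime the researchers describe.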
Even the most advanced models equipped with “thinking tokens,” such as DeepSeek R1 and OpenAI’s o3, exhibited the same degradation patterns. The study found no statistically significant mitigation of premature generation or answer bloat among these variants, suggesting that augmenting token budgets alone does not address the underlying context‑management flaw (Torbenkopp). The researchers therefore recommend architectural changes—such as explicit turn‑completion signals or dynamic context pruning—to prevent early commitments from contaminating the dialogue state.
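“Dynamic context pruning” can take many forms; the study does not prescribe one. As a minimal sketch, a simple recency heuristic keeps the system message plus the last few turns, dropping the middle of the history where a stale early commitment could otherwise anchor later answers. The function name and the message format (role/content dicts, as used by common chat APIs) are assumptions.

```python
def prune_context(messages, keep_last=4):
    """Keep the system message plus the most recent turns.

    messages: list of {"role": ..., "content": ...} dicts in order.
    keep_last: how many non-system turns to retain.

    Dropping mid-conversation history is a blunt instrument - it discards
    legitimate facts along with bad anchors - but it illustrates the kind
    of context management the researchers argue is missing.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]
```

A production variant would more likely summarize the pruned span rather than discard it outright.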
The findings arrive against a bullish commercial backdrop for Salesforce. In a separate Bloomberg report, the company’s stock rose 5 percent after it disclosed a quarterly revenue outlook that beat analyst expectations, citing growing customer adoption of its AI‑driven tools (Bloomberg). While the market reaction underscores confidence in Salesforce’s commercial AI offerings, the new reliability data raises questions about the long‑term viability of chatbot‑centric products for complex, multi‑step workflows. If enterprises continue to rely on conversational agents for mission‑critical tasks, the risk of error propagation highlighted by the Microsoft‑Salesforce study could become a decisive factor in product selection.
For developers and product teams, the practical takeaway is clear: longer conversations demand more robust context handling than current large language models provide. Mitigations may include prompting strategies that force the model to wait for explicit user confirmation before answering, or engineering pipelines that periodically reset the dialogue context after each logical sub‑task. Until such safeguards become standard, the promise of seamless, multi‑turn AI assistance remains constrained by the very architectural choices that enable today’s impressive single‑turn performance.
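The second mitigation above, resetting the dialogue context after each logical sub-task, can be sketched in a few lines. This is one possible interpretation, not an implementation from the study; `run_subtasks` and `model_fn` are hypothetical names, with `model_fn` standing in for any chat-completion call.

```python
def run_subtasks(model_fn, subtasks, shared_facts):
    """Run each logical sub-task against a fresh context.

    model_fn:     stand-in for a chat-completion call taking one prompt.
    subtasks:     the user's problem broken into independent steps.
    shared_facts: verified context re-supplied on every call, so that a
                  hallucination produced during one step never enters the
                  prompt for the next.
    """
    results = []
    for task in subtasks:
        # Fresh prompt per step: only vetted facts carry over, never
        # earlier model output.
        prompt = shared_facts + "\n\nTask: " + task
        results.append(model_fn(prompt))
    return results
```

The cost is losing genuine cross-turn context, so in practice teams would curate which intermediate results get promoted into `shared_facts` between steps.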
Sources
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.