ChatGPT Health Beats Benchmarks in Structured Triage Test, Nature Medicine Study Shows
ChatGPT Health outperformed benchmark models in a structured triage recommendation test, achieving top scores across all clinical scenarios, according to a study published in Nature Medicine on Feb 23, 2026.
Quick Summary
- ChatGPT Health outperformed benchmark models in a structured triage recommendation test, achieving top scores across all clinical scenarios (Nature Medicine, Feb 23, 2026).
- Key product: ChatGPT Health
ChatGPT Health’s debut in a rigorously controlled triage stress test underscores both the promise and the perils of consumer‑facing AI in medicine. Researchers fed the model 60 clinician‑authored vignettes spanning 21 clinical domains, each presented under 16 factorial conditions, yielding 960 individual recommendations (Ramaswamy et al., Nature Medicine). The system earned the highest scores of the models tested, but its performance followed an inverted‑U curve: it excelled in moderate‑risk scenarios yet faltered at the extremes. In non‑urgent presentations the AI mis‑triaged 35% of cases, while in true emergencies it under‑triaged nearly half (48%). Notably, 52% of gold‑standard emergencies, such as diabetic ketoacidosis and impending respiratory failure, were routed to a 24‑ to 48‑hour evaluation instead of the emergency department, even as classic emergencies like stroke and anaphylaxis were correctly escalated.
The study highlights how contextual cues can sway the model’s judgment. When family members or friends downplayed symptoms, a simulated anchoring bias, the odds of a less urgent recommendation surged (odds ratio 11.7, 95% CI 3.7–36.6). This bias manifested most starkly in edge‑case vignettes, suggesting that ChatGPT Health’s inference engine is sensitive to narrative framing rather than solely to clinical data. Crisis‑intervention safeguards proved similarly erratic: the AI triggered suicide‑prevention messages more often when users described vague ideation than when they disclosed a concrete method, raising concerns about the reliability of safety nets built into consumer health bots.
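For readers unfamiliar with the statistic, an odds ratio of this kind is typically computed from a 2×2 table of outcomes by condition. The sketch below uses purely hypothetical counts (not the study's data) to show the mechanics, including an approximate 95% confidence interval via the standard Woolf (log-scale) method:

```python
import math

# Hypothetical 2x2 table (NOT the study's data): counts of less-urgent
# vs. urgent recommendations, with and without the anchoring cue.
less_urgent_cue, urgent_cue = 40, 20        # cue present
less_urgent_none, urgent_none = 10, 58      # cue absent

# Odds of a less-urgent recommendation in each condition.
odds_cue = less_urgent_cue / urgent_cue
odds_none = less_urgent_none / urgent_none

odds_ratio = odds_cue / odds_none           # ~11.6 with these counts

# Approximate 95% CI on the log scale (Woolf method):
# SE of log(OR) = sqrt of the summed reciprocals of all four cells.
se = math.sqrt(1/less_urgent_cue + 1/urgent_cue
               + 1/less_urgent_none + 1/urgent_none)
ci_low = math.exp(math.log(odds_ratio) - 1.96 * se)
ci_high = math.exp(math.log(odds_ratio) + 1.96 * se)

print(f"OR = {odds_ratio:.1f}, 95% CI {ci_low:.1f}-{ci_high:.1f}")
```

Note how a large point estimate can coexist with a wide confidence interval, as in the study's reported 3.7–36.6 range: the interval width is driven by the small cell counts.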
Demographic variables appeared neutral in the aggregate analysis. Patient race, gender, and reported barriers to care did not produce statistically significant differences in triage outcomes, although the confidence intervals were wide enough to accommodate clinically meaningful disparities (Ramaswamy et al., Nature Medicine). The authors caution that the absence of detectable bias in this sample does not guarantee equity at scale, especially given the limited vignette set and the model’s evolving training data.
The authors conclude that, despite its headline‑grabbing scores, ChatGPT Health is not ready for unrestricted deployment. They call for prospective, real‑world validation studies that monitor safety endpoints, bias metrics, and the consistency of crisis‑intervention triggers before the tool reaches millions of consumers. The findings echo broader industry warnings that benchmark dominance does not automatically translate into clinical reliability, and they underscore the need for rigorous, human‑in‑the‑loop oversight as AI triage systems move from laboratory to bedside.
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.