AI-Rx - Your weekly dose of healthcare innovation
Estimated reading time: 5 minutes
TL;DR
OpenAI announced that physician reviewers rated GPT-5.5 Instant's responses higher than physician-written ones. That claim deserves skepticism.
We're likely measuring communication quality, not clinical reasoning. And the recent PrIME-LLM study showed LLMs still fail at differential diagnosis 80-100% of the time. Better explanations don't fix that gap.
The Announcement
OpenAI just published a health update: GPT-5.5 Instant now performs at "frontier" levels.
230 million people use ChatGPT weekly for health questions.
Physicians reviewed responses and rated GPT-5.5 higher than physician-written responses on five criteria: accuracy, clarity, completeness, instruction following, and health decision helpfulness.
It sounds impressive. Let's look closer.
Why This Claim Is Suspicious
The physicians writing responses had unlimited time and internet access.
They weren't rushed. They had resources.
A separate panel of physicians then compared those responses with GPT-5.5's.
GPT-5.5 won.
That result is telling: if expert physicians with unlimited time and tools lose to an LLM, then either the evaluation methodology favors LLMs or we're measuring the wrong thing.
Most likely: we're measuring communication, not clinical reasoning.
What We're Actually Measuring
GPT-5.5 is excellent at sounding authoritative.
It explains concepts clearly. It cites sources. It provides comprehensive information.
That's genuine progress in communication.
But sounding good isn't the same as reasoning well clinically.
A fluent, well-explained answer can still be built on flawed reasoning.
What Research Shows
The PrIME-LLM study evaluated 21 frontier LLMs on clinical reasoning.
Differential diagnosis failure rates: 80-100% across all models.
Final diagnosis failure rates: 9-39% across all models.
This gap shows a structural problem: LLMs collapse prematurely onto final answers without exploring alternatives.
Better communication doesn't fix premature closure.
The Scale Problem
230 million people per week use ChatGPT for health questions.
If the model fails differential diagnosis 80% of the time, but sounds confident and authoritative, you're distributing fluently wrong answers at massive scale.
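To put rough numbers on that, here is a back-of-envelope sketch. It is not a measurement. The 230 million and the 80% come from the figures above; everything else is an assumption for illustration: one health question per weekly user, a hypothetical diagnostic_share (the fraction of questions that actually call for differential-diagnosis reasoning, which nobody has published), and the further assumption that the study's failure rate carries over to consumer questions at all.

```python
# Back-of-envelope estimate, not data. Only WEEKLY_USERS and the 0.80
# failure rate come from the figures cited in this issue.
WEEKLY_USERS = 230_000_000  # weekly health users, per OpenAI's announcement


def weekly_flawed_answers(diagnostic_share: float, failure_rate: float) -> int:
    """Estimate weekly answers built on a failed differential diagnosis.

    diagnostic_share: HYPOTHETICAL fraction of questions that need
        differential-diagnosis reasoning (unknown in reality).
    failure_rate: assumed failure rate on those questions, e.g. 0.80
        for the low end of the 80-100% range.
    Assumes one question per weekly user.
    """
    if not (0.0 <= diagnostic_share <= 1.0 and 0.0 <= failure_rate <= 1.0):
        raise ValueError("both inputs must be fractions between 0 and 1")
    return round(WEEKLY_USERS * diagnostic_share * failure_rate)


if __name__ == "__main__":
    # Try a few hypothetical shares; none of these is a known value.
    for share in (0.01, 0.05, 0.10):
        estimate = weekly_flawed_answers(share, 0.80)
        print(f"diagnostic share {share:.0%}: ~{estimate:,} answers per week")
```

Even if only 1% of questions were diagnostic in nature, these assumptions would put the figure at roughly 1.8 million answers a week. The exact number matters less than the fact that it stays large across plausible inputs.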
Clarity makes the problem worse, not better.
A confused patient might double-check. A patient who reads a clear, well-sourced explanation might trust it without verification.
The Specific Gap
OpenAI claims improvement in "recognizing when urgent care is needed."
But LLMs are pattern-matching systems, not clinical reasoners.
They can be trained on examples of language that signals urgency.
That's not the same as understanding why a situation needs urgent care, or catching the unusual presentation that doesn't fit the pattern.
The Real Question
If OpenAI's own evaluation shows GPT-5.5 outperforming physicians—why does independent research show it failing at differential diagnosis?
Three possibilities:
→ The benchmarks measure different things
→ The benchmarks don't capture real clinical reasoning
→ The claims are overstated
Probably all three.
What Needs to Happen
Better communication is genuine progress.
But it's not progress toward safe clinical reasoning.
OpenAI should acknowledge what the model can do: explain concepts, summarize information, provide context.
And it should acknowledge what the model cannot do: generate proper differential diagnoses, navigate genuine uncertainty, catch unusual presentations.
Right now, that second acknowledgment is missing.
And 230 million weekly users deserve better.
Talk soon,
Bhargav
P.S. This connects directly to the evidence infrastructure questions in The Future of AI in Healthcare - how do we measure what matters clinically, not just what sounds impressive?
Sources
OpenAI. "Improving health intelligence in ChatGPT." May 2026.
Rao AS, Esmail KP, Lee RS, et al. "Large Language Model Performance and Clinical Reasoning Tasks." JAMA Network Open. 2026;9(4):e264003.