Artificial intelligence is getting better at sounding human, but that doesn't mean it understands what it's saying. A recent Nature study has thrown cold water on the hype around medical AI, showing that GPT-5—despite progress in reducing obvious errors—still fails in over 50% of complex clinical reasoning scenarios.
The Problem: Confident But Wrong
AI analyst Rohan Paul brought the findings to wider attention on X, warning that while GPT-5 sounds more confident, its accuracy remains dangerously inconsistent. The takeaway? Fluency isn't the same as comprehension.
The study exposes a troubling paradox: as AI models get better at expressing themselves, it becomes easier to mistake their fluency for intelligence. GPT-5 demonstrated smoother reasoning and fewer outright hallucinations, but researchers found it still produced wrong or misleading answers in more than half of complex medical cases, including multi-step diagnoses, drug interactions, and treatment recommendations.
What's worse, GPT-5 often delivers these wrong answers with total confidence, skipping the cautious hedging that earlier models used. This "confident hallucination" effect can fool even trained evaluators, creating a false sense of credibility that's especially risky in clinical settings.
What GPT-5 Does Well (and Where It Falls Apart)
The study does acknowledge some real improvements:
- Fewer surface-level hallucinations like fake drug names or fabricated studies
- Better coherence in basic medical queries
- More empathetic and readable patient-facing responses
But in high-stakes, real-world decision-making, especially cases that require integrating multiple medical variables, GPT-5 consistently breaks down. The model can mimic reasoning, but it doesn't truly grasp cause and effect in biology or patient care.
Why This Matters
As hospitals and health startups rush to integrate AI into their workflows, the line between decision support and autonomous decision-making is blurring fast. Without proper oversight, these systems risk introducing misleading diagnoses, biased care across demographics, and murky accountability when errors occur.
Researchers are calling for stronger safeguards: independent validation by medical boards, transparent auditing of training data, and clear accountability frameworks. These recommendations align with ongoing efforts by the WHO and European Commission to establish international standards for medical AI safety.
Usman Salis