LLMs in medicine: from the USMLE benchmark to the real consulting room

OpenAI's o3 scores 96% on MedQA, a benchmark built from questions of the American medical licensing exam (USMLE). GPT-5 scores 73% on HealthBench. In April 2026, a study by Harvard Medical School and Beth Israel Deaconess Medical Center, which received coverage on NPR, concluded that an OpenAI reasoning model matched or surpassed physicians in diagnostic accuracy on real patient cases. These numbers are real. But there is a gap between benchmarks and clinical practice that the industry has not yet closed, and understanding this gap matters more than any individual score.

What medical benchmarks measure

MedQA is a multiple-choice exam with 4 options per question, based on the USMLE question bank. Newly graduated physicians pass with 65%. Experienced physicians typically score 85-90%. o3 scores 96%.

The criticism of MedQA as a measure of real clinical competence is well founded: multiple-choice questions with 4 alternatives are the most favorable possible environment for language models. The model only needs to identify the most plausible answer among predefined options. In clinical practice, the options are not listed.

HealthBench, created by OpenAI with 262 physicians from 60 countries and 26 specialties, is an attempt to measure what really matters: quality of communication, handling of uncertainty, appropriateness of escalation ("refer to a specialist"), safety in emergency triage, and precise clinical guidance across multiple conversation turns. GPT-5 scores 73%: impressive, but 23 points below the 96% that o3 posts on MedQA. That 23-point difference is a rough measure of the distance between knowing medicine and practicing it in a real context.

The knowledge-practice gap

A systematic review published in JMIR in 2025, covering 39 clinical LLM benchmarks, quantified the problem precisely. Diagnostic accuracy falls from 82% on traditional clinical vignettes to 62.7% in multi-turn dialogues with patients, a drop of 19.3 percentage points. Only 5% of studies evaluated LLM performance on real patient data. Only 4 peer-reviewed studies worldwide documented real clinical implementation through 2025.
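To make the size of that drop concrete, here is a minimal sketch of the arithmetic in Python. The function name `accuracy_drop` and the constant names are illustrative assumptions, not something taken from the review. The only inputs are the two accuracy figures quoted above. The sketch reports the drop both in percentage points and relative to the vignette baseline.

```python
def accuracy_drop(baseline_pct: float, observed_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop as % of baseline)."""
    if baseline_pct <= 0:
        raise ValueError("baseline_pct must be positive")
    absolute = baseline_pct - observed_pct
    relative = 100.0 * absolute / baseline_pct
    return absolute, relative


# Figures as quoted in this article; the names are hypothetical.
VIGNETTE_ACCURACY = 82.0   # traditional clinical vignettes
DIALOGUE_ACCURACY = 62.7   # multi-turn dialogues with patients

absolute, relative = accuracy_drop(VIGNETTE_ACCURACY, DIALOGUE_ACCURACY)
print(f"Absolute drop: {absolute:.1f} percentage points")
print(f"Relative drop: {relative:.1f}% of vignette accuracy")
```

Read in relative terms, the move from vignettes to dialogue costs a little under a quarter of the accuracy measured on vignettes, which is one way to see why a percentage-point figure alone can understate the gap.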

The gap exists because clinical practice involves: incomplete information provided by the patient, symptoms that change over the course of the consultation, multiple simultaneous conditions, socioeconomic context that affects management, and the need to decide when to say "I don't know" and refer. None of these elements is present in multiple-choice questions.

The most deployed product: documentation, not diagnosis

The medical AI product with the greatest penetration in 2026 is not a diagnostic assistant. It is Microsoft DAX Copilot (Nuance), a system that captures the conversation between physician and patient and automatically generates a draft clinical note.

The reason DAX Copilot reached more than 10 million clinical encounters while other medical AI products remain in pilot is simple: it does not make diagnoses. It captures what the physician said and did, and structures it into a clinical format. The physician reviews and signs. An error is not critical: the worst case is a poorly written note. The regulatory burden is manageable. The value is immediate: 7 minutes saved per consultation, and 50% less time spent on documentation after the consultation.

Hippocratic AI, which specializes in agents for non-diagnostic tasks (discharge education, caregiver preparation, post-hospitalization follow-up), follows the same logic: by not making diagnoses, it can scale outside the scope of medical device regulation. The result is 1.8 million completed calls with a patient satisfaction score of 8.95/10.

The Harvard/Beth Israel case: what the study really says

The study published in April 2026, which concluded that OpenAI's reasoning model "surpassed physicians in diagnosis," deserves careful reading. It evaluated clinical cases presented as structured text to the model and to physicians under test conditions — without direct contact with the patient, without physical examination, without the ability to order additional tests. Under the conditions of the experiment, the model was more accurate.

Under the conditions of the experiment, not in real practice. The distinction matters. In the consulting room, physicians have access to information that text does not capture: the general appearance of the patient (the "facies"), the way they breathe, the result of the physical examination, the intuition accumulated over years of seeing how diseases present in atypical ways. None of these inputs was available to the physicians or to the model in the study.

The result is valid and relevant. But it does not mean that LLMs should replace physicians. It means that, as a decision-support tool in specific contexts, the potential is real — and that the next 2 to 3 years will define what those contexts are.
