Listen to the patient.
We evaluated AI health tools by removing the very mechanism that makes diagnosis work, measured what happened, and called it a safety problem. It is not just a safety problem. It is a design problem. Medicine knew this in 1975
We evaluated AI health tools by removing the very mechanism that makes diagnosis work, measured what happened, and called it a safety problem. It is not just a safety problem. It is a design problem. Medicine knew this in 1975
Two studies this week, one from Google and one from Microsoft, are being celebrated as evidence that healthcare AI has arrived. Read them together and they reveal something more uncomfortable: AI is filling a gap that healthcare created long before any of us started building AI for health.
The clinical reasoning pattern hasn't changed since Hippocrates. The vessels carrying it have, from paper charts to cloud EHRs. For fifty years, every vessel moved the data without understanding it. AI breaks that pattern for the first time. Not a better vessel. One that thinks.
This week I read three papers that made me happy. JAMA. NEJM. Nature Medicine. All randomized trials. All showing AI outperforming standard care. Then I read the methodology. None of them LLMs. The AI winning in top journals in 2026 was built before the hype cycle. The blueprint was always there.
A Nature study shows LLMs achieve 94.9% accuracy on benchmarks but only 34.5% when laypeople use LLMs on physician-created scenarios. The gap reveals something deeper: We measure the model in isolation. We deploy to a human in distress. The system fails at the intersection.
The physician's job has always been data synthesis. Fragments, approximations, family filters, descriptions of pill geometry. Consumer health AI was built for a different patient. One that has never walked through a clinic door.
Asimov gave robots three non-negotiable laws. Medicine gives physicians an oath. Healthcare AI has governance, but no runtime constitution. Until safety principles are enforced at the moment of output, not just in policy documents, we are deploying systems without the equivalent of “do no harm.”
A Nature Medicine study found ChatGPT Health under-triaged 52% of real emergencies. The deeper issue may not be the model, but its training data: AI learns from documented hospital records, yet it is deployed at first contact, where the most critical signals were never captured.
The New York Times asked eight leading thinkers where A.I. is headed. Most rated its near-term medical impact as small or moderate. They were looking at the wrong scale. The revolution in healthcare A.I. is already here; it just looks like a discharge summary generated in seconds, not hours.
The most important AI benchmark you have never heard of measures one thing: how long can an AI system work on a task before it loses the thread? In healthcare, that matters more than single-answer accuracy. A sepsis protocol is a multi-hour trajectory, not a prompt. Attention span is the real gate.
Governance by design is essential, until 3 a.m. In an emergency, a perfectly compliant blood bank workflow nearly cost a life. Healthcare AI doesn’t need less governance. It needs adaptive governance, rules that scale with risk and allow accountable overrides when seconds matter.
AI governance isn’t nested boxes or monthly committees. It’s architecture. Under HIPAA and GDPR, your “data steward” often can’t even review the data. Real governance is built into the plumbing, validation, lineage, and security enforced automatically. Otherwise, it’s theater.