
Study Finds Not All AI Platforms Are Built to Diagnose

March 16, 2026

As reported by several health and science news outlets, a new study led by Milan Toma, Ph.D., associate professor in the College of Osteopathic Medicine (NYITCOM), finds that general-use AI platforms are unreliable for medical diagnosis. Toma and his co-authors, who include NYITCOM Senior Development Security Operations Engineer Mihir Matalia and medical student Sungjoon Hong, tested the reliability of some of the world's most advanced multimodal large language models, including ChatGPT and Claude. The AI models were tasked with analyzing the same brain scan, which showed clear intracranial pathology of an ischemic stroke near the left middle cerebral artery. The findings reveal a 20 percent rate of fundamental diagnostic error across the AI models, along with concerning variability in interpretation and assessment.

"Our research highlights a critical distinction in the AI landscape," Toma tells News-Medical. "Most successful medical AI tools are task-specific algorithms, trained on large datasets of labeled medical images and validated for very specific diagnostic tasks. However, large language models are not optimized for diagnostics; they are built for linguistics and conversation. Accordingly, they generate explanations that sound authoritative, even when their underlying interpretation is wrong or inconsistent."