-
Author
Samuel Margolis
-
Discovery PI
Joanne Elmore
-
Project Co-Author
David Schriger, Kimon LH Ioannides
-
Abstract Title
Evaluating the Safety of Patient-Facing Artificial Intelligence Triage Systems: A Standardized Benchmark Study
-
Discovery AOC Petal or Dual Degree Program
Informatics & Data Science
-
Abstract
Background/Objective
Patient-facing AI systems are increasingly used as front-door triage tools, yet no standardized predeployment safety framework exists. We developed TriageBench, a reproducible benchmarking framework, to evaluate unsafe triage errors and information-gathering behavior across patient-facing AI systems under controlled conditions.
Methods
We evaluated 6 platforms across 3 categories: LLM API models (Claude Opus 4.7, Gemini 3.1 Pro, GPT-5.4), general-purpose consumer chatbots (ChatGPT, Gemini), and a commercial clinical triage tool (Doctronic). Sixty novel guideline-grounded clinical scenarios (20 Emergency, 20 Clinician evaluation, 20 Home management) were developed with physician-adjudicated reference standards. Each platform completed standardized simulated patient interactions (≤10 response turns), yielding 360 total case-platform encounters. Primary outcomes were unsafe under-triage of Emergency cases and overall triage accuracy. Secondary outcomes included over-triage of Home cases, response burden (question count, word count), and readability (Flesch-Kincaid grade level).
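The primary and secondary triage outcomes above can be illustrated with a minimal sketch. It assumes each encounter is reduced to one reference tier and one recommended tier; the label strings, the `triage_metrics` function name, and the acuity ordering are illustrative assumptions, not the TriageBench implementation.

```python
# Hypothetical tier labels, ordered from lowest to highest acuity.
TIERS = ["Home", "Clinician", "Emergency"]
RANK = {tier: i for i, tier in enumerate(TIERS)}


def triage_metrics(reference, predicted):
    """Return overall accuracy, Emergency under-triage rate, and Home over-triage rate."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("reference and predicted must be non-empty and equal in length")

    pairs = list(zip(reference, predicted))
    correct = sum(r == p for r, p in pairs)

    # Under-triage: an Emergency case assigned to any lower-acuity tier.
    emergency = [(r, p) for r, p in pairs if r == "Emergency"]
    under = sum(RANK[p] < RANK[r] for r, p in emergency)

    # Over-triage: a Home case assigned to any higher-acuity tier.
    home = [(r, p) for r, p in pairs if r == "Home"]
    over = sum(RANK[p] > RANK[r] for r, p in home)

    return {
        "accuracy": correct / len(pairs),
        "emergency_under_triage": under / len(emergency) if emergency else None,
        "home_over_triage": over / len(home) if home else None,
    }
```

Under these assumptions, calling `triage_metrics` once per platform on its 60 reference/recommendation pairs would give the per-platform values; the same comparison restricted to Clinician evaluation cases would capture under-triage to Home management in that tier.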
Results
No platform under-triaged an Emergency case (0% unsafe under-triage across all 6 systems). However, clinically significant under-triage occurred in the Clinician evaluation tier: Doctronic incorrectly recommended Home management in 4 of 20 Clinician evaluation cases (20%), followed by Claude Opus 4.7 (3/20, 15%) and ChatGPT (1/20, 5%). Over-triage was concentrated in the Gemini consumer chatbot, which inappropriately escalated 8 of 19 Home cases (42%). Overall accuracy ranged from 86.7% (Gemini consumer, κ=0.80) to 98.3% (GPT-5.4, κ=0.98). Group-level accuracy was highest for LLM API models (96.1%), followed by the clinical tool (93.3%) and consumer chatbots (91.7%). Doctronic asked the most questions per encounter (median 18 vs. 9 for chatbots) but generated the shortest responses (median 825 words vs. 2,183–2,748 for consumer platforms). Reading level was comparable across platforms (median Flesch-Kincaid grade 7–11).
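For readers unfamiliar with the κ values reported alongside accuracy, the following is a minimal sketch of Cohen's kappa between reference and recommended tiers. The abstract does not state whether a weighted or unweighted kappa was used; the unweighted form and the `cohens_kappa` function name are assumptions made for illustration.

```python
from collections import Counter


def cohens_kappa(reference, predicted):
    """Unweighted Cohen's kappa: agreement beyond what chance alone would produce."""
    n = len(reference)
    if n == 0 or n != len(predicted):
        raise ValueError("reference and predicted must be non-empty and equal in length")

    # Observed agreement: fraction of cases where the two labels match.
    observed = sum(r == p for r, p in zip(reference, predicted)) / n

    # Expected agreement under independence, from the marginal label counts.
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    expected = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / (n * n)

    if expected == 1.0:
        # Degenerate case: both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

With three equally sized tiers, chance agreement is roughly one third, so kappa falls somewhat below raw accuracy.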
Conclusions
All evaluated systems successfully identified Emergency cases, suggesting a floor effect for errors in the highest-acuity tier. The primary safety-relevant failures occurred in the Clinician evaluation tier, where under-triage to Home management was observed in both the commercial clinical tool and general-purpose models, though through distinct failure mechanisms. TriageBench provides a reproducible evaluation framework to support safety testing of patient-facing AI triage systems before broad patient exposure.