  • Author

    Samuel Margolis

  • Discovery PI

    Joanne Elmore

  • Project Co-Author

    David Schriger, Kimon LH Ioannides

  • Abstract Title

    Evaluating the Safety of Patient-Facing Artificial Intelligence Triage Systems: A Standardized Benchmark Study

  • Discovery AOC Petal or Dual Degree Program

    Informatics & Data Science

  • Abstract

    Background/Objective

    Patient-facing AI systems are increasingly used as front-door triage tools, yet no standardized predeployment safety framework exists. We developed TriageBench, a reproducible benchmarking framework, to evaluate unsafe triage errors and information-gathering behavior across patient-facing AI systems under controlled conditions.


    Methods

    We evaluated 6 platforms across 3 categories: LLM API models (Claude Opus 4.7, Gemini 3.1 Pro, GPT-5.4), general-purpose consumer chatbots (ChatGPT, Gemini), and a commercial clinical triage tool (Doctronic). Sixty novel guideline-grounded clinical scenarios (20 Emergency, 20 Clinician evaluation, 20 Home management) were developed with physician-adjudicated reference standards. Each platform completed standardized simulated patient interactions (≤10 response turns), yielding 360 total case-platform encounters. Primary outcomes were unsafe under-triage of Emergency cases and overall triage accuracy. Secondary outcomes included over-triage of Home cases, response burden (question count, word count), and readability (Flesch-Kincaid grade level).
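
    As an illustration of the readability outcome, the sketch below computes a Flesch-Kincaid grade level from pre-counted words, sentences, and syllables using the standard published formula. The function name and the example counts are hypothetical, and how the study tokenized text or counted syllables is not stated here, so this is a minimal sketch rather than the authors' pipeline.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw counts (standard formula).

    The counts are assumed to be supplied by the caller; syllable counting
    in particular is heuristic and tool-dependent.
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical response: 100 words, 5 sentences, 150 syllables.
example_grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(round(example_grade, 2))
```

    Longer sentences and more syllables per word both raise the grade, which is why a wordier response is not necessarily harder to read.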


    Results

    No platform under-triaged Emergency cases (0% unsafe under-triage across all 6 systems). However, clinically significant under-triage occurred in the Clinician evaluation tier: Doctronic incorrectly recommended Home management in 4 of 20 Clinician evaluation cases (20%), followed by Claude Opus 4.7 (3/20, 15%) and ChatGPT (1/20, 5%). Over-triage was concentrated in the Gemini consumer chatbot, which inappropriately escalated 8 of 19 Home management cases (42%). Overall accuracy ranged from 86.7% (Gemini consumer chatbot, κ=0.80) to 98.3% (GPT-5.4, κ=0.98). Group-level accuracy was highest for LLM API models (96.1%), followed by the clinical tool (93.3%) and consumer chatbots (91.7%). Doctronic asked the most questions per encounter (median 18 vs. 9 for consumer chatbots) but generated the shortest responses (median 825 words vs. 2,183–2,748 for consumer platforms). Reading level was comparable across platforms (median Flesch-Kincaid grade 7–11).
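
    The accuracy, κ, and under-/over-triage figures above can all be derived from a 3×3 table of reference tier against recommended tier. The sketch below assumes unweighted Cohen's κ (the abstract does not say which variant was used) and treats any recommendation below the reference tier as under-triage. The function names and the counts in `example` are hypothetical and are not the study's data.

```python
# Rows: reference tier; columns: platform recommendation.
# Tier order runs from highest to lowest acuity.
TIERS = ["Emergency", "Clinician evaluation", "Home management"]


def accuracy(matrix):
    """Share of cases where the recommendation matches the reference tier."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def cohen_kappa(matrix):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    p_o = sum(matrix[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(row_totals[i] * col_totals[i] for i in range(k)) / n**2
    return (p_o - p_e) / (1 - p_e)


def under_triage_rate(matrix, tier):
    """Share of a reference tier's cases sent to any lower-acuity tier."""
    row = matrix[tier]
    return sum(row[tier + 1:]) / sum(row)


def over_triage_rate(matrix, tier):
    """Share of a reference tier's cases sent to any higher-acuity tier."""
    row = matrix[tier]
    return sum(row[:tier]) / sum(row)


# Hypothetical counts for one platform (20 cases per reference tier).
example = [
    [20, 0, 0],   # Emergency cases
    [1, 17, 2],   # Clinician evaluation cases
    [0, 3, 17],   # Home management cases
]

print(f"accuracy = {accuracy(example):.3f}")
print(f"kappa    = {cohen_kappa(example):.3f}")
for i, name in enumerate(TIERS):
    print(name, under_triage_rate(example, i), over_triage_rate(example, i))
```

    With equal numbers of cases per reference tier, chance agreement under this definition is always 1/3, so κ reduces to (accuracy − 1/3) / (2/3). That relationship matches the reported pairs of 86.7% with κ=0.80 and 98.3% with κ=0.98.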


    Conclusions

    All evaluated systems successfully identified Emergency cases, suggesting a floor effect in the highest-acuity tier. The primary safety-relevant failures occurred in the Clinician evaluation tier, where under-triage to Home management was observed in both the commercial clinical tool and general-purpose chatbots, although through distinct failure mechanisms. TriageBench provides a reproducible evaluation framework to support safety testing of patient-facing AI triage systems before broad patient exposure.