-
Author
Ayesha Rasheed
Discovery PI
Jason Napolitano, MD
-
Project Co-Author
-
Abstract Title
Comparing Four Artificial Intelligence Models for Generating Pre-Clinical, Boards-Style Single Best Answer Multiple Choice Questions in Medical Education
-
Discovery AOC Petal or Dual Degree Program
Medical Education Leadership & Scholarship
-
Abstract
Background:
Practice with board-style questions helps medical students become familiar with the format and the clinical reasoning skills needed to succeed on the USMLE Step 1 and Step 2 exams. Producing high-quality board-style questions is time-consuming, and their quality can vary widely among faculty members. While large language models may improve the speed of question generation, their accuracy and the quality of the generated questions remain unclear.
Objective:
To compare the accuracy and quality of pre-clinical board-style questions generated by four artificial intelligence models.
Methods:
In this cross-sectional study, we used four “enterprise-grade” large language models (LLMs) offered by UCLA: Microsoft Copilot, Google Gemini Pro, Google NotebookLM, and nebulaOne (gpt 5.2). Each LLM was given a set of multiple choice questions (MCQs) that did not follow board-style format, along with a written description of optimal board-style question characteristics, and was instructed to convert these questions into board-style questions using standardized prompts for consistency. Each generated question was evaluated with a rubric assessing the vignette's clinical relevance, presentation quality, lead-in quality, option homogeneity, cueing and technical flaws, and the incorporation of required data. Differences in success rates among the LLMs were assessed using Cochran’s Q test, with Bonferroni-corrected McNemar tests for post hoc pairwise comparisons.
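The statistical comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's analysis code: the pass/fail matrix is made up, the model labels `model_a`, `model_b`, and `model_c` are placeholders, and the exact (binomial) form of the McNemar test is an assumption, since the abstract does not say which variant was used. The Cochran's Q p-value helper is valid only for three models (two degrees of freedom).

```python
from itertools import combinations
from math import comb, exp


def cochran_q(rows):
    """Return (Q, df) for rows of 0/1 outcomes, one row per question."""
    k = len(rows[0])
    col_totals = [sum(r[j] for r in rows) for j in range(k)]
    row_totals = [sum(r) for r in rows]
    grand = sum(row_totals)
    denom = k * grand - sum(t * t for t in row_totals)
    if denom == 0:
        # Every question was passed by all models or by none: nothing to test.
        return 0.0, k - 1
    q = (k - 1) * (k * sum(c * c for c in col_totals) - grand * grand) / denom
    return q, k - 1


def chi2_sf_two_df(x):
    """Upper-tail chi-square probability, valid for 2 degrees of freedom only."""
    return exp(-x / 2)


def mcnemar_exact(x, y):
    """Two-sided exact McNemar p-value from the discordant pairs of x and y."""
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical data: 1 = question met the rubric, 0 = it did not.
models = ["model_a", "model_b", "model_c"]
rows = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
]

q, df = cochran_q(rows)
omnibus_p = chi2_sf_two_df(q)  # df == 2 because three models are compared

# Bonferroni correction: divide alpha by the number of pairwise comparisons.
pairs = list(combinations(range(len(models)), 2))
alpha = 0.05 / len(pairs)

pairwise = {}
for i, j in pairs:
    p = mcnemar_exact([r[i] for r in rows], [r[j] for r in rows])
    pairwise[(models[i], models[j])] = (p, p < alpha)

print(f"Cochran's Q = {q:.3f}, df = {df}, p = {omnibus_p:.4f}")
for pair, (p, significant) in pairwise.items():
    print(pair, f"p = {p:.4f}", "significant" if significant else "not significant")
```

With three models there are three pairwise comparisons, so the corrected threshold is 0.05/3, or about 0.0167. Cochran's Q suits this design because every model revised the same set of questions, which makes the outcomes paired rather than independent.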
Results:
NotebookLM failed to follow the prompt rules for every question and was therefore excluded from the final statistical analysis. Among the three remaining models, the success rate of AI-generated MCQ revisions varied significantly (Cochran’s Q, p=0.0036). Gemini achieved the highest success rate (77.8%), followed by nebulaOne (70.4%) and Copilot (53.7%). In post hoc pairwise comparisons, Gemini significantly outperformed Copilot (p<0.0167), whereas Gemini was comparable to nebulaOne. Rubric dimension analysis indicated that the models struggled most with presentation quality (18.5%) and vignette clinical relevance (48.1%) but performed well on option homogeneity and avoidance of technical flaws (92.6% each).
Conclusion:
While most tools succeeded in addressing the structural and technical aspects of MCQ revision/generation, there is a consistent performance gap across models in presentation quality and clinical relevance. Gemini Pro is a promising tool for handling the mechanics of board-style question writing, but human input is required to ensure appropriate clinical relevance.
Keywords: Artificial Intelligence, MCQs, Medical Education