• Author
    Santi Bhattarai-Kline
  • Discovery PI

    Corey Arnold PhD

  • Project Co-Author

  • Abstract Title

    Diagnose-Then-Describe: Chain-of-Thought Reasoning is Necessary for Fine-Tuning Chest X-Ray AI Models on Local Patient Data

  • Discovery AOC Petal or Dual Degree Program

    Informatics & Data Science

  • Abstract

    Introduction: Vision-language models (VLMs) have shown promise in interpreting chest radiographs by generating structured radiology reports from one or more chest radiograph views plus the study indication. Evaluating their performance on local patient data and deploying them for clinical use, however, remains a challenge. This study evaluates a state-of-the-art chest x-ray VLM (MAIRA-2) on UCLA patient data and explores fine-tuning strategies to improve its real-world performance.

    Methods: We cleaned and curated a dataset of 17,332 two-view chest radiographs and accompanying reports, with no prior imaging referenced, from 17,265 adult patients at UCLA outpatient clinics. We divided the dataset into training, validation, and test splits at a ratio of 80:10:10. The split kept all studies from each unique patient in the same split and gave each split a balanced ratio of notable findings (determined by labeling the reports with the AI model CheXbert). We performed supervised fine-tuning (SFT) on MAIRA-2 using low-rank adaptation (LoRA). To fine-tune MAIRA-2 to perform chain-of-thought (CoT) reasoning, we prepended CheXbert labels to the ground-truth findings and impressions prior to training. Model outputs were evaluated for both lexical and clinical correctness using validated scoring algorithms.
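
    The patient-level, finding-balanced split described above can be illustrated with a minimal sketch. The abstract does not state how the balancing was implemented, so the approach here (stratifying patients by whether any of their studies has a notable finding, then cutting each stratum 80:10:10) is an assumption, as are the function name `split_by_patient` and the record keys `patient_id` and `notable`.

```python
import random
from collections import defaultdict


def split_by_patient(studies, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split studies into train/val/test without splitting any patient.

    `studies` is a list of dicts with hypothetical keys:
      'patient_id' (str) and 'notable' (bool, e.g. derived from report labels).
    """
    # Group studies so each patient is assigned to exactly one split.
    by_patient = defaultdict(list)
    for study in studies:
        by_patient[study["patient_id"]].append(study)

    # Assumption: a patient is "notable" if any of their studies is.
    strata = defaultdict(list)
    for pid, items in by_patient.items():
        strata[any(s["notable"] for s in items)].append(pid)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for flag in sorted(strata):
        pids = sorted(strata[flag])  # sort first so the shuffle is reproducible
        rng.shuffle(pids)
        n_train = round(ratios[0] * len(pids))
        n_val = round(ratios[1] * len(pids))
        parts = {
            "train": pids[:n_train],
            "val": pids[n_train:n_train + n_val],
            "test": pids[n_train + n_val:],
        }
        for name, chosen in parts.items():
            for pid in chosen:
                splits[name].extend(by_patient[pid])
    return splits
```

    Splitting at the patient level rather than the study level avoids leaking a patient's images from training into the test split; stratifying keeps the share of notable findings roughly equal across splits.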

    Results: CoT fine-tuning using UCLA data improved clinical accuracy compared to the base model (median CheXbert macro F1: 0.319 vs. 0.265). Conversely, fine-tuning without CoT degraded clinical performance compared to the base model (median CheXbert macro F1: 0.249 vs. 0.265). Other metrics (BLEU-4, BERTScore, RadGraph F1) improved substantially for all fine-tuned models; however, these reflect an improvement in lexical similarity, not clinical accuracy.
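
    For readers unfamiliar with the clinical metric, macro F1 is the unweighted mean of per-label F1 scores, so rare findings count as much as common ones. The sketch below shows that computation on binary label vectors. It is illustrative only: it does not reproduce the study's scoring pipeline or the median aggregation reported above, and scoring a label with no true or predicted positives as 0 is an assumed convention.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores.

    y_true, y_pred: lists of equal-length 0/1 label vectors, one per report
    (e.g. labels extracted from the reference and the generated report).
    """
    n_labels = len(y_true[0])
    f1_scores = []
    for j in range(n_labels):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        # Assumed convention: F1 is 0 when the label never occurs in either set.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n_labels
```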

    Conclusions: Supervised fine-tuning can be used to improve the performance of open-weight chest x-ray VLMs on local patient data. It is necessary, however, to augment the training data with high-quality labels and fine-tune the model to perform chain-of-thought by inferring clinical labels prior to generating a structured radiology report. In the absence of this “Diagnose-Then-Describe” chain-of-thought, supervised fine-tuning tends to degrade the performance of existing chest x-ray VLMs.
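
    As a rough illustration of a "Diagnose-Then-Describe" training target, the sketch below prepends clinical labels to the report text so the model learns to emit the labels before the findings and impression. The abstract does not give the actual template, so the `Labels:` / `Findings:` / `Impression:` layout, the `No Finding` fallback, and the function name `build_cot_target` are all hypothetical.

```python
def build_cot_target(labels, findings, impression):
    """Build a fine-tuning target that states labels before the report.

    labels: dict mapping a label name to True/False (e.g. from a report labeler).
    The output template is a hypothetical format, not the one used in the study.
    """
    positives = [name for name, present in labels.items() if present]
    diagnosis = ", ".join(positives) if positives else "No Finding"
    return (
        f"Labels: {diagnosis}\n"
        f"Findings: {findings}\n"
        f"Impression: {impression}"
    )


example = build_cot_target(
    {"Cardiomegaly": True, "Pneumothorax": False, "Pleural Effusion": True},
    findings="The cardiac silhouette is enlarged. Small left pleural effusion.",
    impression="Cardiomegaly with small left pleural effusion.",
)
```

    Because the labels come first in the target sequence, an autoregressive model conditions its report text on the diagnosis it has just produced.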