Validating large language model-assisted data extraction from clinical notes.

J W van Koevorden, N Aben, V Struben, A Rollings, M Kurban, L Wessels, V van der Noort, A V M Burger, S W van Dijk, L Karssemakers, L Smeele, M W J M Wouters, R Dirven

Abstract

BACKGROUND

Health care professionals face increasing documentation burdens, which can compromise efficiency and patient safety. Large language models (LLMs) may offer a scalable solution by automating data extraction from unstructured clinical notes. This study evaluates the accuracy and clinical impact of structured data extraction by an LLM compared with manual extraction by physicians in the context of head and neck oncology consultations.

PATIENTS AND METHODS

This was a prospective validation study comparing LLM-powered data extraction with manual extraction. Clinical documentation from 60 patients (1482 pages) was analyzed. Data were extracted for 29 clinically relevant categories by two physicians and a pretrained open-source LLM. Six clinical experts evaluated 2555 extracted values in a two-step procedure: first, a blinded binary Match/Non-Match assessment between LLM output and human consensus; and second, categorization of all Non-Matches into predefined error types (Incorrect, Incomplete, Missing, Hallucination, or Overcomplete). These expert-assigned labels formed the basis for accuracy, precision, recall, and F1 scores. Extraction times were compared using a paired Student's t-test, and error impact was scored on a 5-point Likert scale.

RESULTS

LLM-powered extraction achieved accuracies between 74% [pathology: 95% confidence interval (CI) 65% to 82%] and 90% (patient characteristics: 95% CI 87% to 94%). Manual extraction showed 29% interobserver disagreement (95% CI 25.19% to 32.74%). Of 2555 extracted values, 68 were rated as high-impact errors, although evaluator assessments varied widely. Hallucinations were rare (0.16%) and low impact. LLM extraction reduced average time per case from 8.6 minutes to 1.9 minutes (P < 0.001).

CONCLUSION

LLMs can support clinical workflows by reducing documentation time and maintaining acceptable accuracy, provided that human oversight is ensured. These findings support further exploration of AI-assisted documentation tools in clinical practice.
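
ILLUSTRATIVE SKETCH

The abstract names its evaluation statistics (accuracy, precision, recall, F1, 95% confidence intervals, and a paired Student's t-test) without stating how they were computed. The Python sketch below shows one way such figures could be derived from expert-assigned labels. It is not the authors' code. Three parts are assumptions: the mapping of error types to false positives and false negatives, the use of a Wilson score interval for the confidence interval, and the function names `score_labels`, `wilson_interval`, and `paired_t_statistic`, which are hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import mean, stdev

# ASSUMPTION: this split of error types is hypothetical. The abstract does not
# say which error types were counted as false positives or false negatives.
FALSE_POSITIVE = {"Incorrect", "Hallucination", "Overcomplete"}
FALSE_NEGATIVE = {"Incomplete", "Missing"}


def score_labels(labels):
    """Compute accuracy, precision, recall, and F1 from expert labels."""
    counts = Counter(labels)
    tp = counts["Match"]
    fp = sum(counts[name] for name in FALSE_POSITIVE)
    fn = sum(counts[name] for name in FALSE_NEGATIVE)
    total = tp + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": tp / total if total else 0.0,  # share of values labelled Match
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%).

    ASSUMPTION: the study may have used a different interval method.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def paired_t_statistic(before, after):
    """Return (t, degrees of freedom) for paired measurements."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [b - a for b, a in zip(before, after)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (sd / sqrt(len(diffs))), len(diffs) - 1


if __name__ == "__main__":
    # Made-up labels and timings for illustration only, not study data.
    demo_labels = ["Match"] * 8 + ["Incorrect", "Missing"]
    print(score_labels(demo_labels))
    print(wilson_interval(8, 10))
    print(paired_t_statistic([8.0, 9.0, 10.0], [2.0, 2.0, 2.0]))
```

The sketch returns the t statistic and its degrees of freedom but not a P value, because the standard library has no t-distribution. In practice a statistics package would supply the P value.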

More about this publication

ESMO Real World Data and Digital Oncology

Volume 12

Pages 100718

Publication date 01-06-2026

Full text links

Publisher website (DOI) 10.1016/j.esmorw.2026.100718

Europe PubMed Central 42183150

PubMed 42183150