USE OF NATURAL LANGUAGE PROCESSING TECHNIQUES TO PREDICT PATIENT SELECTION FOR TOTAL HIP ARTHROPLASTY: RESULTS FROM THE AI TO REVOLUTIONISE THE PATIENT CARE PATHWAY IN HIP AND KNEE ARTHROPLASTY (ARCHERY) PROJECT

Luke Farrow, Lesley Anderson, Mingjun Zhong

doi:10.1302/1358-992x.2024.6.034

Journal: Orthopaedic Proceedings

Publication Date: May 2, 2024

Abstract

To examine whether Natural Language Processing (NLP) using a state-of-the-art clinically based Large Language Model (LLM) could predict patient selection for Total Hip Arthroplasty (THA) across a range of routinely available clinical text sources.

Data pre-processing and analyses were conducted according to the Ai to Revolutionise the patient Care pathway in Hip and Knee arthroplasty (ARCHERY) project protocol (https://www.researchprotocols.org/2022/5/e37092/). Three types of de-identified Scottish regional clinical free-text data were assessed: referral letters, radiology reports and clinic letters. NLP algorithms were based on the GatorTron model, a Bidirectional Encoder Representations from Transformers (BERT) based LLM trained on 82 billion words of de-identified clinical text. Three specific inference tasks were performed: assessment of the base GatorTron model, assessment after model fine-tuning, and external validation.

There were 3911, 1621 and 1503 patient text documents included from referral letters, radiology reports and clinic letters respectively. All three sources displayed significant class imbalance, with only 15.8%, 24.9% and 5.9% of patients linked to the respective text source having undergone surgery. Performance of the base model without fine-tuning was poor, with F1 scores (harmonic mean of precision and recall) of 0.02, 0.38 and 0.09 respectively. Performance improved with model training, with mean F1 scores (range) of 0.39 (0.31–0.47), 0.57 (0.48–0.63) and 0.32 (0.28–0.39) across the 5 folds of cross-validation. Performance deteriorated on external validation across all three groups but remained highest for the radiology report cohort.

Even with further training on a large cohort of routinely collected free-text data, a clinical LLM fails to adequately perform clinical inference in NLP tasks regarding identification of those selected to undergo THA. This likely relates to the complexity and heterogeneity of free-text information and the way that patients are determined to be surgical candidates.
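
As an illustration of why the F1 score is informative under the class imbalance described above, the following minimal sketch computes precision, recall and F1 from confusion-matrix counts. It is not the ARCHERY analysis code: the function name `precision_recall_f1` and all counts are hypothetical, chosen only so that the positive class makes up 15.8% of a notional 1000 documents, matching the referral-letter prevalence reported in the abstract.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive
    and false-negative counts. Undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical cohort: 1000 documents, 158 linked to surgery (15.8%).
total = 1000
positives = 158

# A classifier that predicts "no surgery" for every document looks
# accurate because of the imbalance, yet never finds a surgical case.
accuracy_all_negative = (total - positives) / total
_, _, f1_all_negative = precision_recall_f1(tp=0, fp=0, fn=positives)

# Hypothetical classifier that finds half of the surgical cases and is
# right half of the time when it predicts surgery.
p, r, f1_balanced = precision_recall_f1(tp=79, fp=79, fn=79)

print(f"all-negative: accuracy={accuracy_all_negative:.3f}, F1={f1_all_negative:.2f}")
print(f"hypothetical: precision={p:.2f}, recall={r:.2f}, F1={f1_balanced:.2f}")
```

Because F1 ignores true negatives, it is not inflated by the large non-surgical majority, which is one reason it is commonly reported for imbalanced tasks of this kind.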