Bert model fine-tuning for text classification in knee OA radiology reports

L Chen, R Shah, T Link, M Bucknor, S Majumdar, V Pedoia

doi:10.1016/j.joca.2020.02.488

Abstract

Purpose: Traditional Natural Language Processing (NLP) techniques do a good job of capturing relationships between adjacent or nearby words. However, clinical text such as patient notes or radiology reports requires capturing interactions between distant words. BERT (Bidirectional Encoder Representations from Transformers), a deep neural network recently introduced by Google, addresses this challenge through its ability to model long-range word interactions. While BERT has dramatically improved outcomes on general-domain NLP tasks such as optimizing search results, its performance on domain-specific tasks such as analyzing biomedical data is still being explored. We report the findings of further training a pre-trained BERT model on knee radiology reports, with the aim of testing its ability to identify cartilage lesions in osteoarthritis (OA) patients, the first such effort in musculoskeletal imaging.

Methods: 1521 MR scans of patients with osteoarthritis, together with the corresponding radiology reports, were assessed by the clinician for all 6 cartilage compartments (Patella, Trochlea, Medial Femur, Medial Tibia, Lateral Femur and Lateral Tibia) according to the Whole-Organ Magnetic Resonance Imaging Score (WORMS) system for knee OA (Figure 1). These gradings were further simplified into binary classes, Normal and Abnormal (WORMS target). The clinician also assessed (blinded) 89 radiology reports and provided binary targets (NLP target). Radiology reports contain Technique, Findings and Impression sections. Because reports vary in length (median 329 and maximum 731 tokens in our dataset) and BERT has a 512-token limit, the Findings section was selected as the input for both classification approaches, Logistic Regression and BERT fine-tuning. Both models use an 80/20 train/test split, are trained on all 6 compartments, and use the binary WORMS target.
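
The section selection and length constraint described above can be illustrated with a minimal sketch. It assumes a hypothetical report layout in which each section begins with an upper-case header such as `FINDINGS:`; the abstract does not describe the actual report format. Whitespace splitting is used only as a rough stand-in for BERT's WordPiece tokenizer, which typically produces more tokens. The sample report text is invented for illustration.

```python
import re

MAX_TOKENS = 512      # BERT input limit cited in the abstract
SPECIAL_TOKENS = 2    # [CLS] and [SEP] each occupy one position

# Assumed (hypothetical) layout: a section header at the start of a line.
SECTION_RE = re.compile(r"^(TECHNIQUE|FINDINGS|IMPRESSION):\s*", re.MULTILINE)

def split_sections(report):
    """Split a report into {section name: text} using the assumed headers."""
    parts = SECTION_RE.split(report)
    # With one capture group, re.split returns
    # [preamble, name, text, name, text, ...].
    return {name.lower(): text.strip()
            for name, text in zip(parts[1::2], parts[2::2])}

def fits_bert(text):
    """Rough length check on lower-cased, whitespace-split tokens."""
    return len(text.lower().split()) + SPECIAL_TOKENS <= MAX_TOKENS

# Invented example text, not taken from the study data.
report = """TECHNIQUE: Sagittal and axial sequences of the knee.
FINDINGS: Partial-thickness cartilage loss over the medial patellar facet.
IMPRESSION: Patellofemoral cartilage lesion."""

findings = split_sections(report)["findings"]
print(findings)
print(fits_bert(findings))
```

In practice the check would be run with the tokenizer that matches the pre-trained model, since subword splitting changes the token count.
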
For the baseline Logistic Regression model, texts were cleaned and tokenized with spaCy, then transformed into numeric vectors with Term Frequency-Inverse Document Frequency (TF-IDF) and used to train a class-weighted logistic regression. After training, 5-fold cross-validation was performed to evaluate performance. To fine-tune BERT for text classification, reports were lower-cased, special tokens were added to mark the beginning and end of sentences, and reports shorter than 512 tokens were padded. The bert-base-uncased pre-trained model then generated vector representations, which served as training features along with the input mask and segment IDs. In the evaluation step, the fine-tuned BERT models were tested on each compartment against both the WORMS target and the NLP target.

Results: In both the training and testing datasets, the ratio of normal to abnormal cases varies by compartment. Patella has the most abnormal cases, making it more balanced, while Medial Tibia (MT) has the fewest (Figure 2A). In the more balanced compartments, such as Patella, Trochlea and Lateral Femur (LFC), BERT significantly improved classification results compared with the baseline model. However, for the imbalanced compartments, even though BERT shows higher accuracy, it overfits and predicts Normal for all cases. This can be addressed by hyper-parameter tuning and by adjusting the class weights in the loss function for each model. Each fine-tuned BERT model was tested against the NLP target (labels on 89 radiology reports) and compared with the earlier WORMS target (gradings from images) (Figure 2B). Accuracy improves significantly on the NLP target with a more balanced dataset (Figure 3). However, the overfitting problem persists in the Medial Femur (MFC), MT and Lateral Tibia (LT) compartments. This points to discrepancies between the MR images and the content of the radiology reports; using the NLP target, which provides more accurate annotations of the reports, should further improve the model’s predictive ability.

Conclusions: Our results in cartilage abnormality prediction, especially in compartments with balanced data, hold promise and call for further investigation of BERT for domain-specific tasks. The ability to develop a model that can accurately decipher text sentiment from radiology reports, avoiding the need for time-consuming image annotation, will accelerate and augment image-based deep learning approaches for anomaly detection and improve clinical workflow.
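
The Results note that on imbalanced compartments the model predicts Normal for every case while still showing high accuracy, and suggest adjusting class weights in the loss function. The following is a minimal sketch of that idea, not the authors' implementation: the inverse-frequency weighting formula, the normalization by the sum of weights, and the toy labels are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Assumed scheme: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

def weighted_bce(labels, probs, weights, eps=1e-12):
    """Weighted mean binary cross-entropy; probs are P(label == 1)."""
    total, norm = 0.0, 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)           # guard against log(0)
        loss = -math.log(p) if y == 1 else -math.log(1 - p)
        total += weights[y] * loss
        norm += weights[y]
    return total / norm

# Toy, made-up labels: 0 = Normal, 1 = Abnormal, with a 9:1 imbalance.
labels = [0] * 9 + [1]
weights = balanced_class_weights(labels)

# A model that confidently calls every case Normal.
always_normal = [0.05] * len(labels)
accuracy = sum((p >= 0.5) == bool(y)
               for y, p in zip(labels, always_normal)) / len(labels)

unweighted = weighted_bce(labels, always_normal, {0: 1.0, 1: 1.0})
weighted = weighted_bce(labels, always_normal, weights)
print(weights, accuracy, unweighted, weighted)
```

On this toy data the all-Normal model scores high accuracy, yet the weighted loss penalizes it far more than the unweighted loss because the single missed Abnormal case carries a larger weight. That is the pressure a class-weighted objective applies against collapsing to the majority class.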
