Abstract

Background and Aims Large language models (LLMs) have gained significant attention in the field of natural language processing (NLP), marking a shift from traditional techniques such as Term Frequency-Inverse Document Frequency (TF-IDF). We developed a traditional NLP model to predict arteriovenous fistula (AVF) failure within the next 30 days using clinical notes. The goal of this analysis was to investigate whether an LLM would outperform traditional NLP techniques on this same prediction task.

Method We defined AVF failure as a change in status from active to either permanently unusable or temporarily unusable. We used data from a large kidney care network from January 2021 to December 2021. Two models were created: one using an LLM and one using the traditional TF-IDF technique. For the LLM we used “distilbert-base-uncased”, a distilled version of the BERT base model [1], and compared its performance with that of the TF-IDF-based model. The dataset was randomly divided into 60% training, 20% validation, and 20% test sets. The test set, comprising data from unseen patients, was used to evaluate model performance. Both models were evaluated using the area under the receiver operating characteristic curve (AUROC), accuracy, sensitivity, and specificity.

Results The 30-day incidence of AVF failure was 2.3% in the population. The LLM and the traditional model showed similar overall performance, as summarized in Table 1, with the LLM marginally better on some metrics. Both models had the same AUROC of 0.64 on the test data. Accuracy and balanced accuracy were 72.9% and 59.7%, respectively, for the LLM, compared with 70.9% and 59.6% for the traditional TF-IDF approach. Specificity was 73.2% for the LLM, slightly higher than the 71.2% observed for the traditional NLP method. However, the LLM had a lower sensitivity of 46.1%, compared with 48% for traditional NLP.
It is also worth noting that training the LLM took considerably longer than training the TF-IDF model. It also required more computational resources, such as graphics processing unit (GPU) instances in cloud-based services, leading to higher cost.

Conclusion In our study, an advanced LLM performed comparably to traditional TF-IDF modeling techniques in predicting AVF failure. Both models achieved an identical AUROC. Specificity was higher for the LLM, whereas sensitivity was higher for traditional NLP. The LLM was fine-tuned on a limited dataset, which may have held its performance to a level similar to that of traditional NLP methods. This finding suggests that while LLMs may excel in certain scenarios, such as performing in-depth sentiment analysis of patient data for complex tasks, their effectiveness is highly dependent on the specific use case. It is crucial to weigh the benefits of LLMs against the resources they require, as they can be significantly more resource-intensive and costly than traditional TF-IDF methods. This highlights the importance of a use-case-driven approach in selecting the appropriate NLP technique for healthcare applications.
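
For readers less familiar with the reported metrics, the following is a minimal sketch of how balanced accuracy relates to sensitivity and specificity, and why plain accuracy can mislead when the outcome is as rare as the 2.3% incidence reported here. The function names are illustrative and not taken from the study. The sensitivity and specificity values are those quoted in the Results. Using 2.3% as the prevalence for the always-negative baseline is an assumption, since the test-set prevalence is not stated.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of sensitivity and specificity (binary classification)."""
    return (sensitivity + specificity) / 2.0


def accuracy(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Overall accuracy as a prevalence-weighted mix of the two rates."""
    return sensitivity * prevalence + specificity * (1.0 - prevalence)


# Sensitivity and specificity as quoted in the Results section.
llm_ba = balanced_accuracy(sensitivity=0.461, specificity=0.732)
tfidf_ba = balanced_accuracy(sensitivity=0.480, specificity=0.712)

# Hypothetical baseline that never predicts failure: sensitivity 0, specificity 1.
# Assumption: prevalence equals the 2.3% population incidence.
naive_acc = accuracy(sensitivity=0.0, specificity=1.0, prevalence=0.023)
naive_ba = balanced_accuracy(sensitivity=0.0, specificity=1.0)

print(f"LLM balanced accuracy:    {llm_ba:.4f}")
print(f"TF-IDF balanced accuracy: {tfidf_ba:.4f}")
print(f"Always-negative accuracy: {naive_acc:.4f}")
print(f"Always-negative balanced accuracy: {naive_ba:.4f}")
```

Under these assumptions, a model that never flags a failure would score high on plain accuracy while sitting at chance on balanced accuracy, which is why balanced accuracy, sensitivity, and specificity are the more informative comparison for a rare outcome.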