LEGALAI: INTEGRATED PLATFORM FOR AUTOMATED LEGAL ASSISTANCE AND LAWYER CONNECTION

Abstract

Access to legal information and professional assistance remains a significant challenge for a large segment of society. Legal consultations are often costly, legal language is complex, and awareness of procedures and rights is limited among ordinary citizens. As a result, many individuals delay or avoid seeking legal help, leading to unresolved disputes and preventable escalation of conflicts. Although digital transformation has revolutionized sectors such as healthcare, banking, and education, the legal industry continues to struggle with providing accessible, user-friendly, and technology-driven solutions. Most existing legal platforms function either as static informational websites or as lawyer directories, offering limited interaction and lacking intelligent, personalized assistance. This paper introduces LegalAI, an integrated platform designed to enhance access to justice through artificial intelligence and structured digital services. LegalAI combines AI-driven legal analysis, a verified lawyer marketplace, and case management tools within a single ecosystem. The platform leverages Natural Language Processing (NLP) and Large Language Models (LLMs) to analyze user-submitted legal problems written in natural language. It identifies relevant legal domains, references applicable laws, and generates simplified explanations tailored for non-experts. Additionally, the system recommends practical next steps, including documentation requirements, procedural guidance, and options for professional consultation. The platform is developed using React.js and Tailwind CSS to create a responsive and intuitive frontend interface. Backend operations are implemented using Python-based frameworks such as FastAPI or Flask, ensuring scalable and efficient API services. MongoDB is utilized for flexible and scalable data storage, supporting diverse user records and case information. Security is maintained through JWT-based authentication, enabling secure, role-based access control for users, lawyers, and administrators. Experimental evaluation indicates that LegalAI improves users' preliminary legal understanding, reduces confusion during early dispute stages, and streamlines the process of identifying suitable legal professionals. By integrating intelligent assistance with verified lawyer connections, the platform demonstrates strong potential to modernize legal service delivery and bridge the gap between citizens and the legal system.

Similar Papers
  • Research Article
  • 10.1093/bjrai/ubaf010
Integrating NLP into Radiation Oncology: A Practical Guide to Transformer Architecture and Large Language Models
  • Aug 13, 2025
  • BJR|Artificial Intelligence
  • Reza Khanmohammadi + 10 more

Natural Language Processing (NLP) is a key technique for developing Medical Artificial Intelligence (AI) systems that leverage Electronic Health Record (EHR) data to build diagnostic and prognostic models. NLP enables the conversion of unstructured clinical text into structured data that can be fed into AI algorithms. The emergence of transformer architecture and large language models (LLMs) has led to advances in NLP for various healthcare tasks, such as entity recognition, relation extraction, sentence similarity, text summarization, and question-answering. In this article, we review the major technical innovations that underpin modern NLP models and present state-of-the-art NLP applications that employ LLMs in radiation oncology research. However, it is crucial to recognize that LLMs are prone to hallucinations, biases, and ethical violations, which necessitate rigorous evaluation and validation prior to clinical deployment. As such, we propose a comprehensive framework for assessing the NLP models based on their purpose and clinical fit, technical performance, bias and trust, legal and ethical implications, and quality assurance prior to implementation in clinical radiation oncology. Our article aims to provide guidance and insights for researchers and clinicians who are interested in developing and using NLP models in clinical radiation oncology.

  • Research Article
  • Citations: 8
  • 10.1007/s00330-024-11148-x
Automated anonymization of radiology reports: comparison of publicly available natural language processing and large language models.
  • Oct 31, 2024
  • European radiology
  • Marcel C Langenbach + 7 more

Medical reports, governed by HIPAA regulations, contain personal health information (PHI), restricting secondary data use. Utilizing natural language processing (NLP) and large language models (LLMs), we sought to employ publicly available methods to automatically anonymize PHI in free-text radiology reports. We compared two publicly available rule-based NLP models (spaCy; NLPac, accuracy-optimized; NLPsp, speed-optimized; iteratively improved on 400 free-text CT-reports (test set)) and one offline LLM approach (LLM-model, LLaMa-2, Meta-AI) for PHI-anonymization. The three models were tested on 100 randomly selected chest CT reports. Two investigators assessed the anonymization of occurring PHI entities and whether clinical information was removed. Subsequently, precision, recall, and F1 scores were calculated. NLPac and NLPsp successfully removed all instances of dates (n = 333), medical record numbers (MRN) (n = 6), and accession numbers (ACC) (n = 92). The LLM model removed all MRNs, 96% of ACCs, and 32% of dates. NLPac was most consistent with a perfect F1-score of 1.00, followed by NLPsp with lower precision (0.86) and F1-score (0.92) for dates. The LLM model had perfect precision for MRNs, ACCs, and dates but the lowest recall for ACC (0.96) and dates (0.52), corresponding F1 scores of 0.98 and 0.68, respectively. Names were removed completely or majorly (i.e., one first or family name non-anonymized) in 100% (NLPac), 72% (NLPsp), and 90% (LLM-model). Importantly, NLPac and NLPsp did not remove medical information, while the LLM model did in 10% (n = 10). Pre-trained NLP models can effectively anonymize free-text radiology reports, while anonymization with the LLM model is more prone to deleting medical information. Question: This study compares NLP and locally hosted LLM techniques to ensure PHI anonymization without losing clinical information. Findings: Pre-trained NLP models effectively anonymized radiology reports without removing clinical data, while a locally hosted LLM was less reliable, risking the loss of important information. Clinical relevance: Fast, reliable, automated anonymization of PHI from radiology reports enables HIPAA-compliant secondary use, facilitating advanced applications like LLM-driven radiology analysis while ensuring ethical handling of sensitive patient data.
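
The F1 scores quoted in this abstract follow from the reported precision and recall values via the harmonic mean, F1 = 2PR / (P + R). The short sketch below recomputes them as a sanity check. The function name `f1_score` is chosen for this example, and the NLPsp recall of 1.00 is an assumption inferred from the statement that it removed all date instances; the abstract does not state that figure directly.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, F1 as reported in the abstract)
reported = {
    "LLM, accession numbers": (1.00, 0.96, 0.98),
    "LLM, dates": (1.00, 0.52, 0.68),
    # Recall of 1.00 is assumed here: NLPsp removed all date instances.
    "NLPsp, dates": (0.86, 1.00, 0.92),
}

for label, (precision, recall, f1_reported) in reported.items():
    f1 = f1_score(precision, recall)
    print(f"{label}: computed {f1:.2f}, reported {f1_reported:.2f}")
```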

  • Research Article
  • Citations: 11
  • 10.1038/s41598-025-08031-0
Comparing traditional natural language processing and large language models for mental health status classification: a multi-model evaluation
  • Jul 6, 2025
  • Scientific Reports
  • Thomas Kallstenius + 3 more

The substantial increase in mental health disorders globally necessitates scalable, accurate tools for detecting and classifying these conditions in digital environments. This study addresses the critical challenge of automated mental health classification by comparing three distinct computational approaches: (1) Traditional Natural Language Processing (NLP) with advanced feature engineering, (2) Prompt-engineered large language models (LLMs), and (3) Fine-tuned LLMs. The dataset consisted of over 51,000 publicly available text statements from social media platforms, tagged with seven mental health conditions: Normal, Depression, Suicidal, Anxiety, Stress, Bipolar Disorder, and Personality Disorder. The dataset was stratified into training, validation, and test sets for model evaluation. The primary outcome was classification accuracy across these seven mental health conditions. Additional metrics like precision, recall, and F1-score were analyzed. We compared the results of the three computational approaches, and overfitting was monitored through validation loss across epochs for the fine-tuned LLM. The NLP model with advanced feature engineering achieved an overall accuracy of 95%, surpassing both the prompt-engineered LLM (65%) and the fine-tuned LLM (91%). This model performed exceptionally well in terms of accuracy and precision. While fine-tuning for three epochs yielded optimal results, further training led to overfitting and decreased performance. This study demonstrates the significant benefits of applying advanced text preprocessing and feature engineering techniques to traditional NLP models, alongside fine-tuning LLMs, such as GPT-4o-mini, for mental health classification tasks. The results clearly indicate that off-the-shelf LLM chatbots using prompt engineering are inadequate for mental health classification, performing 30 percentage points below specialized NLP approaches. Despite the popularity of general-purpose LLMs, specialized approaches remain superior for critical healthcare applications like mental health classification.

  • Research Article
  • Citations: 7
  • 10.3389/fmed.2024.1512824
Emerging applications of NLP and large language models in gastroenterology and hepatology: a systematic review.
  • Jan 22, 2025
  • Frontiers in medicine
  • Mahmud Omar + 5 more

In recent years, natural language processing (NLP) has transformed significantly with the introduction of large language models (LLMs). This review provides an update on NLP and LLM applications and challenges in gastroenterology and hepatology. Registered with PROSPERO (CRD42024542275) and adhering to PRISMA guidelines, we searched six databases for relevant studies published from 2003 to 2024, ultimately including 57 studies. Our review of 57 studies notes an increase in relevant publications in 2023-2024 compared to previous years, reflecting growing interest in newer models such as GPT-3 and GPT-4. The results demonstrate that NLP models have enhanced data extraction from electronic health records and other unstructured medical data sources. Key findings include high precision in identifying disease characteristics from unstructured reports and ongoing improvement in clinical decision-making. Risk of bias assessments using ROBINS-I, QUADAS-2, and PROBAST tools confirmed the methodological robustness of the included studies. NLP and LLMs can enhance diagnosis and treatment in gastroenterology and hepatology. They enable extraction of data from unstructured medical records, such as endoscopy reports and patient notes, and support clinical decision-making. Despite these advancements, integrating these tools into routine practice is still challenging. Future work should prospectively demonstrate real-world value.

  • Research Article
  • Citations: 6
  • 10.1161/circep.124.013023
Engineering of Generative Artificial Intelligence and Natural Language Processing Models to Accurately Identify Arrhythmia Recurrence.
  • Dec 16, 2024
  • Circulation. Arrhythmia and electrophysiology
  • Ruibin Feng + 16 more

Large language models (LLMs) such as Chat Generative Pre-trained Transformer (ChatGPT) excel at interpreting unstructured data from public sources, yet are limited when responding to queries on private repositories, such as electronic health records (EHRs). We hypothesized that prompt engineering could enhance the accuracy of LLMs for interpreting EHR data without requiring domain knowledge, thus expanding their utility for patients and personalized diagnostics. We designed and systematically tested prompt engineering techniques to improve the ability of LLMs to interpret EHRs for nuanced diagnostic questions, referenced to a panel of medical experts. In 490 full-text EHR notes from 125 patients with prior life-threatening heart rhythm disorders, we asked GPT-4-turbo to identify recurrent arrhythmias distinct from prior events and tested 220 563 queries. To provide context, results were compared with rule-based natural language processing and Bidirectional Encoder Representations from Transformer-based language models. Experiments were repeated for 2 additional LLMs. In an independent hold-out set of 389 notes, GPT-4-turbo had a balanced accuracy of 64.3%±4.7% out-of-the-box at baseline. This increased when asking GPT-4-turbo to provide a rationale for its answers, a structured data output, and in-context exemplars, to a balanced accuracy of 91.4%±3.8% (P<0.05). This surpassed the traditional logic-based natural language processing and BERT-based models (P<0.05). Results were consistent for GPT-3.5-turbo and Jurassic-2 LLMs. The use of prompt engineering strategies enables LLMs to identify clinical end points from EHRs with an accuracy that surpassed natural language processing and approximated experts, yet without the need for expert knowledge. These approaches could be applied to LLM queries for other domains, to facilitate automated analysis of nuanced data sets with high accuracy by nonexperts.

  • Book Chapter
  • 10.3233/atde250643
A Large Language Model Algorithm for Language Processing Technology Deviation Detection
  • Aug 28, 2025
  • Junze Li

With the wide application of large language models in natural language processing, problems such as data bias, improper algorithm selection and amplification of social bias have become increasingly prominent, and it is urgent to ensure its fairness and accuracy through technical improvement and ethical consideration. At present, the error detection technology in natural language processing has the problem of insufficient accuracy and efficiency, which seriously affects the reliability of language processing. Therefore, this study aims to build an efficient deviation detection system using large language model algorithm to improve the accuracy and robustness of natural language processing technology. In this paper, a deviation detection system based on speech processing technology is designed and implemented by using the algorithm of large language model in neural network. Its performance is verified by experiments. The experimental results show that the algorithm is easy to operate and the accuracy of deviation detection is as high as 96.76%. This study not only provides a new technical path for the application of large language models in natural language processing, but also provides an important reference for improving the fairness and accuracy of language processing technology. The results show that the large language model algorithm has significant advantages in deviation detection and can effectively solve the technical bottleneck in natural language processing.

  • Research Article
  • 10.1007/s10143-025-03785-7
Current trends and future prospects of language models and processing systems in spine surgery - a scoping review.
  • Sep 5, 2025
  • Neurosurgical review
  • Vivek Sanker + 9 more

Natural language processing (NLP) tools and large language models (LLMs), such as ChatGPT, represent transformative advancements in artificial intelligence (AI). Their implementation in the medical field has broad potential, and this review discusses the current trends and prospects of NLP tools and LLMs in spine surgery, assessing their potential benefits, applications, and limitations. The methodology involved a comprehensive narrative review of existing English literature related to the use of NLP tools and LLMs in spine surgery. We searched the databases PubMed, EMBASE, Web of Science and Scopus from inception until 16th June 2025 using keywords revolving around LLM, natural language processing and spine surgery. Original studies, clinical reports, and case series were included, while abstracts or unpublished studies were excluded. From 221 initial records, 37 studies were included: 18 evaluated LLMs and 19 evaluated NLP-based tools. LLMs were commonly used for clinical decision-making (n = 8), patient counseling (n = 7), classification (n = 2), and in research (n = 1). NLP tools were applied in classification tasks (n = 12), clinical decision-making (n = 3), patient counseling (n = 1), postoperative opioid monitoring (n = 2), and research registry development (n = 1). ChatGPT-4 achieved up to 92% accuracy in clinical recommendations, outperforming GPT-3.5 in multiple tasks. Comparative analyses have found that newer versions of LLMs, such as ChatGPT-4, outperform previous versions, as evidenced by greater accuracy and a lesser extent of artificial hallucination. However, limitations persist, including overconfident outputs, adherence gaps to clinical guidelines, and inconsistent patient readability. While this review suggests that NLP tools and LLMs can have a significant impact on spine practice, it is important to keep their limitations in mind and implement them with caution. To maximize the benefits of these models in spine surgery, future research should focus on improving model sensitivity and specificity, promoting multi-disciplinary collaborations, and addressing ethical considerations regarding the use of language models in medical practice, including the inherent issue of hallucination of these models.

  • Research Article
  • Citations: 32
  • 10.3390/bdcc8060063
LLMs and NLP Models in Cryptocurrency Sentiment Analysis: A Comparative Classification Study
  • Jun 5, 2024
  • Big Data and Cognitive Computing
  • Konstantinos I Roumeliotis + 2 more

Cryptocurrencies are becoming increasingly prominent in financial investments, with more investors diversifying their portfolios and individuals drawn to their ease of use and decentralized financial opportunities. However, this accessibility also brings significant risks and rewards, often influenced by news and the sentiments of crypto investors, known as crypto signals. This paper explores the capabilities of large language models (LLMs) and natural language processing (NLP) models in analyzing sentiment from cryptocurrency-related news articles. We fine-tune state-of-the-art models such as GPT-4, BERT, and FinBERT for this specific task, evaluating their performance and comparing their effectiveness in sentiment classification. By leveraging these advanced techniques, we aim to enhance the understanding of sentiment dynamics in the cryptocurrency market, providing insights that can inform investment decisions and risk management strategies. The outcomes of this comparative study contribute to the broader discourse on applying advanced NLP models to cryptocurrency sentiment analysis, with implications for both academic research and practical applications in financial markets.

  • Research Article
  • 10.1016/j.jpainsymman.2025.09.025
Assessment of a Zero-Shot Large Language Model in Measuring Documented Goals-of-Care Discussions.
  • Jan 1, 2026
  • Journal of pain and symptom management
  • Robert Y Lee + 6 more

  • Research Article
  • Citations: 1
  • 10.1101/2025.05.23.25328115
Assessment of a zero-shot large language model in measuring documented goals-of-care discussions
  • Sep 26, 2025
  • medRxiv
  • Robert Y Lee + 6 more

Context: Goals-of-care (GOC) discussions and their documentation are important process measures in palliative care. However, existing natural language processing (NLP) models for identifying such documentation require costly task-specific training data. Large language models (LLMs) hold promise for measuring such constructs with fewer or no task-specific training data. Objective: To evaluate the performance of a publicly available LLM with no task-specific training data (zero-shot prompting) for identifying documented GOC discussions. Methods: We compared performance of two NLP models in identifying documented GOC discussions: Llama 3.3 using zero-shot prompting; and, a task-specific BERT (Bidirectional Encoder Representations from Transformers)-based model trained on 4,642 manually annotated notes. We tested both models on records from a series of clinical trials enrolling adult patients with chronic life-limiting illness hospitalized over 2018-2023. We evaluated the area under the receiver operating characteristic curve (AUC), area under the precision-recall curve (AUPRC), and maximal F1 score, for both note-level and patient-level classification over a 30-day period. Results: In our text corpora, GOC documentation represented <1% of text and was found in 7.3-9.9% of notes for 23-37% of patients. In a 617-patient held-out test set, Llama 3.3 (zero-shot) and BERT (task-specific, trained) exhibited comparable performance in identifying GOC documentation (Llama 3.3: AUC 0.979, AUPRC 0.873, and F1 0.83; BERT: AUC 0.981, AUPRC 0.874, and F1 0.83). Conclusion: A zero-shot large language model with no task-specific training performed similarly to a task-specific trained BERT model in identifying documented goals-of-care discussions. This demonstrates the promise of LLMs in measuring novel clinical research outcomes.

  • Research Article
  • 10.1093/ndt/gfae069.792
#2924 Comparison of large language models and traditional natural language processing techniques in predicting arteriovenous fistula failure
  • May 23, 2024
  • Nephrology Dialysis Transplantation
  • Suman Lama + 6 more

Background and Aims: Large language models (LLMs) have gained significant attention in the field of natural language processing (NLP), marking a shift from traditional techniques like Term Frequency-Inverse Document Frequency (TF-IDF). We developed a traditional NLP model to predict arteriovenous fistula (AVF) failure within the next 30 days using clinical notes. The goal of this analysis was to investigate whether LLMs would outperform traditional NLP techniques, specifically in the context of predicting AVF failure within the next 30 days using clinical notes. Method: We defined AVF failure as the change in status from active to permanently unusable status or temporarily unusable status. We used data from a large kidney care network from January 2021 to December 2021. Two models were created using LLMs and the traditional TF-IDF technique. We used “distilbert-base-uncased”, a distilled version of the BERT base model [1], and compared its performance with traditional TF-IDF-based NLP techniques. The dataset was randomly divided into 60% training, 20% validation and 20% test datasets. The test data, comprising unseen patients’ data, was used to evaluate the performance of the model. Both models were evaluated using metrics such as area under the receiver operating curve (AUROC), accuracy, sensitivity, and specificity. Results: The 30-day AVF failure rate was 2.3% in the population. The LLM and traditional approaches showed similar overall performance, as summarized in Table 1. Notably, LLMs showed marginally better performance in certain evaluation metrics. Both models had the same AUROC of 0.64 on test data. The accuracy and balanced accuracy for LLMs were 72.9% and 59.7%, respectively, compared to 70.9% and 59.6% for the traditional TF-IDF approach. In terms of specificity, LLMs scored 73.2%, slightly higher than the 71.2% observed for traditional NLP methods. However, LLMs had a lower sensitivity of 46.1% compared to 48% for traditional NLP. It is also worth noting that training the LLM took considerably longer than TF-IDF. Moreover, it used more computational resources, such as graphics processing unit (GPU) instances in cloud-based services, leading to higher cost. Conclusion: In our study, we discovered that advanced LLMs perform comparably to traditional TF-IDF modeling techniques in predicting the failure of AVF. Both models demonstrated identical AUROC. While specificity was higher in LLMs compared to traditional NLP, sensitivity was higher in traditional NLP compared to LLMs. The LLM was fine-tuned with a limited dataset, which could have influenced its performance to be similar to that of traditional NLP methods. This finding suggests that while LLMs may excel in certain scenarios, such as performing in-depth sentiment analysis of patient data for complex tasks, their effectiveness is highly dependent on the specific use case. It is crucial to weigh the benefits against the resources required for LLMs, as they can be significantly more resource-intensive and costly compared to traditional TF-IDF methods. This highlights the importance of a use-case-driven approach in selecting the appropriate NLP technique for healthcare applications.
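
The balanced-accuracy figures in this abstract can be reproduced from the reported sensitivity and specificity, since balanced accuracy is their arithmetic mean. The sketch below is only a consistency check on the published numbers; the function name `balanced_accuracy` is chosen for this example, and because the inputs are themselves rounded to one decimal, agreement is expected only to within about 0.1 percentage points.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of sensitivity and specificity (both given in percent here)."""
    return (sensitivity + specificity) / 2


# (sensitivity %, specificity %, balanced accuracy % as reported)
models = {
    "LLM (distilbert-base-uncased)": (46.1, 73.2, 59.7),
    "Traditional TF-IDF": (48.0, 71.2, 59.6),
}

for name, (sensitivity, specificity, reported) in models.items():
    computed = balanced_accuracy(sensitivity, specificity)
    print(f"{name}: computed {computed:.2f}%, reported {reported:.1f}%")
```

With a 2.3% failure rate, plain accuracy is dominated by the majority class, which is why balanced accuracy and AUROC are the more informative comparisons here.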

  • Research Article
  • Citations: 5
  • 10.1145/3749840
AdaptiveLog: An Adaptive Log Analysis Framework with the Collaboration of Large and Small Language Model
  • Jul 22, 2025
  • ACM Transactions on Software Engineering and Methodology
  • Lipeng Ma + 8 more

Automated log analysis is crucial to ensure the high availability and reliability of complex systems. The advent of large language models (LLMs) in natural language processing (NLP) has ushered in a new era of language model-driven automated log analysis, garnering significant interest. Within this field, two primary paradigms based on language models for log analysis have become prominent. Small Language Models (SLMs) (such as BERT) follow the pre-train and fine-tune paradigm, focusing on the specific log analysis task through fine-tuning on supervised datasets. On the other hand, LLMs (such as ChatGPT) follow the in-context learning paradigm, analyzing logs by providing a few examples in prompt contexts without updating parameters. Despite their respective strengths, both models exhibit inherent limitations. By comparing SLMs and LLMs, we notice that SLMs are more cost-effective but less powerful, whereas LLMs with large parameters are highly powerful but expensive and inefficient. To trade off between the performance and inference costs of both models in automated log analysis, this paper introduces an adaptive log analysis framework known as AdaptiveLog, which effectively reduces the costs associated with LLMs while ensuring superior results. This framework combines an LLM and a small language model, strategically allocating the LLM to tackle complex logs while delegating simpler logs to the SLM. Specifically, to efficiently query the LLM, we propose an adaptive selection strategy based on the uncertainty estimation of the SLM, where the LLM is invoked only when the SLM is uncertain. In addition, to enhance the reasoning ability of the LLM in log analysis tasks, we propose a novel prompt strategy by retrieving similar error-prone cases as the reference, enabling the model to leverage past error experiences and learn solutions from these cases. We evaluate AdaptiveLog on different log analysis tasks. Extensive experiments demonstrate that AdaptiveLog achieves state-of-the-art results across different tasks, elevating the overall accuracy of log analysis while maintaining cost efficiency. Our source code and detailed experimental data are available at https://github.com/LeaperOvO/AdaptiveLog-review.
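
The routing idea described in this abstract, calling the LLM only when the SLM is uncertain, can be sketched as follows. This is not the AdaptiveLog implementation: the abstract does not specify how uncertainty is estimated, so normalized entropy of the SLM's class probabilities is used here as a stand-in, and the names `normalized_entropy`, `route`, `toy_slm`, and `toy_llm`, along with the 0.5 threshold, are hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple


def normalized_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy scaled to [0, 1]; 1 means maximally uncertain."""
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))


def route(
    log_line: str,
    slm_predict: Callable[[str], Sequence[float]],
    llm_predict: Callable[[str], int],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Tuple[int, str]:
    """Return (label, source): use the SLM unless it is too uncertain."""
    probs = slm_predict(log_line)
    if normalized_entropy(probs) > threshold:
        return llm_predict(log_line), "llm"
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, "slm"


# Toy stand-ins for the two models (hypothetical behaviour for the demo).
def toy_slm(log_line: str) -> Sequence[float]:
    # Confident on lines with an explicit ERROR tag, unsure otherwise.
    return [0.98, 0.02] if "ERROR" in log_line else [0.55, 0.45]


def toy_llm(log_line: str) -> int:
    return 1


if __name__ == "__main__":
    for line in ["ERROR disk full", "connection reset by peer?"]:
        print(line, "->", route(line, toy_slm, toy_llm))
```

The design trade-off sits in the threshold: raising it sends fewer logs to the expensive model, lowering it buys accuracy on borderline cases at higher inference cost.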

  • Research Article
  • Citations: 13
  • 10.54254/2755-2721/97/20241406
Advancements and Applications of Large Language Models in Natural Language Processing: A Comprehensive Review
  • Nov 26, 2024
  • Applied and Computational Engineering
  • Mengchao Ren

Large language models (LLMs) have revolutionized the field of natural language processing (NLP), demonstrating remarkable capabilities in understanding, generating, and manipulating human language. This comprehensive review explores the development, applications, optimizations, and challenges of LLMs. We begin by tracing the evolution of these models and their foundational architectures, such as the Transformer, GPT, and BERT. We then delve into the applications of LLMs in natural language understanding tasks, including sentiment analysis, named entity recognition, question answering, and text summarization, highlighting real-world use cases. Next, we examine the role of LLMs in natural language generation, covering areas such as content creation, language translation, personalized recommendations, and automated responses. We further discuss LLM applications in other NLP tasks like text style transfer, text correction, and language model pre-training. Subsequently, we explore techniques for optimizing and improving LLMs, including model compression, explainability, robustness, and security. Finally, we address the challenges posed by the significant computational requirements, sample inefficiency, and ethical considerations surrounding LLMs. We conclude by discussing potential future research directions, such as efficient architectures, few-shot learning, bias mitigation, and privacy-preserving techniques, which will shape the ongoing development and responsible deployment of LLMs in NLP.

  • Research Article
  • Citations: 40
  • 10.1016/j.ajic.2024.03.016
Utilizing natural language processing and large language models in the diagnosis and prediction of infectious diseases: A systematic review
  • Apr 6, 2024
  • AJIC: American Journal of Infection Control
  • Mahmud Omar + 3 more

  • Research Article
  • Citations: 1
  • 10.1111/1556-4029.70281
Improving drug identification in overdose death surveillance by using clinical natural language processing models.
  • Feb 8, 2026
  • Journal of forensic sciences
  • Arthur J Funnell + 8 more

The rising rate of drug-related deaths in the United States, largely driven by fentanyl, requires timely and accurate surveillance. However, critical overdose data are often buried in free-text coroner reports, leading to delays and information loss when coded into ICD (International Classification of Disease)-10 classifications. Natural language processing (NLP) models may automate and enhance overdose surveillance, but prior applications have been limited. A dataset of 35,433 death records from multiple US jurisdictions in 2020 was used for model training and internal testing. External validation was conducted using a novel separate dataset of 3335 records from 2023 to 2024. Multiple NLP approaches were evaluated for classifying specific drug involvement from unstructured death certificate text. These included traditional single- and multi-label classifiers, as well as fine-tuned encoder-only language models such as Bidirectional Encoder Representations from Transformers (BERT) and BioClinicalBERT, and contemporary decoder-only large language models (LLMs) such as Qwen 3 and Llama 3. Model performance was assessed using macro-averaged F1 scores, and 95% confidence intervals were calculated to quantify uncertainty. Fine-tuned BioClinicalBERT models achieved near-perfect performance, with macro F1 scores ≥0.998 on the internal test set. External validation confirmed robustness (macro F1 = 0.966), outperforming conventional machine learning, general-domain BERT models, and various decoder-only LLMs. NLP models, particularly fine-tuned clinical variants like BioClinicalBERT, offer a highly accurate and scalable solution for overdose death classification from free-text reports. These methods can significantly accelerate surveillance workflows, overcoming the limitations of manual ICD-10 coding and supporting near real-time detection of emerging substance use trends.
