LitAutoScreener: Development and Validation of an Automated Literature Screening Tool in Evidence-Based Medicine Driven by Large Language Models
Background: Traditional manual literature screening is time-consuming and labor-intensive. A pressing question is how large language models can be leveraged to improve the efficiency and quality of evidence-based evaluations of drug efficacy and safety. Methods: This study used a manually curated reference literature database—comprising vaccine, hypoglycemic agent, and antidepressant evaluation studies—previously developed by our team through conventional systematic review methods. This validated database served as the gold standard for the development and optimization of LitAutoScreener. Following the PICOS (Population, Intervention, Comparison, Outcomes, Study Design) principles, a chain-of-thought reasoning approach with few-shot learning prompts was implemented to develop the screening algorithm. We then evaluated the performance of LitAutoScreener in 2 independent validation cohorts, assessing both classification accuracy and processing efficiency. Results: In title–abstract screening on the respiratory syncytial virus vaccine safety validation set, our tools based on GPT (GPT-4o), Kimi (moonshot-v1-128k), and DeepSeek (deepseek-chat 2.5) demonstrated high accuracy in inclusion/exclusion decisions (99.38%, 98.94%, and 98.85%, respectively). Recall rates were 100.00%, 99.13%, and 98.26%, with statistically significant performance differences (χ² = 5.99, P = 0.048); GPT outperformed the other models. Exclusion reason concordance rates were 98.85%, 94.79%, and 96.47% (χ² = 30.22, P < 0.001). In full-text screening, all models maintained perfect recall (100.00%), with accuracies of 100.00% (GPT), 100.00% (Kimi), and 99.45% (DeepSeek). Processing times averaged 1 to 5 s per article for title–abstract screening and 60 s for full-text processing (including PDF preprocessing). Conclusions: LitAutoScreener offers a new approach for efficient literature screening in drug intervention studies, achieving high accuracy and substantially improving screening efficiency.
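The abstract does not publish LitAutoScreener's actual prompts, but the approach it names (PICOS-anchored criteria, few-shot examples, chain-of-thought reasoning) can be sketched. Below is a minimal, hypothetical illustration using the OpenAI Python SDK; the criteria text, worked examples, and model choice are invented placeholders, not the authors' implementation.

```python
# Hypothetical sketch of PICOS-anchored title/abstract screening with
# few-shot chain-of-thought prompting. Criteria, examples, and model name
# are illustrative placeholders, not LitAutoScreener's actual prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CRITERIA = """Population: adults eligible for RSV vaccination
Intervention: any licensed RSV vaccine
Comparison: placebo or no vaccination
Outcomes: safety endpoints (adverse events)
Study design: randomized controlled trials"""

FEW_SHOT = """Example 1
Title/Abstract: "Safety of an RSV prefusion F vaccine: a randomized, placebo-controlled trial ..."
Reasoning: population, intervention, design, and safety outcomes all match. -> INCLUDE
Example 2
Title/Abstract: "Cost-effectiveness modeling of RSV vaccination programs ..."
Reasoning: modeling study with no trial safety data; study-design criterion fails. -> EXCLUDE (wrong study design)"""

def screen(title_abstract: str) -> str:
    """Ask the model to check each PICOS criterion, then give a verdict."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You screen citations for a systematic review. Check each "
                        "PICOS criterion step by step, then answer INCLUDE or "
                        "EXCLUDE, citing the failed criterion as the reason."},
            {"role": "user",
             "content": f"Criteria:\n{CRITERIA}\n\n{FEW_SHOT}\n\n"
                        f"Now screen:\nTitle/Abstract: {title_abstract}"},
        ],
    )
    return response.choices[0].message.content
```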
- Research Article
- Cited by 2
- 10.2196/67488
- Mar 11, 2025
- Journal of Medical Internet Research
Systematic reviews and meta-analyses rely on labor-intensive literature screening. While machine learning offers potential automation, its accuracy remains suboptimal, raising the question of whether emerging large language models (LLMs) can provide a more accurate and efficient approach. This paper evaluates the sensitivity, specificity, and summary receiver operating characteristic (SROC) curve of LLM-assisted literature screening. We conducted a diagnostic study comparing the accuracy of LLM-assisted versus manual literature screening across 6 thoracic surgery meta-analyses. Manual screening by 2 investigators served as the reference standard. LLM-assisted screening was performed using ChatGPT-4o (OpenAI) and Claude-3.5 Sonnet (Anthropic), with discrepancies resolved by Gemini-1.5 Pro (Google). Two open-source, machine learning–based screening tools, ASReview (Utrecht University) and Abstrackr (Center for Evidence Synthesis in Health, Brown University School of Public Health), were also evaluated. We calculated sensitivity, specificity, and 95% CIs for both title and abstract screening and full-text screening, generating pooled estimates and SROC curves. LLM prompts were revised based on a post hoc error analysis. LLM-assisted full-text screening demonstrated high pooled sensitivity (0.87, 95% CI 0.77-0.99) and specificity (0.96, 95% CI 0.91-0.98), with an area under the curve (AUC) of 0.96 (95% CI 0.94-0.97). Title and abstract screening achieved a pooled sensitivity of 0.73 (95% CI 0.57-0.85) and specificity of 0.99 (95% CI 0.97-0.99), with an AUC of 0.97 (95% CI 0.96-0.99). Post hoc revisions improved sensitivity to 0.98 (95% CI 0.74-1.00) while maintaining high specificity (0.98, 95% CI 0.94-0.99). In comparison, ASReview-assisted screening achieved a pooled sensitivity of 0.58 (95% CI 0.53-0.64) and specificity of 0.97 (95% CI 0.91-0.99), with an AUC of 0.66 (95% CI 0.62-0.70), and Abstrackr-assisted screening achieved a pooled sensitivity of 0.48 (95% CI 0.35-0.62) and specificity of 0.96 (95% CI 0.88-0.99), with an AUC of 0.78 (95% CI 0.74-0.82). A post hoc meta-analysis revealed comparable effect sizes between LLM-assisted and conventional screening. LLMs hold significant potential for streamlining literature screening in systematic reviews, reducing workload without sacrificing quality. Importantly, LLMs outperformed the traditional machine learning–based tools (ASReview and Abstrackr) in both sensitivity and AUC, suggesting that they offer a more accurate and efficient approach to literature screening.
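For readers who want to reproduce the per-review numbers: sensitivity and specificity are simple ratios over a 2x2 confusion matrix. A small self-contained helper with Wilson score 95% CIs is sketched below (the pooled estimates and SROC curve additionally require a bivariate meta-analytic model, not shown); the counts in the example are invented.

```python
# Sensitivity/specificity with Wilson score 95% CIs from screening counts.
# The SROC and pooled estimates in the abstract need a bivariate
# random-effects model; this covers only the per-review 2x2 math.
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n == 0:
        return (float("nan"), float("nan"))
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

def screening_accuracy(tp: int, fn: int, tn: int, fp: int) -> dict:
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": (sens, wilson_ci(tp, tp + fn)),
        "specificity": (spec, wilson_ci(tn, tn + fp)),
    }

# Invented example: 20 truly relevant records, the LLM recovers 18;
# 980 truly irrelevant records, 30 false alarms.
print(screening_accuracy(tp=18, fn=2, tn=950, fp=30))
```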
- Supplementary Content
- 10.7759/cureus.90026
- Aug 13, 2025
- Cureus
Systematic and scoping reviews are essential in palliative care, yet they are time-consuming and resource-intensive. Recent advancements in artificial intelligence, particularly large language models (LLMs), have shown promise in enhancing the efficiency of literature screening. However, their feasibility and accuracy in scoping reviews remain unclear. In this study, we aimed to evaluate the feasibility and performance of LLM-assisted citation screening for a scoping review on nonpharmacological interventions for delirium in patients with cancer. This prospective simulation study assessed the accuracy of three LLMs, GPT-4 Turbo, GPT-4o, and model o1 (OpenAI, San Francisco, CA, USA), in screening titles and abstracts. The dataset was derived from a previously conducted scoping review. Two reference standards were used for comparison: title/abstract screening and full-text screening results from conventional human review. LLMs were prompted using standardized inclusion and exclusion criteria based on the Population, Concept, and Context (PCC) framework. Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated for each model. Compared with reference standard 1 (title/abstract screening results from conventional citation screening), the sensitivity and specificity were 0.43 (95% CI, 0.06-0.80) and 0.99 (95% CI, 0.99-1.00) for GPT-4 Turbo, 0.71 (95% CI, 0.38-1.00) and 0.97 (95% CI, 0.96-0.98) for GPT-4o, and 1.00 (95% CI, 1.00-1.00) and 0.91 (95% CI, 0.89-0.92) for o1, respectively. Compared with reference standard 2 (full-text screening results from conventional citation screening), the sensitivity and specificity were 1.00 (95% CI, 1.00-1.00) and 0.99 (95% CI, 0.99-1.00) for GPT-4 Turbo, 1.00 (95% CI, 1.00-1.00) and 0.97 (95% CI, 0.96-0.98) for GPT-4o, and 1.00 (95% CI, 1.00-1.00) and 0.90 (95% CI, 0.89-0.92) for o1, respectively. All models demonstrated high NPVs, indicating strong reliability in excluding irrelevant studies. However, PPVs were low across all models, reflecting a high false-positive rate. Newer LLMs, particularly model o1, demonstrated high sensitivity and acceptable specificity, supporting their use as preliminary screening tools in scoping reviews. High NPVs suggest LLMs are reliable for ruling out irrelevant citations, thereby streamlining the initial screening phase. However, consistently low PPVs raise concerns about increased reviewer burden due to false positives, emphasizing the necessity of human validation. These findings support the cautious integration of LLMs into literature screening workflows, treating their outputs as supportive tools rather than replacements for expert judgment.
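The pattern of high NPV but low PPV follows directly from Bayes' rule when relevant citations are rare. A short worked illustration (the 1% prevalence is an assumption chosen for illustration; the sensitivity and specificity are the o1 values reported above):

```python
# Why PPV stays low even with good specificity when relevant citations are
# rare: PPV and NPV from sensitivity, specificity, and prevalence (Bayes).
def ppv_npv(sens: float, spec: float, prev: float) -> tuple[float, float]:
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

# o1-like operating point from the abstract, assuming relevant citations
# make up 1% of the corpus (prevalence chosen purely for illustration):
ppv, npv = ppv_npv(sens=1.00, spec=0.91, prev=0.01)
print(f"PPV={ppv:.2%}, NPV={npv:.2%}")  # -> roughly 10% PPV, 100% NPV
```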
- Research Article
- 10.1016/j.ijmedinf.2025.106048
- Dec 1, 2025
- International Journal of Medical Informatics
Leveraging open-source large language models (LLMs) in scoping reviews: a case study on disability and AI applications.
- Research Article
- Cited by 22
- 10.1001/jamanetworkopen.2024.20496
- Jul 8, 2024
- JAMA Network Open
Large language models (LLMs) are promising as tools for citation screening in systematic reviews, but their applicability has not yet been determined. This prospective diagnostic study evaluated the accuracy and efficiency of an LLM in title and abstract literature screening, using data from the title and abstract screening process for 5 clinical questions (CQs) in the development of the Japanese Clinical Practice Guidelines for Management of Sepsis and Septic Shock. The LLM decided to include or exclude citations based on inclusion and exclusion criteria framed in terms of the patient/population/problem, intervention, comparison, and study design of the selected CQ, and was compared with the conventional method for title and abstract screening. The study, conducted from January 7 to 15, 2024, compared LLM (GPT-4 Turbo)-assisted citation screening with the conventional method. The sensitivity and specificity of the LLM-assisted screening process were calculated, with the full-text screening result from the conventional method set as the reference standard in the primary analysis. Pooled sensitivity and specificity were also estimated, and screening times of the 2 methods were compared. In the conventional citation screening process, 8 of 5634 publications in CQ 1, 4 of 3418 in CQ 2, 4 of 1038 in CQ 3, 17 of 4326 in CQ 4, and 8 of 2253 in CQ 5 were selected. In the primary analysis of 5 CQs, LLM-assisted citation screening demonstrated an integrated sensitivity of 0.75 (95% CI, 0.43 to 0.92) and specificity of 0.99 (95% CI, 0.99 to 0.99). Post hoc modifications to the command prompt improved the integrated sensitivity to 0.91 (95% CI, 0.77 to 0.97) without substantially compromising specificity (0.98 [95% CI, 0.96 to 0.99]). Additionally, LLM-assisted screening was associated with reduced time for processing 100 studies (1.3 minutes vs 17.2 minutes for conventional screening methods; mean difference, -15.25 minutes [95% CI, -17.70 to -12.79 minutes]). In this prospective diagnostic study of LLM-assisted citation screening, the model demonstrated acceptable sensitivity and reasonably high specificity with reduced processing time. This novel method could potentially enhance efficiency and reduce workload in systematic reviews.
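At the scale reported here (thousands of citations per CQ), decisions need to be machine-readable so they can be tallied against the reference standard. A hypothetical sketch of batch screening with JSON-constrained output via the OpenAI Python SDK follows; the criteria wording, JSON field names, and model choice are placeholders, not the study's actual pipeline.

```python
# Hypothetical batch screening loop with JSON-constrained output, so each
# decision can be compared against the full-text reference standard.
# Criteria text, field names, and model choice are illustrative only.
import json
from openai import OpenAI

client = OpenAI()

def screen_batch(citations: list[dict], criteria: str) -> list[dict]:
    decisions = []
    for cit in citations:
        resp = client.chat.completions.create(
            model="gpt-4-turbo",
            temperature=0,
            response_format={"type": "json_object"},  # force parseable output
            messages=[
                {"role": "system",
                 "content": "Screen the citation against the criteria. Reply as "
                            'JSON: {"decision": "include" or "exclude", '
                            '"reason": "<failed criterion or null>"}'},
                {"role": "user",
                 "content": f"Criteria:\n{criteria}\n\n"
                            f"Title: {cit['title']}\nAbstract: {cit['abstract']}"},
            ],
        )
        decisions.append(json.loads(resp.choices[0].message.content))
    return decisions
```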
- Research Article
- Cited by 6
- 10.3390/app14199103
- Oct 9, 2024
- Applied Sciences
This study examines Retrieval-Augmented Generation (RAG) in large language models (LLMs) and its significant application to undertaking systematic literature reviews (SLRs). RAG-based LLMs can potentially automate tasks like data extraction, summarization, and trend identification. However, while LLMs are exceptionally proficient in generating human-like text and interpreting complex linguistic nuances, their dependence on static, pre-trained knowledge can result in inaccuracies and hallucinations. RAG mitigates these limitations by integrating LLMs’ generative capabilities with the precision of real-time information retrieval. We review in detail the three key processes of the RAG framework—retrieval, augmentation, and generation. We then discuss applications of RAG-based LLMs to SLR automation and highlight future research topics, including integration of domain-specific LLMs, multimodal data processing and generation, and utilization of multiple retrieval sources. We propose a framework of RAG-based LLMs for automating SLRs, which covers four stages of the SLR process: literature search, literature screening, data extraction, and information synthesis. Future research aims to optimize the interaction between LLM selection, training strategies, RAG techniques, and prompt engineering to implement the proposed framework, with particular emphasis on the retrieval of information from individual scientific papers and the integration of these data to produce outputs addressing various aspects such as current status, existing gaps, and emerging trends.
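The three RAG stages the paper reviews (retrieval, augmentation, and generation) fit in a few lines of code. The toy sketch below uses TF-IDF cosine similarity for retrieval and stuffs the top-k passages into the prompt; it is a deliberately simplified stand-in for the dense retrievers and vector stores used in practice, and the corpus entries and model name are placeholders.

```python
# Toy retrieve-augment-generate loop. TF-IDF retrieval is a simplified
# stand-in for the dense retrievers discussed in the paper; corpus entries
# and the model name are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from openai import OpenAI

corpus = [
    "Abstract 1: vitamin D supplementation and fall risk in older adults ...",
    "Abstract 2: deep learning for retinal image segmentation ...",
    "Abstract 3: randomized trial of exercise programs to prevent falls ...",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Retrieval: rank documents by TF-IDF cosine similarity to the query."""
    vec = TfidfVectorizer().fit(docs + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(docs))[0]
    return [docs[i] for i in sims.argsort()[::-1][:k]]

def rag_answer(query: str, docs: list[str]) -> str:
    """Augmentation + generation: ground the LLM in the retrieved passages."""
    context = "\n\n".join(retrieve(query, docs))
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"Answer using ONLY this context:\n{context}\n\n"
                              f"Question: {query}"}],
    )
    return resp.choices[0].message.content

print(rag_answer("Which interventions reduce falls?", corpus))
```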
- Research Article
- 10.1136/bmjdhai-2025-000017
- Oct 1, 2025
- BMJ Digital Health & AI
Objective: To evaluate and synthesise current applications of large language models (LLMs) in systematic reviews and meta-analyses (SRMAs), identify key limitations and propose an enhanced theoretical framework to improve the efficiency, scalability and reliability of evidence synthesis. Methods and analysis: We conducted a narrative review of recent studies applying LLMs across key SRMA stages. A total of 21 publications were analysed for model type, task application, accuracy metrics and workflow impact. Building on this evidence base, we designed a comprehensive LLM-enhanced SRMA framework that categorises LLM roles as consultants and assistants, integrates human-in-the-loop strategies and uses retrieval-augmented generation (RAG) and agent-based architectures to address critical challenges including hallucinations, bias and workflow inefficiency. Results: The reviewed literature demonstrated that LLMs can support various SRMA tasks with reported accuracy ranging from 61% to 99%, showing particular promise in literature screening and data extraction. Our proposed framework conceptualises modular integration of LLMs across all six SRMA stages, with LLMs serving as consultants for research question formulation and search strategy development and as assistants for task automation including abstract screening and structured data extraction. The framework incorporates RAG technology to reduce hallucinations by grounding outputs in retrieved literature and employs agent-based orchestration for complex analytical workflows. Theoretical analysis suggests potential for significant efficiency gains while maintaining methodological rigour through strategic human oversight. Conclusion: LLMs offer substantial theoretical potential to transform evidence synthesis by improving efficiency, scalability and consistency across SRMA workflows. The proposed LLM-enhanced framework provides a systematic, theoretically grounded approach for integrating advanced artificial intelligence capabilities into existing SRMA methodologies while preserving essential human oversight and analytical integrity. Future empirical studies are needed to validate the framework’s practical effectiveness, establish implementation protocols and demonstrate real-world benefits in evidence-based medicine.
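The framework's human-in-the-loop strategy can be made concrete as a confidence-gated triage: confident LLM decisions are accepted automatically, and uncertain ones are routed to a reviewer queue. The sketch below is one possible rendering under an assumed record format and an arbitrary 0.9 threshold; it is not drawn from the paper.

```python
# Sketch of the human-in-the-loop strategy described above: accept confident
# LLM screening decisions automatically and route uncertain ones to a human
# reviewer queue. The 0.9 threshold and record format are assumptions.
from dataclasses import dataclass, field

@dataclass
class Triage:
    auto_included: list = field(default_factory=list)
    auto_excluded: list = field(default_factory=list)
    human_queue: list = field(default_factory=list)

def triage(decisions, threshold: float = 0.9) -> Triage:
    """decisions: iterable of (record_id, label, confidence) triples."""
    out = Triage()
    for record_id, label, confidence in decisions:
        if confidence < threshold:
            out.human_queue.append(record_id)   # strategic human oversight
        elif label == "include":
            out.auto_included.append(record_id)
        else:
            out.auto_excluded.append(record_id)
    return out

print(triage([(1, "include", 0.97), (2, "exclude", 0.99), (3, "exclude", 0.55)]).human_queue)
# -> [3]: only the uncertain record goes to a reviewer
```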
- Research Article
- Cited by 7
- 10.7326/annals-24-02189
- Feb 25, 2025
- Annals of Internal Medicine
Background: Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis. Objective: To develop generic prompt templates for large language model (LLM)-driven abstract and full-text screening that can be adapted to different reviews. Design: Diagnostic test accuracy study. Setting: 48,425 citations were tested for abstract screening across 10 SRs; full-text screening evaluated all 12,690 freely available articles from the original search. Prompt development used the GPT4-0125-preview model (OpenAI). Measurements: Large language models were prompted to include or exclude articles based on SR eligibility criteria. Model outputs were compared with original SR author decisions after full-text screening to evaluate performance (accuracy, sensitivity, and specificity). Results: Optimized prompts using GPT4-0125-preview achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening, and a weighted sensitivity of 96.5% (range, 89.7% to 100.0%) and specificity of 91.2% (range, 80.7% to 100%) in full-text screening across 10 SRs. In contrast, zero-shot prompts had poor sensitivity (49.0% abstract, 49.1% full-text). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants had similar performance, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) models underperformed. Direct screening costs for 10,000 citations differed substantially: where single human abstract screening was estimated to require more than 83 hours and $1666.67 USD, our LLM-based approach completed screening in under 1 day for $157.02 USD. Limitations: Further prompt optimizations may exist; the study was retrospective, used a convenience sample of SRs, and full-text screening evaluations were limited to free PubMed Central full-text articles. Conclusion: A generic prompt for abstract and full-text screening achieving high sensitivity and specificity that can be adapted to other SRs and LLMs was developed. Our prompting innovations may have value to SR investigators and researchers conducting similar criteria-based tasks across the medical sciences.
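The paper's actual templates are published with it; purely to illustrate the "generic template with per-review slots" idea, here is a minimal hypothetical version using Python's string templating. The criteria values and the sensitivity-first instruction are placeholders reflecting the abstract's emphasis on high sensitivity, not the authors' wording.

```python
# Hypothetical review-agnostic screening template with slots for each SR's
# eligibility criteria. The published Annals templates differ; criteria
# values below are placeholders.
from string import Template

GENERIC_TEMPLATE = Template("""You are screening citations for a systematic review.

Inclusion criteria:
$inclusion

Exclusion criteria:
$exclusion

Citation:
$citation

If the citation does not clearly violate a criterion, err toward inclusion
(downstream full-text review will remove false positives). Answer INCLUDE or EXCLUDE.""")

prompt = GENERIC_TEMPLATE.substitute(
    inclusion="- adults with condition X\n- randomized controlled trials",
    exclusion="- animal studies\n- case reports",
    citation="Title: ...\nAbstract: ...",
)
print(prompt)
```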
- Research Article
- 10.1016/j.adaj.2025.08.011
- Oct 1, 2025
- The Journal of the American Dental Association
Comparing the performance of ChatGPT, DeepSeek, and Gemini in systematic and umbrella review tasks over time.
- Research Article
- Cited by 1
- 10.1038/s41598-025-89996-w
- Feb 20, 2025
- Scientific Reports
Large language models (LLMs) can improve text analysis efficiency in healthcare. This study explores the application of LLMs to analyze patient perspectives within the exception from informed consent (EFIC) process, which waives consent in emergency research. Our objective is to assess whether LLMs can analyze patient perspectives in EFIC interviews with performance comparable to human reviewers. We analyzed 102 EFIC community interviews from 9 sites, each with 46 questions, as part of the Pediatric Dose Optimization for Seizures in Emergency Medical Services study. We evaluated 5 LLMs, including GPT-4, to assess sentiment polarity on a 5-point scale and classify responses into predefined thematic classes. Three human reviewers conducted parallel analyses, with agreement measured by Cohen’s kappa and classification accuracy. Polarity scores between the LLM and human reviewers showed substantial agreement (Cohen’s kappa: 0.69, 95% CI 0.61–0.76), with major discrepancies in only 4.7% of responses. The LLM achieved high thematic classification accuracy (0.868, 95% CI 0.853–0.881), comparable to inter-rater agreement among human reviewers (0.867, 95% CI 0.836–0.901). LLMs enabled large-scale visual analysis, comparing response statistics across sites, questions, and classes. LLMs efficiently analyzed patient perspectives in EFIC interviews, demonstrating substantial agreement in sentiment assessment and strong thematic classification performance. However, occasional underperformance suggests LLMs should complement, not replace, human judgment. Future work should evaluate LLM integration in EFIC to enhance efficiency, reduce subjectivity, and support accurate patient perspective analysis.
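For readers unfamiliar with the agreement statistic used above: Cohen's kappa corrects raw agreement for the agreement expected by chance, κ = (p_o − p_e)/(1 − p_e). A self-contained computation follows; the labels are invented 5-point polarity scores, not study data.

```python
# Cohen's kappa between two raters (e.g., LLM vs. human polarity labels):
# kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e
# is chance agreement implied by each rater's label frequencies.
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    assert len(a) == len(b) and a
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    return (p_o - p_e) / (1 - p_e)

llm   = [2, 1, 0, -1, 2, 1, 1, 0]   # invented polarity scores
human = [2, 1, 0, -2, 2, 1, 0, 0]
print(round(cohens_kappa(llm, human), 2))  # -> 0.67
```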
- Research Article
- 10.1186/s12874-025-02583-5
- May 10, 2025
- BMC Medical Research Methodology
Background: Systematic reviews (SRs) are essential to formulate evidence-based guidelines but require time-consuming and costly literature screening. Large language models (LLMs) can be a powerful tool to expedite SRs. Methods: We conducted a comparative study to evaluate the performance of a commercial tool, Rayyan, and an in-house LLM-based system in automating the screening of a completed SR on Vitamin D and falls. The SR retrieved 14,439 articles, and Rayyan was trained with 2,000 manually screened articles to categorize the rest as most likely to exclude/include, likely to exclude/include, and undecided. We analyzed Rayyan’s title/abstract screening performance using different inclusion thresholds. For the LLM, we used prompt engineering for title/abstract screening and Retrieval-Augmented Generation (RAG) for full-text screening. We evaluated performance using the article exclusion rate (AER), false negative rate (FNR), specificity, positive predictive value (PPV), and negative predictive value (NPV). Additionally, we compared the time required to complete the screening steps of the SR using both approaches against the manual screening method. Results: Using Rayyan and treating articles categorized as undecided or likely to include as included at title/abstract screening resulted in an AER of 72.1% and an FNR of 5%. The total estimated screening time, including manual review of articles flagged by Rayyan, was 54.7 hours. Lowering the Rayyan threshold to ‘likely to exclude’ reduced the FNR to 0% and the AER to 50.7%, but increased the screening time to 81.3 hours. Using the LLM system, after title/abstract and full-text screening, 78 articles remained for manual review, including all 20 identified by traditional methods. The LLM achieved an AER of 99.5%, specificity of 99.6%, PPV of 25.6%, and NPV of 100%, with a total screening time of 25.5 hours including manual review of the 78 articles, reducing the manual screening time by 95.5%. Conclusions: The LLM-based system significantly enhances SR efficiency compared with manual methods and Rayyan, while maintaining a low FNR.
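The workflow metrics in this abstract reduce to simple counts. The helper below computes them under definitions inferred from context (AER as the fraction of retrieved articles excluded automatically, FNR as the fraction of truly relevant articles wrongly excluded) and reproduces the reported 99.5% AER for the LLM arm.

```python
# Workflow metrics from screening counts. Definitions are inferred from the
# abstract: AER = fraction of all retrieved articles excluded automatically;
# FNR = fraction of truly relevant articles the tool wrongly excludes.
def workflow_metrics(total: int, auto_excluded: int,
                     relevant_total: int, relevant_missed: int) -> dict:
    return {
        "AER": auto_excluded / total,
        "FNR": relevant_missed / relevant_total,
        "manual_review_left": total - auto_excluded,
    }

# LLM arm from the abstract: 14,439 retrieved, 78 left for manual review,
# all 20 truly relevant articles retained.
print(workflow_metrics(total=14_439, auto_excluded=14_439 - 78,
                       relevant_total=20, relevant_missed=0))
# AER comes out to ~0.9946, matching the reported 99.5%
```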
- Research Article
- Cited by 8
- 10.1038/s41598-024-81370-6
- Dec 30, 2024
- Scientific Reports
With breakthroughs in Natural Language Processing and Artificial Intelligence (AI), the usage of Large Language Models (LLMs) in academic research has increased tremendously. Models such as Generative Pre-trained Transformer (GPT) are used by researchers in literature review, abstract screening, and manuscript drafting. However, these models also present the attendant challenge of providing ethically questionable scientific information. Our study provides a snapshot of global researchers’ perception of current trends and future impacts of LLMs in research. Using a cross-sectional design, we surveyed 226 medical and paramedical researchers from 59 countries across 65 specialties, trained in the Global Clinical Scholars’ Research Training certificate program of Harvard Medical School between 2020 and 2024. A majority (57.5%) of these participants practiced in an academic setting, with a median of 7 (2, 18) PubMed-indexed published articles. 198 respondents (87.6%) were aware of LLMs, and those who were aware had a higher number of publications (p < 0.001). 18.7% of the respondents who were aware (n = 37) had previously used LLMs in publications, especially for grammatical errors and formatting (64.9%); however, a plurality (40.5%) did not acknowledge this use in their papers. 50.8% of aware respondents (n = 95) predicted an overall positive future impact of LLMs, while 32.6% were unsure of its scope. 52% of aware respondents (n = 102) believed that LLMs would have a major impact in areas such as grammatical errors and formatting (66.3%), revision and editing (57.2%), writing (57.2%), and literature review (54.2%). 58.1% of aware respondents opined that journals should allow the use of AI in research, and 78.3% believed that regulations should be put in place to avoid its abuse. Given researchers’ perceptions of LLMs and the significant association between awareness of LLMs and number of published works, we emphasize the importance of developing comprehensive guidelines and an ethical framework to govern the use of AI in academic research and address the current challenges.
- Research Article
- 10.2196/69286
- Jul 24, 2025
- JMIR Medical Informatics
Several clinical cases and experiments have demonstrated the effectiveness of traditional Chinese medicine (TCM) formulas in treating and preventing diseases. These formulas contain critical information about their ingredients, efficacy, and indications. Classifying TCM formulas based on this information can effectively standardize TCM formula management, support clinical and research applications, and promote the modernization and scientific use of TCM. To further advance this task, TCM formulas can be classified using various approaches, including manual classification, machine learning, and deep learning. Additionally, large language models (LLMs) are gaining prominence in the biomedical field. Integrating LLMs into TCM research could significantly enhance and accelerate the discovery of TCM knowledge by leveraging their advanced linguistic understanding and contextual reasoning capabilities. The objective of this study was to evaluate the performance of different LLMs on the TCM formula classification task and, by employing ensemble learning with multiple fine-tuned LLMs, to enhance classification accuracy. The TCM formula data were manually refined and cleaned. We selected 10 LLMs that support Chinese for fine-tuning. We then employed an ensemble learning approach that combined the predictions of multiple models using both hard and weighted voting, with weights determined by the average accuracy of each model. Finally, we selected the 5 most effective models from each series of LLMs for weighted voting (top 5) and the 3 most accurate of the 10 models for weighted voting (top 3). A total of 2441 TCM formulas were curated manually from multiple sources, including the Coding Rules for Chinese Medicinal Formulas and Their Codes, the Chinese National Medical Insurance Catalog for proprietary Chinese medicines, textbooks of TCM formulas, and the TCM literature. The dataset was divided into a training set of 1999 TCM formulas and a test set of 442 TCM formulas. The testing results showed that Qwen-14B achieved the highest accuracy among the single models, at 75.32%. The accuracy rates for hard voting, weighted voting, weighted voting (top 5), and weighted voting (top 3) were 75.79%, 76.47%, 75.57%, and 77.15%, respectively. This study explored the effectiveness of LLMs on the TCM formula classification task, proposing an ensemble learning method that integrates multiple fine-tuned LLMs through a voting mechanism. This method not only improves classification accuracy but also enhances the existing system for classifying the efficacy of TCM formulas.
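The hard- and weighted-voting scheme described above is straightforward to implement: each model casts a vote, and in the weighted variant a vote counts as much as that model's validation accuracy. A minimal sketch (the class labels and accuracy values are invented placeholders):

```python
# Hard vs. accuracy-weighted voting over multiple fine-tuned classifiers,
# mirroring the ensemble scheme described above. Accuracies and labels are
# invented; the paper derives weights from each model's average accuracy.
from collections import Counter, defaultdict

def hard_vote(predictions: list[str]) -> str:
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions: list[str], accuracies: list[float]) -> str:
    scores = defaultdict(float)
    for label, acc in zip(predictions, accuracies):
        scores[label] += acc          # each model's vote counts as its accuracy
    return max(scores, key=scores.get)

preds = ["heat-clearing", "tonifying", "heat-clearing"]   # placeholder classes
accs  = [0.753, 0.741, 0.698]                             # placeholder accuracies
print(hard_vote(preds), weighted_vote(preds, accs))
```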
- Research Article
- Cited by 2
- 10.1016/j.artmed.2024.103009
- Oct 31, 2024
- Artificial Intelligence in Medicine
Pre-trained Large Language Models (LLMs) have revolutionised Natural Language Processing (NLP) tasks, but often struggle when applied to specialised domains such as healthcare. The traditional approach of pre-training on large datasets followed by task-specific fine-tuning is resource-intensive and poorly aligned with the constraints of many healthcare settings. This presents a significant challenge for deploying LLM-based NLP solutions in medical contexts, where data privacy, computational resources, and domain-specific language pose unique obstacles. This study aims to develop and evaluate efficient methods for adapting smaller LLMs to healthcare-specific datasets and tasks. We seek to identify pre-training approaches that can effectively instil healthcare competency in compact LLMs under tight computational budgets, a crucial capability for responsible and sustainable deployment in local healthcare settings. We explore three specialised pre-training methods to adapt smaller LLMs to different healthcare datasets: traditional Masked Language Modelling (MLM), Deep Contrastive Learning for Unsupervised Textual Representations (DeCLUTR), and a novel approach utilising metadata categories from healthcare settings. These methods are assessed across multiple healthcare datasets, with a focus on downstream document classification tasks. We evaluate the performance of the resulting LLMs through classification accuracy and analysis of the derived embedding spaces. Contrastively trained models consistently outperform other approaches on classification tasks, delivering strong performance with limited labelled data and fewer model parameter updates. While our novel metadata-based pre-training does not further improve classifications across datasets, it yields interesting embedding cluster separability. Importantly, all domain-adapted LLMs outperform their publicly available, general-purpose base models, validating the importance of domain specialisation. This research demonstrates the efficacy of specialised pre-training methods in adapting compact LLMs to healthcare tasks, even under resource constraints. We provide guidelines for pre-training specialised healthcare LLMs and motivate continued inquiry into contrastive objectives. Our findings underscore the potential of these approaches for aligning small LLMs with privacy-sensitive medical tasks, offering a path toward more efficient and responsible NLP deployment in healthcare settings. This work contributes to the broader goal of making advanced NLP techniques accessible and effective in specialised domains, particularly where resource limitations and data sensitivity are significant concerns.
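DeCLUTR-style contrastive pre-training optimizes an InfoNCE-type objective: two spans embedded from the same document should be closer to each other than to any other item in the batch. A minimal PyTorch rendering of that objective is sketched below; it is illustrative only, not the authors' training code.

```python
# Minimal InfoNCE-style contrastive objective of the kind DeCLUTR uses:
# paired text embeddings (two spans from the same document) should be closer
# to each other than to any other item in the batch. Not the authors' code.
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    """anchor, positive: (batch, dim) embeddings; row i of each is a pair."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.T / temperature        # (batch, batch) cosine similarities
    targets = torch.arange(a.size(0))     # true positives sit on the diagonal
    return F.cross_entropy(logits, targets)

anchor, positive = torch.randn(8, 256), torch.randn(8, 256)
print(info_nce(anchor, positive))  # scalar loss to backpropagate
```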
- Research Article
- Cited by 60
- 10.1007/s40547-024-00143-4
- Mar 5, 2024
- Customer Needs and Solutions
In the rapidly advancing age of Generative AI, Large Language Models (LLMs) such as ChatGPT stand at the forefront of disrupting marketing practice and research. This paper presents a comprehensive exploration of LLMs’ proficiency in sentiment analysis, a core task in marketing research for understanding consumer emotions, opinions, and perceptions. We benchmark the performance of three state-of-the-art LLMs, i.e., GPT-3.5, GPT-4, and Llama 2, against established, high-performing transfer learning models. Despite their zero-shot nature, our research reveals that LLMs can not only compete with but in some cases also surpass traditional transfer learning methods in terms of sentiment classification accuracy. We investigate the influence of textual data characteristics and analytical procedures on classification accuracy, shedding light on how data origin, text complexity, and prompting techniques impact LLM performance. We find that linguistic features such as the presence of lengthy, content-laden words improve classification performance, while other features such as single-sentence reviews and less structured social media text documents reduce performance. Further, we explore the explainability of sentiment classifications generated by LLMs. The findings indicate that LLMs, especially Llama 2, offer remarkable classification explanations, highlighting their advanced human-like reasoning capabilities. Collectively, this paper enriches the current understanding of sentiment analysis, providing valuable insights and guidance for the selection of suitable methods by marketing researchers and practitioners in the age of Generative AI.
- Research Article
- Cited by 4
- 10.1073/pnas.2411962122
- Jan 6, 2025
- Proceedings of the National Academy of Sciences
Systematic reviews (SRs) synthesize evidence-based medical literature, but they involve labor-intensive manual article screening. Large language models (LLMs) can select relevant literature, but their quality and efficacy relative to human reviewers remain to be established. We evaluated the overlap between the title- and abstract-based article selections of 18 different LLMs and human-selected articles for three SRs. In the three SRs, 185/4,662, 122/1,741, and 45/66 articles had been selected and considered for full-text screening by two independent reviewers. Because of technical variation and the inability of the LLMs to classify all records, the effective sample sizes for the LLMs were smaller. However, on average, the 18 LLMs classified 4,294 (min 4,130; max 4,329), 1,539 (min 1,449; max 1,574), and 27 (min 22; max 37) of the titles and abstracts correctly as either included or excluded for the three SRs, respectively. Additional analysis revealed that the definitions of the inclusion criteria and conceptual designs significantly influenced LLM performance. In conclusion, LLMs can reduce one reviewer's workload by between 33% and 93% during title and abstract screening. However, the exact formulation of the inclusion and exclusion criteria should be refined beforehand for the LLMs to provide ideal support.