ARTIFICIAL INTELLIGENCE IN THE DIAGNOSIS OF SCOLIOSIS: A COMPARATIVE STUDY BETWEEN CHATGPT AND SURGEONS

Abstract

Objective: This study explores the accuracy of ChatGPT in classifying and suggesting approaches for adolescent idiopathic scoliosis, assessing the level of agreement between the artificial intelligence model's responses and the evaluations of spine surgery specialists. It aims to help answer the following question: can ChatGPT-4, a natural-language artificial intelligence, be trusted to recommend approaches for typical everyday cases, aiding less experienced orthopedists or even general practitioners? The proposed analysis seeks to identify the potential and limitations of applying artificial intelligence to support diagnosis and clinical decision-making without prior training of the platform.

Methods: This is a cross-sectional study involving five fictitious cases of idiopathic scoliosis presented to ChatGPT, which provided the Lenke classification and a suggested approach for each case. A panel of 37 surgeons evaluated the responses, determined the best approach, and scored ChatGPT's recommendations on a Likert scale from 1 to 5, reflecting their level of agreement.

Results: In the simplest case (Case 1), ChatGPT showed high agreement with the specialists: 97.3% of the surgeons agreed with the recommendation of "instrumentation surgery" (AC1 = 0.95). However, agreement was significantly lower in the more complex cases (Cases 3 and 5), with only 11.1% and 18.8% of the specialists, respectively, accepting the AI's recommendations. The model's accuracy in the Lenke classification was consistent across all cases, demonstrating its ability to apply standardized criteria. There was no significant correlation between the surgeons' experience and their level of agreement with the software.

Conclusion: ChatGPT showed potential as an auxiliary tool in the diagnosis and therapeutic planning of scoliosis, particularly in classification, but it cannot yet be relied on consistently, above all in complex cases where clinical nuances and individual patient factors weigh heavily. Although promising, the technology can complement clinical judgment, but it still requires supervision and does not replace specialized medical evaluation in the current scenario. Level of Evidence IV; Descriptive Observational Studies.
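
For context on the statistic quoted above: Gwet's AC1 is a chance-corrected agreement coefficient, generally more stable than Cohen's kappa when one category dominates. The following is a minimal Python sketch for the two-rater, categorical case; the function name and the example ratings are illustrative assumptions, not material from the study (which pooled a 37-surgeon panel).

```python
def gwet_ac1(ratings_a, ratings_b):
    """Gwet's first-order agreement coefficient (AC1) for two raters."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    if len(categories) == 1:  # only one category ever used: trivially perfect
        return 1.0

    # Observed agreement: fraction of items the raters label identically.
    pa = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement from the average marginal proportion of each category.
    pe = sum(
        pi * (1 - pi)
        for pi in (
            (ratings_a.count(c) + ratings_b.count(c)) / (2 * n)
            for c in categories
        )
    ) / (len(categories) - 1)

    return (pa - pe) / (1 - pe)

# Made-up example: 10 cases rated agree/disagree with ChatGPT's recommendation.
a = ["agree"] * 9 + ["disagree"]
b = ["agree"] * 8 + ["disagree"] * 2
print(round(gwet_ac1(a, b), 2))  # -> 0.87
```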


Similar Papers
  • Real-World Surveillance of FDA-Cleared Artificial Intelligence Models: Rationale and Logistics. Journal of the American College of Radiology, Feb 1, 2022. Keith J Dreyer et al. DOI: 10.1016/j.jacr.2021.06.025

  • Artificial Intelligence and Hand Hygiene Accuracy: A New Era in Infection Control for Dental Practices. Clinical and Experimental Dental Research, May 26, 2025. Salwa A Aldahlawi et al. DOI: 10.1002/cre2.70150

Objective: The study aimed to assess the efficacy of an artificial intelligence (AI) model in evaluating hand hygiene (HH) performance compared to infection control auditors in dental clinics. Material and Method: The AI model utilized a pretrained convolutional neural network (CNN) and was fine-tuned on a custom data set of videos showing dental students performing alcohol-based hand rub (ABHR) procedures. A total of 66 videos were recorded, with 33 used for training and 11 for validating the model. The remaining 22 videos were designated for testing and the AI versus infection control auditors comparison experiment. Two infection control auditors assessed the HH performance videos using a standardized checklist. The model's performance was evaluated through precision, recall, and F1 score across various classes. The level of agreement between the auditors and the AI assessments was measured using Cohen's kappa, and the sensitivity and specificity of the AI were compared to those of the infection control auditors. Results: The AI model learned to differentiate between classes of hand movement, with an overall F1 score of 0.85. Results showed a 90.91% agreement rate between the AI model and infection control auditors in evaluating HH steps, with a sensitivity of 85.7% and specificity of 100% in identifying acceptable HH practices. Step 3 (back of fingers to opposing palm with fingers interlocked) was consistently identified as the most frequently missed step by both the AI model and the infection control auditors. Conclusion: The AI model's assessment of HH performance closely matched the auditors' evaluations, suggesting its reliability as a tool for evaluating and mentoring HH in dental clinics. Future research should explore the application of AI technology in different dental settings to further validate its feasibility and adaptability.
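
The agreement and screening statistics above (Cohen's kappa, sensitivity, specificity) follow standard formulas. A self-contained Python sketch, with all names and data invented for illustration, might look like this:

```python
def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters over the same items."""
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n  # observed agreement
    cats = set(r1) | set(r2)
    # Expected chance agreement from each rater's marginal frequencies.
    pe = sum((r1.count(c) / n) * (r2.count(c) / n) for c in cats)
    return (po - pe) / (1 - pe)  # assumes pe < 1

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum(not t and not p for t, p in zip(y_true, y_pred))
    fp = sum(not t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

# Made-up data: auditor vs. AI verdicts on 8 hand-hygiene steps.
auditor = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
ai      = ["pass", "pass", "fail", "pass", "pass", "pass", "pass", "fail"]
print(round(cohens_kappa(auditor, ai), 2))  # -> 0.71
```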

  • Validation of Artificial Intelligence in the Classification of Adolescent Idiopathic Scoliosis and the Compairment to Clinical Manual Handling. Orthopaedic Surgery, Jul 3, 2024. Lu Tingsheng et al. DOI: 10.1111/os.14144

The accurate measurement of Cobb angles is crucial for the effective clinical management of patients with adolescent idiopathic scoliosis (AIS). The Lenke classification system plays a pivotal role in determining the appropriate fusion levels for treatment planning. However, the presence of interobserver variability and time-intensive procedures presents challenges for clinicians. The purpose of this study is to compare the measurement accuracy of our developed artificial intelligence measurement system for Cobb angles and Lenke classification in AIS patients with manual measurements to validate its feasibility. An artificial intelligence (AI) system measured the Cobb angle of AIS patients using convolutional neural networks, which identified the vertebral boundaries and sequences, recognized the upper and lower end vertebrae, and estimated the Cobb angles of the proximal thoracic, main thoracic, and thoracolumbar/lumbar curves sequentially. Accordingly, the Lenke classifications of scoliosis were divided by oscillogram and defined by the AI system. Furthermore, a man-machine comparison (n = 300) was conducted for senior spine surgeons (n = 2), junior spine surgeons (n = 2), and the AI system for the image measurements of proximal thoracic (PT), main thoracic (MT), thoracolumbar/lumbar (TL/L), thoracic sagittal profile T5-T12, bending views PT, bending views MT, bending views TL/L, the Lenke classification system, the lumbar modifier, and sagittal thoracic alignment. In the AI system, the calculation time for each patient's data was 0.2 s, while the measurement time for each surgeon was 23.6 min. The AI system showed high accuracy in the recognition of the Lenke classification and had high reliability compared to senior doctors (ICC 0.962). The AI system has high reliability for the Lenke classification and is a potential auxiliary tool for spinal surgeons.
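
The ICC reported above (0.962 against senior surgeons) is an intraclass correlation. The paper does not state which ICC form was used; as an illustration only, here is a compact NumPy sketch of one common choice, the two-way random-effects, absolute-agreement, single-measurement form ICC(2,1), with invented Cobb angles:

```python
import numpy as np

def icc_2_1(x):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.
    x: (n_subjects, k_raters) array, e.g. Cobb angles per radiograph and rater."""
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-subject means
    col_means = x.mean(axis=0)   # per-rater means
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)   # between-subjects MS
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)   # between-raters MS
    mse = ((x - row_means[:, None] - col_means[None, :] + grand) ** 2).sum() \
          / ((n - 1) * (k - 1))                            # residual MS
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Made-up Cobb angles (degrees): columns = AI system, senior surgeon.
cobb = np.array([[52.0, 50.5],
                 [31.0, 33.0],
                 [64.5, 63.0],
                 [45.0, 46.5]])
print(round(icc_2_1(cobb), 3))
```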

  • The potential impact of ChatGPT in clinical and translational medicine. Clinical and Translational Medicine, Mar 1, 2023. Vivian Weiwen Xue et al. DOI: 10.1002/ctm2.1216

  • Predictive modeling in reproductive medicine: Where will the future of artificial intelligence research take us? Fertility and Sterility, Nov 1, 2020. Carol Lynn Curchoe et al. DOI: 10.1016/j.fertnstert.2020.10.040

  • A novel artificial intelligence-based model for automated Lenke classification in adolescent idiopathic scoliosis. European Spine Journal, Jul 11, 2025. Kunjie Xie et al. DOI: 10.1007/s00586-025-09106-2

To develop an artificial intelligence (AI)-driven model for automatic Lenke classification of adolescent idiopathic scoliosis (AIS) and assess its performance. This retrospective study utilized 860 spinal radiographs from 215 AIS patients with four views, including 161 training sets and 54 testing sets. Additionally, 1220 spinal radiographs from 610 patients with only anterior-posterior (AP) and lateral (LAT) views were collected for training. The model was designed to perform keypoint detection, pedicle segmentation, and AIS classification based on a custom classification strategy. Its performance was evaluated against the gold standard using metrics such as mean absolute difference (MAD), intraclass correlation coefficient (ICC), Bland-Altman plots, Cohen's Kappa, and the confusion matrix. In comparison to the gold standard, the MAD for all predicted angles was 2.29°, with an excellent ICC. Bland-Altman analysis revealed minimal differences between the methods. For Lenke classification, the model exhibited exceptional consistency in curve type, lumbar modifier, and thoracic sagittal profile, with average Kappa values of 0.866, 0.845, and 0.827, respectively, and corresponding accuracy rates of 87.07%, 92.59%, and 92.59%. Subgroup analysis further confirmed the model's high consistency, with Kappa values ranging from 0.635 to 0.930, 0.672 to 0.926, and 0.815 to 0.847, and accuracy rates between 90.7 and 98.1%, 92.6-98.3%, and 92.6-98.1%, respectively. This novel AI system facilitates the rapid and accurate automatic Lenke classification, offering potential assistance to spinal surgeons.
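
The mean absolute difference (2.29°) and Bland-Altman analysis mentioned above are straightforward to compute. A hedged Python sketch with invented measurements (not the study's data):

```python
import numpy as np

# Made-up Cobb angles (degrees): AI prediction vs. gold standard.
ai   = np.array([51.2, 33.0, 64.8, 45.1, 28.9])
gold = np.array([53.0, 31.5, 63.0, 47.0, 30.0])

mad = np.mean(np.abs(ai - gold))        # mean absolute difference (MAD)

# Bland-Altman: mean difference (bias) and 95% limits of agreement.
diff = ai - gold
bias = diff.mean()
half_width = 1.96 * diff.std(ddof=1)
print(f"MAD = {mad:.2f} deg, bias = {bias:.2f} deg, "
      f"LoA = [{bias - half_width:.2f}, {bias + half_width:.2f}] deg")
```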

  • Evaluation of the Accuracy of Artificial Intelligence (AI) Models in Dermatological Diagnosis and Comparison With Dermatology Specialists. Cureus, Jan 7, 2025. Yuto Yamamura et al. DOI: 10.7759/cureus.77067

Recent advances in generative artificial intelligence (AI) have expanded its applications in diagnostic support within dermatology, but its clinical accuracy requires ongoing evaluation. This study compared the diagnostic performance of three advanced AI models, ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, with that of board-certified dermatologists, using a dataset of 30 cases encompassing a variety of dermatological conditions. The AI models demonstrated diagnostic accuracy comparable to, and sometimes exceeding, that of the specialists, particularly in rare and complex cases. Statistical analysis revealed no significant difference in accuracy rates between the AI models and dermatologists, indicating that AI may serve as a valuable supplementary diagnostic tool in dermatological practice. Limitations include a small sample size and potential selection bias. However, these findings underscore the progress in AI's diagnostic capabilities, supporting further validation with larger datasets and diverse clinical scenarios to confirm its practical utility.

  • A large language model improves clinicians' diagnostic performance in complex critical illness cases. Critical Care, Jun 6, 2025. Xintong Wu et al. DOI: 10.1186/s13054-025-05468-7

Background: Large language models (LLMs) have demonstrated potential in assisting clinical decision-making. However, studies evaluating LLMs' diagnostic performance on complex critical illness cases are lacking. We aimed to assess the diagnostic accuracy and response quality of an artificial intelligence (AI) model, and evaluate its potential benefits in assisting critical care residents with differential diagnosis of complex cases. Methods: This prospective comparative study collected challenging critical illness cases from the literature. Critical care residents from tertiary teaching hospitals were recruited and randomly assigned to non-AI-assisted physician and AI-assisted physician groups. We selected a reasoning model, DeepSeek-R1, for our study. We evaluated the model's response quality using Likert scales, and we compared the diagnostic accuracy and efficiency between groups. Results: A total of 48 cases were included. Thirty-two critical care residents were recruited, with 16 residents assigned to each group. Each resident handled an average of 3 cases. DeepSeek-R1's responses received median Likert grades of 4.0 (IQR 4.0–5.0; 95% CI 4.0–4.5) for completeness, 5.0 (IQR 4.0–5.0; 95% CI 4.5–5.0) for clarity, and 5.0 (IQR 4.0–5.0; 95% CI 4.0–5.0) for usefulness. The AI model's top diagnosis accuracy was 60% (29/48; 95% CI 0.456–0.729), with a median differential diagnosis quality score of 5.0 (IQR 4.0–5.0; 95% CI 4.5–5.0). Top diagnosis accuracy was 27% (13/48; 95% CI 0.146–0.396) in the non-AI-assisted physician group versus 58% (28/48; 95% CI 0.438–0.729) in the AI-assisted physician group. Median differential quality scores were 3.0 (IQR 0–5.0; 95% CI 2.0–4.0) without and 5.0 (IQR 3.0–5.0; 95% CI 3.0–5.0) with AI assistance. The AI model showed higher diagnostic accuracy than residents, and AI assistance significantly improved residents' accuracy. The residents' diagnostic time significantly decreased with AI assistance (median, 972 s; IQR 570–1320; 95% CI 675–1200) versus without (median, 1920 s; IQR 1320–2640; 95% CI 1710–2370). Conclusions: For diagnostically difficult critical illness cases, DeepSeek-R1 generates high-quality information, achieves reasonable diagnostic accuracy, and significantly improves residents' diagnostic accuracy and efficiency. Reasoning models are suggested to be promising diagnostic adjuncts in intensive care units.
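
The Likert summaries above are medians with interquartile ranges. As a trivial illustration (the grades are invented, not the study's data), such a summary can be computed as:

```python
import numpy as np

# Hypothetical 1-5 Likert grades for one quality dimension (made-up values).
grades = np.array([5, 4, 5, 4, 5, 3, 5, 4, 4, 5])

median = np.median(grades)
q1, q3 = np.percentile(grades, [25, 75])
print(f"median {median:.1f} (IQR {q1:.1f}-{q3:.1f})")
```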

  • Predicting Mandibular Bone Growth Using Artificial Intelligence and Machine Learning: A Systematic Review. Advances in Artificial Intelligence and Machine Learning, Jan 1, 2024. Mahmood Dashti et al. DOI: 10.54364/aaiml.2024.43159

Introduction: The accurate prediction of mandibular bone growth is crucial in orthodontics and maxillofacial surgery, impacting treatment planning and patient outcomes. Traditional methods often fall short due to their reliance on linear models and clinician expertise, which are prone to human error and variability. Artificial intelligence (AI) and machine learning (ML) offer advanced alternatives, capable of processing complex datasets to provide more accurate predictions. This systematic review examines the efficacy of AI and ML models in predicting mandibular growth compared to traditional methods. Method: A systematic review was conducted following the PRISMA guidelines, focusing on studies published up to July 2024. Databases searched included PubMed, Embase, Scopus, and Web of Science. Studies were selected based on their use of AI and ML algorithms for predicting mandibular growth. A total of 31 studies were identified, with 6 meeting the inclusion criteria. Data were extracted on study characteristics, AI models used, and prediction accuracy. The risk of bias was assessed using the QUADAS-2 tool. Results: The review found that AI and ML models generally provided high accuracy in predicting mandibular growth. For instance, the LASSO model achieved an average error of 1.41 mm for predicting skeletal landmarks. However, not all AI models outperformed traditional methods; in some cases, deep learning models were less accurate than conventional growth prediction models. Discussion: The variability in datasets and study designs across the included studies posed challenges for comparing AI models' effectiveness. Additionally, the complexity of AI models may limit their clinical applicability. Despite these challenges, AI and ML show significant promise in enhancing predictive accuracy for mandibular growth. Conclusion: AI and ML models have the potential to revolutionize mandibular growth prediction, offering greater accuracy and reliability than traditional methods. However, further research is needed to standardize methodologies, expand datasets, and improve model interpretability for clinical integration.

  • Performance of DeepSeek, Qwen 2.5 MAX, and ChatGPT Assisting in Diagnosis of Corneal Eye Diseases, Glaucoma, and Neuro-Ophthalmology Diseases Based on Clinical Case Reports. medRxiv, Mar 17, 2025. Zain S Hussain et al. DOI: 10.1101/2025.03.14.25323836

This study evaluates the diagnostic performance of several AI models, including DeepSeek, in diagnosing corneal diseases, glaucoma, and neuro-ophthalmologic disorders. We retrospectively selected 53 case reports from the Department of Ophthalmology and Visual Sciences at the University of Iowa, comprising 20 corneal disease cases, 11 glaucoma cases, and 22 neuro-ophthalmology cases. The case descriptions were input into DeepSeek, ChatGPT-4.0, ChatGPT-01, and Qwen 2.5 Max. These responses were compared with diagnoses rendered by human experts (corneal specialists, glaucoma attendings, and neuro-ophthalmologists). Diagnostic accuracy and interobserver agreement, defined as the percentage difference between each AI model's performance and the average human expert performance, were determined. DeepSeek achieved an overall diagnostic accuracy of 79.2%, with specialty-specific accuracies of 90.0% in corneal diseases, 54.5% in glaucoma, and 81.8% in neuro-ophthalmology. ChatGPT-01 outperformed the other models with an overall accuracy of 84.9% (85.0% in corneal diseases, 63.6% in glaucoma, and 95.5% in neuro-ophthalmology), while Qwen exhibited a lower overall accuracy of 64.2% (55.0% in corneal diseases, 54.5% in glaucoma, and 77.3% in neuro-ophthalmology). Interobserver agreement analysis revealed that in corneal diseases, DeepSeek differed by -3.3% (90.0% vs 93.3%), ChatGPT-01 by -8.3%, and Qwen by -38.3%. In glaucoma, DeepSeek outperformed the human expert average by +3.0% (54.5% vs 51.5%), while ChatGPT-4.0 and ChatGPT-01 exceeded it by +12.1%, and Qwen was +3.0% above the human average. In neuro-ophthalmology, DeepSeek and ChatGPT-4.0 were 9.1% lower than the human average, ChatGPT-01 exceeded it by +4.6%, and Qwen was 13.6% lower. ChatGPT-01 demonstrated the highest overall diagnostic accuracy, especially in neuro-ophthalmology, while DeepSeek and ChatGPT-4.0 showed comparable performance. Qwen underperformed relative to the other models, especially in corneal diseases. Although these AI models exhibit promising diagnostic capabilities, they currently lag behind human experts in certain areas, underscoring the need for a collaborative integration of clinical judgment.

This study evaluated how well several artificial intelligence (AI) models diagnose eye diseases compared to human experts. We tested four AI systems across three types of eye conditions: diseases of the cornea, glaucoma, and neuro-ophthalmologic disorders. Overall, one AI model, ChatGPT-01, performed the best, correctly diagnosing about 85% of cases, and it excelled in neuro-ophthalmology by correctly diagnosing 95.5% of cases. Two other models, DeepSeek and ChatGPT-4.0, each achieved an overall accuracy of around 79%, while the Qwen model performed lower, with an overall accuracy of about 64%. When compared with human experts, who achieved very high accuracy in corneal diseases (93.3%) and neuro-ophthalmology (90.9%) but lower in glaucoma (51.5%), the AI models showed mixed results. In glaucoma, for instance, some AI models even outperformed human experts slightly, while in corneal diseases, all AI models were less accurate than the experts. These findings indicate that while AI shows promise as a supportive tool in diagnosing eye conditions, it still needs further improvement. Combining AI with human clinical judgment appears to be the best approach for accurate eye disease diagnosis.

Why carry out this study? With the rising burden of eye diseases and the inherent diagnostic challenges for complex conditions like glaucoma and neuro-ophthalmologic disorders, there is an unmet need for innovative diagnostic tools to support clinical decision-making.
What did the study ask? This study evaluated the diagnostic performance of four AI models across three ophthalmologic subspecialties, testing the hypothesis that advanced language models can achieve accuracy levels comparable to human experts.
What was learned from the study? Our results showed that ChatGPT-01 achieved the highest overall accuracy (84.9%), excelling in neuro-ophthalmology with a 95.5% accuracy, while DeepSeek and ChatGPT-4.0 each achieved 79.2%, and Qwen reached 64.2%.
What specific outcomes were observed? In glaucoma, AI model accuracies ranged from 54.5% to 63.6%, with some models slightly surpassing the human expert average of 51.5%, underscoring the diagnostic difficulty of this condition.
What has been learned and future implications? These findings highlight the potential of AI as a valuable adjunct to clinical judgment in ophthalmology, although further research and the integration of multimodal data are essential to optimize these tools for routine clinical practice.

  • Assessment of AI Models in Predicting Treatment Outcomes in Orthodontics. Journal of Pharmacy and Bioallied Sciences, Feb 1, 2024. Mohammad K Alam et al. DOI: 10.4103/jpbs.jpbs_852_23

Background: In the realm of orthodontics, the evaluation of treatment outcomes is a pivotal aspect. In recent times, artificial intelligence (AI) models have garnered attention as potential tools for predicting these outcomes. These AI models have the potential to enhance treatment planning and decision-making processes. However, a comprehensive assessment of their effectiveness and accuracy is essential before their widespread integration. Materials and Methods: In this study, we assessed the capability of AI models to predict treatment outcomes in orthodontics. A sample of 30 patients undergoing orthodontic treatment was selected. Various patient-specific parameters, including age, initial malocclusion severity, and treatment approach, were collected. The AI model was trained using a dataset comprising historical treatment cases and their respective outcomes. Subsequently, the trained AI model was applied to predict the treatment outcomes for the selected patients. Results: The results of this study indicated a moderate level of accuracy in the predictions made by the AI model. Out of the 30 patients, the model accurately predicted treatment outcomes for 22 patients, yielding a success rate of approximately 73%. However, the model exhibited limitations in accurately predicting outcomes for cases involving complex malocclusions or those requiring non-standard treatment approaches. Conclusion: In conclusion, this study underscores the potential of AI models in predicting treatment outcomes in orthodontics. While the AI model demonstrated promising accuracy in the majority of cases, its efficacy was diminished in complex and non-standard cases. Therefore, while AI models can serve as valuable tools to aid orthodontists in treatment planning, they should be utilized in conjunction with clinical expertise to ensure optimal decision-making and patient care.

  • Artificial Intelligence: The Future of Maxillofacial Prognosis and Diagnosis? Journal of Oral and Maxillofacial Surgery, Feb 26, 2021. Peter Rekawek et al. DOI: 10.1016/j.joms.2021.02.031

  • Artificial neural networks assessing adolescent idiopathic scoliosis: comparison with Lenke classification. The Spine Journal, Oct 2, 2013. Philippe Phan et al. DOI: 10.1016/j.spinee.2013.07.449

  • AI in Qualitative Health Research Appraisal: Comparative Study. JMIR Formative Research, Jul 8, 2025. August Landerholm. DOI: 10.2196/72815

Background: Qualitative research appraisal is crucial for ensuring credible findings but faces challenges due to human variability. Artificial intelligence (AI) models have the potential to enhance the efficiency and consistency of qualitative research assessments. Objective: This study aims to evaluate the performance of 5 AI models (GPT-3.5, Claude 3.5, Sonar Huge, GPT-4, and Claude 3 Opus) in assessing the quality of qualitative research using 3 standardized tools: Critical Appraisal Skills Programme (CASP), Joanna Briggs Institute (JBI) checklist, and Evaluative Tools for Qualitative Studies (ETQS). Methods: AI-generated assessments of 3 peer-reviewed qualitative papers in health and physical activity–related research were analyzed. The study examined systematic affirmation bias, interrater reliability, and tool-dependent disagreements across the AI models. Sensitivity analysis was conducted to evaluate the impact of excluding specific models on agreement levels. Results: Results revealed a systematic affirmation bias across all AI models, with "Yes" rates ranging from 75.9% (145/191; Claude 3 Opus) to 85.4% (164/192; Claude 3.5). GPT-4 diverged significantly, showing lower agreement ("Yes": 115/192, 59.9%) and higher uncertainty ("Cannot tell": 69/192, 35.9%). Proprietary models (GPT-3.5 and Claude 3.5) demonstrated near-perfect alignment (Cramer's V = 0.891; P < .001), while open-source models showed greater variability. Interrater reliability varied by assessment tool, with CASP achieving the highest baseline consensus (Krippendorff α = 0.653), followed by JBI (α = 0.477), and ETQS scoring lowest (α = 0.376). Sensitivity analysis revealed that excluding GPT-4 increased CASP agreement by 20% (α = 0.784), while removing Sonar Huge improved JBI agreement by 18% (α = 0.561). ETQS showed marginal improvements when excluding GPT-4 or Claude 3 Opus (+9%, α = 0.409). Tool-dependent disagreements were evident, particularly in ETQS criteria, highlighting AI's current limitations in contextual interpretation. Conclusions: The findings demonstrate that AI models exhibit both promise and limitations as evaluators of qualitative research quality. While they enhance efficiency, AI models struggle with reaching consensus in areas requiring nuanced interpretation, particularly for contextual criteria. The study underscores the importance of hybrid frameworks that integrate AI scalability with human oversight, especially for contextual judgment. Future research should prioritize developing AI training protocols that emphasize qualitative epistemology, benchmarking AI performance against expert panels to validate accuracy thresholds, and establishing ethical guidelines for disclosing AI's role in systematic reviews. As qualitative methodologies evolve alongside AI capabilities, the path forward lies in collaborative human-AI workflows that leverage AI's efficiency while preserving human expertise for interpretive tasks.
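
Cramer's V, used above to quantify alignment between model verdicts, derives from the chi-squared statistic of the models' cross-tabulated ratings. A minimal Python sketch (the contingency table is invented, not the study's data):

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramer's V association for an r x c contingency table."""
    chi2 = chi2_contingency(table)[0]
    n = table.sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, c) - 1))))

# Made-up cross-tabulation of two models' appraisal verdicts
# (rows = model A: Yes / Cannot tell / No; columns = model B).
table = np.array([[140, 10, 5],
                  [12, 20, 3],
                  [4, 2, 16]])
print(round(cramers_v(table), 3))
```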

  • Abstract B078: Artificial-intelligence-driven breast density assessment in the transition from full-field digital mammograms to digital breast tomosynthesis. Cancer Research, Feb 1, 2024. Krisha Anant et al. DOI: 10.1158/1538-7445.advbc23-b078

Introduction: To enhance reproducibility and robustness in mammographic density assessment, various artificial intelligence (AI) models have been proposed to automatically classify mammographic images into BI-RADS density categories. Despite their promising performances, so far density AI models have been assessed primarily in traditional full-field digital mammography (FFDM) images. Our study aims to assess the potential of AI in breast density assessment in FFDM versus the newer synthetic mammography (SM) images acquired with digital breast tomosynthesis. Methods: We retrospectively analyzed negative (BI-RADS 1 or 2) routine mammographic screening exams (Selenia or Selenia Dimensions; Hologic) acquired at sites within the Barnes-Jewish/Christian (BJC) Healthcare network in St. Louis, MO from 2015 to 2018. BI-RADS breast density assessments of radiologists were obtained from BJC's mammography reporting software (Magview 7.1). For each mammographic imaging modality, a balanced dataset of 4,000 women was selected so there were equal numbers of women in each of the four BI-RADS density categories, and each woman had at least one mediolateral oblique (MLO) and one craniocaudal (CC) view per breast in that mammographic imaging modality. Previously validated pre-processing steps were applied to all FFDM and SM images to standardize image orientation and intensity. Images were then split into training, validation, and test sets at ratios of 80%, 10%, and 10%, respectively, while maintaining the distribution of breast density categories and ensuring that all images of the same woman appear only in one set. Our AI model was based on the widely used ResNet50 architecture and was designed to accept as an input a mammographic image and predict the BI-RADS breast density category that the image belongs to. Our AI model was optimized, trained, and evaluated separately for each mammographic imaging modality. We report on the AI model's predictive accuracy on the test set for each mammographic imaging modality, for both views as well as separately for CC and MLO; accuracy differences in FFDM versus SM were assessed via bootstrapping. Results: A batch size of 32, learning rate of e-6, and Adam optimizer were chosen as the optimal hyperparameters for our AI model. Using the same hyperparameters, the AI model demonstrated substantially higher accuracy on the test set for FFDM than for SM (FFDM: accuracy = 71% ± 4.5% versus SM: accuracy = 66% ± 4.2%; p-value < 0.001 for comparison). Similar conclusion held when CC and MLO views were evaluated separately (accuracy = 72% ± 4.6% versus 66% ± 4.3% for CC; accuracy = 69% ± 4.5% versus 62% ± 4.3% for MLO; p-value < 0.001 for both comparisons). Conclusions: AI performance in BI-RADS breast density assessment was significantly higher on FFDM versus SM, even under the same AI model design, dataset size and training process. Our preliminary findings suggest that further AI optimizations and adaptations may be needed as we translate AI models from FFDM to the newer SM format acquired with digital breast tomosynthesis. Citation Format: Krisha Anant, Juanita Hernandez Lopez, Debbie Bennett, Aimilia Gastounioti. Artificial-intelligence-driven breast density assessment in the transition from full-field digital mammograms to digital breast tomosynthesis [abstract]. In: Proceedings of the AACR Special Conference in Cancer Research: Advances in Breast Cancer Research; 2023 Oct 19-22; San Diego, California.
Philadelphia (PA): AACR; Cancer Res 2024;84(3 Suppl_1):Abstract nr B078.
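
The FFDM-versus-SM accuracy comparison above was assessed via bootstrapping. One common percentile-bootstrap formulation, sketched here with simulated per-image correctness vectors (all values invented, not the study's data), is:

```python
import numpy as np

def bootstrap_accuracy_diff(correct_a, correct_b, n_boot=10_000, seed=0):
    """Percentile bootstrap of the accuracy difference between two test sets.
    correct_a / correct_b: arrays of 1 (image classified correctly) or 0."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(correct_a), np.asarray(correct_b)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each test set with replacement and compare mean accuracies.
        diffs[i] = rng.choice(a, a.size).mean() - rng.choice(b, b.size).mean()
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return diffs.mean(), (lo, hi)

# Simulated data: FFDM model correct on ~71% of 400 images, SM on ~66%.
rng = np.random.default_rng(1)
ffdm = (rng.random(400) < 0.71).astype(int)
sm = (rng.random(400) < 0.66).astype(int)
print(bootstrap_accuracy_diff(ffdm, sm))
```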

More from: Coluna/Columna
  • POSTOPERATIVE FUNCTIONAL ASSESSMENT OF ACDF VERSUS PCDF IN MCD: A SYSTEMATIC REVIEW AND META-ANALYSIS. Coluna/Columna, Jan 1, 2025. Pacai Gunsch Arigbatsa et al. DOI: 10.1590/s1808-185120252402295460

  • EVALUATION OF THE PRIMARY MANAGEMENT OF SPINE CASES AMONG PHYSICIANS IN THE SÃO PAULO HEALTHCARE NETWORKS. Coluna/Columna, Jan 1, 2025. Gabriela Neves Vaz et al. DOI: 10.1590/s1808-185120252401285035

  • LYMPHOPROLIFERATIVE DISEASE IN THE SPINE: PREDICTIVE FACTORS FOR POSTOPERATIVE COMPLICATIONS. Coluna/Columna, Jan 1, 2025. Rafael Moraes Trincado et al. DOI: 10.1590/s1808-185120252401284944

  • ATLANTOAXIAL INSTABILITY IN CHILDREN WITH DOWN SYNDROME. Coluna/Columna, Jan 1, 2025. Catarina Massano et al. DOI: 10.1590/s1808-185120252402293128

  • EFFICACY OF PLATELET-RICH PLASMA IN IMPROVING SPINAL FUSION: A SYSTEMATIC REVIEW AND META-ANALYSIS. Coluna/Columna, Jan 1, 2025. Alhoi Hendry Henderson et al. DOI: 10.1590/s1808-185120252402292068

  • MANAGEMENT OF RETAINED GUNSHOT INJURIES IN THE SPINE: A SYSTEMATIC REVIEW OF THE LITERATURE. Coluna/Columna, Jan 1, 2025. Emiliano Neves Vialle et al. DOI: 10.1590/s1808-185120252402293832

  • COMPARISON OF MONOAXIAL AND POLYAXIAL SCREW FIXATION IN A3/A4 THORACOLUMBAR FRACTURES. Coluna/Columna, Jan 1, 2025. Emiliano Neves Vialle et al. DOI: 10.1590/s1808-185120252401289576

  • ANALYSIS OF THE ASSOCIATION BETWEEN PROLONGED CELL PHONE USE AND BACK PAIN AMONG HIGH SCHOOL AND COLLEGE STUDENTS IN BRAZIL. Coluna/Columna, Jan 1, 2025. Bianca Lunna Pereira Da Silva Costa Barros et al. DOI: 10.1590/s1808-185120252402291780

  • ENDOSCOPY IN LUMBAR ADJACENT SEGMENT DISEASE: A SYSTEMATIC REVIEW AND META-ANALYSIS. Coluna/Columna, Jan 1, 2025. Pedro Fellipe Deborto Rudine Remolli Evangelista et al. DOI: 10.1590/s1808-185120252401289025
