Abstract

e23196 Background: Clinical Outcome Assessment (COA) conceptual gap analyses for oncology are complex and time-consuming. Artificial intelligence (AI) may reduce the time to completion of such analyses. We aimed to assess the performance of two AI models for literature screening to identify relevant qualitative oncology research, and we compared the accuracy and run time of both models.

Methods: We manually curated a dataset of title/abstract screening decisions (n = 1,700 study references) across 17 landscape reviews. Of these, 11 landscape reviews (n = 951 study references) were in oncology, covering 8 solid cancers (breast, lung, urothelial, colorectal, esophageal, head and neck, pancreatic, and stomach) and 3 non-solid cancers (lymphoma, acute myeloid leukemia, and multiple myeloma). Each citation was annotated for eligibility (Y/N) by population, study design (qualitative), and reporting of concepts (how patients feel or function). We then compared the accuracy of two AI models at predicting the screening decisions of expert researchers: Generative Pre-trained Transformer 4 (GPT-4, OpenAI) prompts and a fine-tuned SciFive biomedical large language model (LLM). We used 70% of the data for training and 30% for testing. Accuracy estimates were obtained only for the models' ability to label eligibility within the 11 oncology datasets.

Results: Both LLMs performed well at assessing relevance by oncology population, with F1-scores of 0.92 for GPT-4 and 0.83 for SciFive (precision 0.92 and 0.93, respectively). For concept reporting, the fine-tuned SciFive model outperformed GPT-4, with an F1-score and precision of 0.88 and 0.92 versus 0.81 and 0.79. The same was true, though less pronounced, for eligibility by study design, with an F1-score and precision of 0.81 and 0.90 versus 0.86 and 0.76. For overall eligibility, the customized SciFive model outperformed the GPT-4 model, with an F1-score and precision of 0.84 and 0.92 versus 0.85 and 0.82. Lastly, the GPT-4 prompts took 10-30 minutes to screen 100 abstracts, whereas the customized SciFive model took 1-2 minutes on a computer with a Quadro RTX 8000 GPU.

Conclusions: Both AI models are promising. The fine-tuned SciFive model appears slightly more accurate and performs substantially faster than the GPT-4 model.
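To make the comparison concrete, the sketch below illustrates one way a GPT-4 prompt could be asked for a Y/N eligibility label per citation and scored against expert decisions using precision and F1, the metrics reported above. This is a minimal illustration only: the study's actual prompts, eligibility criteria, and evaluation pipeline are not described in the abstract, and all function names and the prompt wording here are hypothetical.

```python
# Hedged sketch: hypothetical screening and scoring code, not the authors' pipeline.
from openai import OpenAI
from sklearn.metrics import precision_score, f1_score

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def screen_abstract(title: str, abstract: str) -> str:
    """Ask GPT-4 for a single Y/N eligibility label for one title/abstract record."""
    prompt = (
        "You are screening citations for a qualitative oncology landscape review.\n"
        "Answer with a single letter, Y or N: is this study (1) in an oncology "
        "population, (2) qualitative in design, and (3) reporting concepts about "
        "how patients feel or function?\n\n"
        f"Title: {title}\nAbstract: {abstract}"
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = resp.choices[0].message.content.strip().upper()
    return "Y" if answer.startswith("Y") else "N"


def evaluate(records: list[dict]) -> dict:
    """Compare model labels with expert screening decisions (Y/N) on a held-out set.

    Each record is assumed to hold "title", "abstract", and the expert "label".
    """
    y_true = [r["label"] for r in records]
    y_pred = [screen_abstract(r["title"], r["abstract"]) for r in records]
    return {
        "precision": precision_score(y_true, y_pred, pos_label="Y"),
        "f1": f1_score(y_true, y_pred, pos_label="Y"),
    }
```

A fine-tuned local model such as SciFive would be evaluated with the same metrics, replacing the API call with a batched forward pass on the GPU, which is consistent with the run-time difference reported above.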
