A Question-and-Answer System to Extract Data From Free-Text Oncological Pathology Reports (CancerBERT Network): Development Study.

Joseph Ross Mitchell, Phillip Szepietowski, Rachel Howard, Dana E Rollison, Patricia Lewis, Phillip Reisman, Jennie D Jones, Brooke L Fridley

doi:10.2196/27210

Open Access

https://doi.org/10.2196/27210

Journal: Journal of Medical Internet Research
Publication Date: Mar 23, 2022
Citations: 15
License type: CC BY

Affiliation: Moffitt Cancer Center

Abstract

Background: Information in pathology reports is critical for cancer care. Natural language processing (NLP) systems used to extract information from pathology reports are often narrow in scope or require extensive tuning. Consequently, there is growing interest in automated deep learning approaches. A powerful new NLP algorithm, bidirectional encoder representations from transformers (BERT), was published in late 2018. BERT set new performance standards on tasks as diverse as question answering, named entity recognition, speech recognition, and more.

Objective: The aim of this study is to develop a BERT-based system to automatically extract detailed tumor site and histology information from free-text oncological pathology reports.

Methods: We pursued three specific aims: extract accurate tumor site and histology descriptions from free-text pathology reports, accommodate the diverse terminology used to indicate the same pathology, and provide accurate standardized tumor site and histology codes for use by downstream applications. We first trained a base language model to comprehend the technical language in pathology reports. This involved unsupervised learning on a training corpus of 275,605 electronic pathology reports from 164,531 unique patients that included 121 million words. Next, we trained a question-and-answer (Q&A) model that connects a Q&A layer to the base pathology language model to answer pathology questions. Our Q&A system was designed to search for the answers to two predefined questions in each pathology report: "What organ contains the tumor?" and "What is the kind of tumor or carcinoma?" This involved supervised training on 8197 pathology reports, each with ground truth answers to these two questions determined by certified tumor registrars. The data set included 214 tumor sites and 193 histologies. The tumor site and histology phrases extracted by the Q&A model were used to predict International Classification of Diseases for Oncology, Third Edition (ICD-O-3), site and histology codes. This involved fine-tuning two additional BERT models: one to predict site codes and another to predict histology codes. Our final system includes a network of three BERT-based models. We call this the CancerBERT network (caBERTnet). We evaluated caBERTnet using a sequestered test data set of 2050 pathology reports with ground truth answers determined by certified tumor registrars.

Results: caBERTnet’s accuracies for predicting group-level site and histology codes were 93.53% (1895/2026) and 97.6% (1993/2042), respectively. The top-5 accuracies for predicting fine-grained ICD-O-3 site and histology codes with 5 or more samples each in the training data set were 92.95% (1794/1930) and 96.01% (1853/1930), respectively.

Conclusions: We have developed an NLP system that outperforms existing algorithms at predicting ICD-O-3 codes across an extensive range of tumor sites and histologies. Our new system could help reduce treatment delays, increase enrollment in clinical trials of new therapies, and improve patient outcomes.
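The data flow described in the Methods (two fixed questions answered by a Q&A model, then one coder per answer) and the top-5 accuracy metric reported in the Results can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `QAModel` and `Coder` interfaces, and the keyword-based toy stand-ins are all hypothetical, and the small lookup tables only mimic what the fine-tuned BERT models would return.

```python
from typing import Callable, Dict, List, Sequence

# The two predefined questions quoted in the abstract.
SITE_QUESTION = "What organ contains the tumor?"
HISTOLOGY_QUESTION = "What is the kind of tumor or carcinoma?"

# Hypothetical interfaces (assumptions, not the authors' API):
# a Q&A model maps (question, report) to an extracted phrase;
# a coder maps a phrase to candidate codes ranked best first.
QAModel = Callable[[str, str], str]
Coder = Callable[[str], List[str]]


def run_pipeline(report: str, qa: QAModel,
                 site_coder: Coder, hist_coder: Coder) -> Dict[str, List[str]]:
    """Ask both questions, then pass each extracted phrase to its own coder."""
    site_phrase = qa(SITE_QUESTION, report)
    hist_phrase = qa(HISTOLOGY_QUESTION, report)
    return {"site": site_coder(site_phrase),
            "histology": hist_coder(hist_phrase)}


def top_k_accuracy(ranked: Sequence[Sequence[str]],
                   truth: Sequence[str], k: int = 5) -> float:
    """Fraction of cases whose true code is among the first k ranked candidates."""
    if len(ranked) != len(truth) or not truth:
        raise ValueError("ranked and truth must be non-empty and equal in length")
    hits = sum(1 for cands, t in zip(ranked, truth) if t in cands[:k])
    return hits / len(truth)


def percent(correct: int, total: int) -> float:
    """Convert a correct/total count into a percentage rounded to 2 decimals."""
    return round(100 * correct / total, 2)


# Toy stand-ins so the sketch runs without any trained model.
TOY_SITE_CODES = {"lung": ["C34.9"], "breast": ["C50.9"]}
TOY_HIST_CODES = {"adenocarcinoma": ["8140/3"], "ductal carcinoma": ["8500/3"]}


def toy_qa(question: str, report: str) -> str:
    """Keyword lookup standing in for the Q&A model."""
    vocab = TOY_SITE_CODES if question == SITE_QUESTION else TOY_HIST_CODES
    text = report.lower()
    for term in vocab:
        if term in text:
            return term
    return ""


def toy_site_coder(phrase: str) -> List[str]:
    return TOY_SITE_CODES.get(phrase, [])


def toy_hist_coder(phrase: str) -> List[str]:
    return TOY_HIST_CODES.get(phrase, [])


if __name__ == "__main__":
    print(run_pipeline("Right lung biopsy: adenocarcinoma.",
                       toy_qa, toy_site_coder, toy_hist_coder))
    # Recompute the percentages from the counts given in the Results.
    for correct, total in [(1895, 2026), (1993, 2042), (1794, 1930), (1853, 1930)]:
        print(correct, total, percent(correct, total))
```

Keeping the Q&A step and the two coders as separate callables mirrors the three-model structure the abstract describes, and the `percent` helper makes it easy to confirm that the reported percentages match the counts in parentheses.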

Similar Papers

Abstract 2101: Deep learning for automatic extraction of tumor site and histology from unstructured pathology reports
Ross Mitchell ... Rachel Howard
Cancer Research | VOL. 80
13 Aug 2020

Bidirectional encoders to state-of-the-art: a review of BERT and its transformative impact on natural language processing
Rajesh Gupta
Информатика. Экономика. Управление - Informatics. Economics. Management | VOL. 3
02 Mar 2024

The EMory BrEast imaging Dataset (EMBED): A Racially Diverse, Granular Dataset of 3.4 Million Screening and Diagnostic Mammographic Images.
Jiwoong J Jeong ... Gabriela Oprea
Radiology: Artificial Intelligence | VOL. 5
01 Jan 2023

Comparing human coding to two natural language processing algorithms in aspirations of people affected by Duchenne Muscular Dystrophy
Carolyn E Schwartz ... Roland B Stark
Journal of Methods and Measurement in the Social Sciences | VOL. 13
01 Oct 2022
