Abstract
Introduction: Much of the information in electronic medical records (EMRs) required for the practice of clinical oncology is contained in unstructured text. Natural language processing (NLP) has been used to extract information from EMR text, but its accuracy has been suboptimal. In late 2018, a powerful new deep-learning NLP algorithm was published: Bidirectional Encoder Representations from Transformers (BERT). BERT set new accuracy records and, for the first time, achieved human-level performance on several NLP benchmarks. Our goal was to train BERT to extract clinically relevant data from pathology reports with high accuracy. Procedures: Like many cancer centers nationwide, Moffitt Cancer Center employs Certified Tumor Registrars (CTRs) to collect and report data about cancer patients to state and federal agencies. The CTR-extracted data serve as labels that identify, with high accuracy, important information in each pathology report. Consequently, we used these data to fine-tune BERT for a question-answering (Q&A) task. Our system sought the answers to two predetermined questions in each pathology report: “What organ contains the tumor?” and “What is the kind of tumor or carcinoma?” To achieve this, we matched surgical pathology reports created at Moffitt from January 1, 2007, onwards with structured data extracted by CTRs. The resulting dataset was randomly divided into training (80%) and testing (20%) subsets. After Q&A training, model performance was assessed using the test dataset. Two metrics were calculated for each question: a true-or-false indication of a perfect word-for-word match between the BERT-extracted data and the CTR-extracted data, and the F1 statistic. The latter produces a value between 0% and 100% indicating the degree of overlap between words in the BERT-extracted data and words in the CTR-extracted data. Results: The final dataset contained 14,143 pathology reports (11,520 for training, 2,623 for testing).
This dataset included tumors from 228 organ sites involving 232 histological classifications. The three most common combinations of organ site and histological classification were Prostate Gland / Adenocarcinoma (6.7%), Breast / Invasive Carcinoma (6.1%), and Breast Overlapping Lesion / Invasive Carcinoma (5.9%). Our BERT-based Q&A system searched for answers to both questions in each test report, generating a total of 5,246 answers. Of these, 4,667 (89%) were a perfect word-for-word match with the corresponding CTR-extracted phrases. The mean F1 statistic between the BERT answers and the CTR-extracted phrases was 92%. Conclusions: Future efforts will focus on improving performance via unsupervised training of the BERT language model using 484,000 Moffitt pathology reports. We will also extract additional data fields with CTR-matched ground-truth labels. Ultimately, new NLP transformer models could aid extraction of information from pathology reports and other EMR documents. This, in turn, could greatly facilitate personalized medicine.
Citation Format: Ross Mitchell, Rachel Howard, Patricia Lewis, Katie Fellows, Jennie Jones, Phillip Reisman, Brooke Fridley, Dana Rollison. Deep learning for automatic extraction of tumor site and histology from unstructured pathology reports [abstract]. In: Proceedings of the Annual Meeting of the American Association for Cancer Research 2020; 2020 Apr 27-28 and Jun 22-24. Philadelphia (PA): AACR; Cancer Res 2020;80(16 Suppl):Abstract nr 2101.
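The two evaluation metrics described in the abstract, a perfect word-for-word match and a word-overlap F1 statistic, can be illustrated with a short sketch. The abstract does not give the exact formula or any text normalization, so the code below assumes the token-level F1 commonly used for extractive question answering (harmonic mean of word precision and recall) and assumes lowercasing and whitespace tokenization. The function names `tokenize`, `exact_match`, and `token_f1` are hypothetical and are not taken from the authors' system.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase and split on whitespace. The abstract does not
    # describe how answers were normalized before comparison.
    return text.lower().split()


def exact_match(predicted: str, reference: str) -> bool:
    # True only when the two answers contain the same words in the same order.
    return tokenize(predicted) == tokenize(reference)


def token_f1(predicted: str, reference: str) -> float:
    # Word-overlap F1 in [0, 1]; multiply by 100 to express it as a percentage.
    pred_tokens = tokenize(predicted)
    ref_tokens = tokenize(reference)
    if not pred_tokens or not ref_tokens:
        # Two empty answers agree perfectly; one empty answer scores zero.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared word at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical answers for illustration only; not drawn from the dataset.
    model_answer = "invasive carcinoma"
    registrar_answer = "invasive ductal carcinoma"
    print(exact_match(model_answer, registrar_answer))
    print(round(token_f1(model_answer, registrar_answer), 2))
```

In this illustrative pair, the model's answer shares two of the registrar's three words, so it fails the exact-match test while still earning partial credit under F1. This is consistent with the reported mean F1 (92%) exceeding the exact-match rate (89%).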