Automated extraction of Biomarker information from pathology reports

Jeongeun Lee, Sung-Hye Park, Eunsil Yoon, Jeong-Wook Seo, Hyun-Je Song, Jinwook Choi, Seong-Bae Park, Peom Park

doi:10.1186/s12911-018-0609-7

Abstract

Background: Pathology reports are written in free-text form, which precludes efficient data gathering. We aimed to overcome this limitation and design an automated system for extracting biomarker profiles from accumulated pathology reports.

Methods: We designed a new data model for representing biomarker knowledge. The automated system parses immunohistochemistry reports based on a “slide paragraph” unit defined as a set of immunohistochemistry findings obtained for the same tissue slide. Pathology reports are parsed using context-free grammar for immunohistochemistry, and using a tree-like structure for surgical pathology. The performance of the approach was validated on manually annotated pathology reports of 100 randomly selected patients managed at Seoul National University Hospital.

Results: High F-scores were obtained for parsing biomarker name and corresponding test results (0.999 and 0.998, respectively) from the immunohistochemistry reports, compared to relatively poor performance for parsing surgical pathology findings. However, applying the proposed approach to our single-center dataset revealed information on 221 unique biomarkers, which represents a richer result than biomarker profiles obtained based on the published literature. Owing to the data representation model, the proposed approach can associate biomarker profiles extracted from an immunohistochemistry report with corresponding pathology findings listed in one or more surgical pathology reports. Term variations are resolved by normalization to corresponding preferred terms determined by expanded dictionary look-up and text similarity-based search.

Conclusions: Our proposed approach for biomarker data extraction addresses key limitations regarding data representation and can handle reports prepared in the clinical setting, which often contain incomplete sentences, typographical errors, and inconsistent formatting.
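The term normalization step mentioned in the abstract (dictionary look-up followed by a text similarity-based search) can be illustrated with a minimal sketch. This is not the authors' implementation: the `SYNONYMS` table, the `normalize_term` function, the 0.8 threshold, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical look-up table: lower-cased variant -> preferred term.
# The entries are illustrative and are not taken from the paper.
SYNONYMS = {
    "ck7": "Cytokeratin 7",
    "cytokeratin 7": "Cytokeratin 7",
    "ki-67": "Ki-67",
    "ki67": "Ki-67",
}


def normalize_term(raw, threshold=0.8):
    """Map a raw biomarker string to a preferred term, or None.

    Step 1: exact dictionary look-up on the lower-cased string.
    Step 2: fall back to the most similar dictionary key, accepted
            only if its similarity ratio reaches `threshold`.
    """
    key = raw.strip().lower()
    if key in SYNONYMS:
        return SYNONYMS[key]

    best_key, best_score = None, 0.0
    for known in SYNONYMS:
        score = SequenceMatcher(None, key, known).ratio()
        if score > best_score:
            best_key, best_score = known, score

    if best_key is not None and best_score >= threshold:
        return SYNONYMS[best_key]
    return None
```

Under these assumptions, an exact variant such as "CK7" is resolved by the dictionary, a misspelling such as "cytokeratine 7" is resolved by the similarity fallback, and a string with no close match returns `None` so that it can be reviewed manually.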

Highlights

Pathology reports are written in free-text form, which precludes efficient data gathering
In order to facilitate the detection of potential relationships between various immunologic biomarkers and pathologic diagnosis, we previously developed a web-based information system [8] designed to compute and display statistics of clinical data extracted from pathology reports
We associated IHC findings with microscopic findings based on tissue slide identifiers (TS_IDs), which represent the serial numbers of tissue slides; we introduced the tissue slide paragraph (TS_P) to refer to a set of IHC findings corresponding to a single tissue slide
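As a rough illustration of the last highlight, the sketch below groups IHC findings into tissue slide paragraphs keyed by TS_ID and then attaches the microscopic finding recorded for the same slide. The record layout, the field names (`ts_id`, `biomarker`, `result`), the sample values, and both function names are hypothetical; the paper's actual data model is not reproduced here.

```python
from collections import defaultdict


def build_slide_paragraphs(ihc_findings):
    """Group IHC findings by tissue slide identifier (TS_ID).

    Each group plays the role of one tissue slide paragraph (TS_P).
    """
    paragraphs = defaultdict(list)
    for finding in ihc_findings:
        paragraphs[finding["ts_id"]].append(
            (finding["biomarker"], finding["result"])
        )
    return dict(paragraphs)


def link_to_microscopic(paragraphs, microscopic_by_ts_id):
    """Attach the microscopic finding that shares the same TS_ID, if any."""
    return {
        ts_id: {
            "ihc": rows,
            "microscopic": microscopic_by_ts_id.get(ts_id),
        }
        for ts_id, rows in paragraphs.items()
    }


# Hypothetical sample data, for illustration only.
ihc = [
    {"ts_id": "S1", "biomarker": "Ki-67", "result": "positive"},
    {"ts_id": "S1", "biomarker": "p53", "result": "negative"},
    {"ts_id": "S2", "biomarker": "Ki-67", "result": "negative"},
]
microscopic = {"S1": "finding recorded for slide S1"}

linked = link_to_microscopic(build_slide_paragraphs(ihc), microscopic)
```

Keying both report types on the slide identifier keeps the join a simple dictionary look-up; a slide with no matching microscopic finding (here "S2") is kept with `None` rather than dropped.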

Summary

Introduction

Pathology reports are written in free-text form, which precludes efficient data gathering. Precision medicine is a newly emerging trend in medicine, whereby individualized medical treatments are designed based on the specific biologic information of each patient. To ensure a reliable histologic diagnosis, it is important to have access to statistical data from pathology reports regarding similar patients, which can be achieved through the retrospective study of relevant reports describing biomarkers.

The main advantages of Pathpedia include high-level, manual curation of data and the large number of information sources (up to 4000 references). However, certain biomarkers are discussed in only a limited number of journal articles, which precludes data verification. There is also limited coverage of various ethnic groups and uneven data distribution regarding race, life patterns, nutritional habits, geology, and climate, which precludes genomic-level comparisons based on data from Pathpedia.

Objectives

Methods

Results

Conclusion