Abstract

Retrospective analyses of large-scale prostate cancer databases to identify trends in cancer care and outcomes require detailed histopathologic characterization. Manual extraction of these characteristics from prostate pathology reports is time-consuming and prone to human error. Our goal was to develop a rule-based software algorithm capable of automatically extracting pertinent characteristics (primary and secondary Gleason grade, maximum core involvement, number of positive and total cores) from prostate pathology reports. Prostate pathology reports were manually extracted from our institution’s electronic medical record system for 135 patients. The dataset was split into training and testing sets of 110 and 25 patients, respectively. The training set was examined by hand to identify patterns and linguistic features that could be used to build the rules for extracting the clinical characteristics. During the training phase, it was noted that a different set of rules was required for outside slide reviews, as these often presented the results in summary format rather than the core-by-core breakdown used in reports authored at our institution. These rules were then implemented in the statistical software platform R. The performance of the algorithm was assessed in both the training and testing sets by comparing the software predictions of the clinical characteristics to those made by a human observer. Of the 135 pathology reports, 29 were outside slide reviews (23 in training and 6 in testing). The algorithm correctly identified the primary and secondary Gleason grade, as well as the maximum core involvement percentage, in 100% (135/135) of the data set. Due to ambiguity in reporting, four of the reports were excluded from the total core analysis and two were excluded from the positive core analysis. The algorithm correctly identified the total cores in 95% (124/131) of the data set and the positive cores in 98% (131/133) of the data set. The testing set accuracy for the total and positive cores was 83% (19/23) and 96% (23/24), respectively. Analysis of errors revealed that four of the seven incorrectly identified total cores and both incorrectly identified positive cores were in outside slide reviews. The rule-based software algorithm was able to correctly extract and identify the primary and secondary Gleason grades, maximum core involvement, and number of positive and total cores in the majority of the examined prostate pathology reports. Standardized reporting, including a core-by-core breakdown, may lead to improved accuracy of text-mining algorithms and mitigate the need for human registrars.
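
To make the idea of rule-based extraction concrete, the following is a minimal sketch of how such rules might look for a core-by-core report. It is not the authors' implementation: they worked in R and their actual rules are not given here, so Python is used purely for illustration. The report wording, the regular expressions, the `Summary` type, the `summarize` function, and the choice to report the highest-scoring Gleason pattern (ties broken by the higher primary grade) are all assumptions made for this example.

```python
import re
from typing import NamedTuple, Optional


class Summary(NamedTuple):
    primary: Optional[int]
    secondary: Optional[int]
    max_involvement: Optional[int]  # percent
    positive_cores: int
    total_cores: int


# Hypothetical patterns; real report wording varies between institutions.
GLEASON = re.compile(r"gleason(?:\s+(?:score|grade))?\s*:?\s*(\d)\s*\+\s*(\d)", re.I)
PERCENT = re.compile(r"(\d{1,3})\s*%")
POS_OF_TOTAL = re.compile(r"(\d+)\s+of\s+(\d+)\s+cores?", re.I)
TOTAL_ONLY = re.compile(r"(\d+)\s+cores?", re.I)


def summarize(report: str) -> Summary:
    """Apply the rules line by line, assuming one biopsy site per line."""
    best = None      # (primary, secondary) of the highest-scoring site
    max_pct = None   # largest involvement percentage seen so far
    positive = total = 0
    for line in report.splitlines():
        g = GLEASON.search(line)
        if g:
            grade = (int(g.group(1)), int(g.group(2)))
            # Assumption: keep the highest sum, then the higher primary grade.
            if best is None or (sum(grade), grade[0]) > (sum(best), best[0]):
                best = grade
        for pct in PERCENT.findall(line):
            max_pct = max(max_pct or 0, int(pct))
        counts = POS_OF_TOTAL.search(line)
        if counts:
            positive += int(counts.group(1))
            total += int(counts.group(2))
        else:
            # Benign sites are assumed to state only the number of cores.
            only = TOTAL_ONLY.search(line)
            if only:
                total += int(only.group(1))
    primary, secondary = best if best else (None, None)
    return Summary(primary, secondary, max_pct, positive, total)


# Invented sample report in an assumed core-by-core layout.
SAMPLE = """\
A. Right base: adenocarcinoma, Gleason score 3+4=7, involving 2 of 3 cores, 40% of tissue.
B. Left apex: adenocarcinoma, Gleason score 4+3=7, involving 1 of 2 cores, 15% of tissue.
C. Left mid: benign prostatic tissue, 2 cores.
"""

print(summarize(SAMPLE))
```

A sketch like this only covers the core-by-core layout. A summary-format report, such as the outside slide reviews described above, would need its own patterns, which is consistent with the observation that most core-count errors occurred in those reports.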
