Abstract

A cancer pathology report is a valuable medical document that provides information for the clinical management of the patient and the evaluation of health care. However, the quality of free-text reporting varies, ranging from comprehensive to incomplete. Moreover, the increasing incidence of cancer has generated a high throughput of pathology reports. Hence, manual extraction and classification of information from these reports can be complex and resource-intensive. This study aimed to (i) evaluate the quality of over 80,000 breast, colorectal, and prostate cancer free-text pathology reports and (ii) assess the effectiveness of random forest (RF) and variants of the support vector machine (SVM) in classifying reports into benign and malignant classes. The study approach comprises data preprocessing, visualisation, feature selection, text classification, and evaluation of performance metrics. The performance of the classifiers was evaluated across various feature sizes, which were jointly selected by four filter feature selection methods. The feature selection methods identified established clinical terms associated with each of the three cancers. With uni-gram tokenisation, the predictive power of the RF model was consistent across the feature sizes, with overall F-scores of 95.2%, 94.0%, and 95.3% for breast, colorectal, and prostate cancer classification, respectively. The radial SVM achieved better classification performance than its linear variant for most of the feature sizes. The classifiers also achieved high precision, recall, and accuracy. This study supports a nationally agreed standard in pathology reporting and the use of text mining for encoding, classifying, and producing high-quality information abstractions for cancer prognosis and research.

Highlights

  • Feature selection methods can be used to efficiently extract essential features from individual pathology reports, and these features serve as input to the classifiers

  • The performance of the classifiers indicates how well the extracted features represent the critical content of the pathology report

  • We have evaluated a framework to audit the quality of breast, colorectal, and prostate cancer pathology reports archived in the NHLS-Corporate Data Warehouse (CDW) between 2011 and 2019 and have developed automated machine learning (ML) algorithms to identify case reports belonging to the benign or malignant class
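
As a rough illustration of how a filter feature selection method can rank uni-gram terms before classification, the sketch below scores each term with a chi-square statistic against the benign/malignant label and keeps the top-ranked terms. The abstract does not name the four filter methods used in the study, so chi-square is an assumption here, and the toy reports, function names, and labels are hypothetical stand-ins rather than the study's data or code.

```python
import re
from collections import Counter

# Hypothetical toy reports; the study's real corpus is not reproduced here.
REPORTS = [
    ("invasive ductal carcinoma grade 2 present", "malignant"),
    ("infiltrating carcinoma with lymph node metastasis", "malignant"),
    ("adenocarcinoma present margins involved", "malignant"),
    ("benign fibroadenoma no evidence of malignancy", "benign"),
    ("benign fibrocystic change no atypia", "benign"),
    ("hyperplastic polyp benign no dysplasia", "benign"),
]


def unigrams(text):
    """Lower-case the text and split it into single-word alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def chi_square_scores(reports, positive="malignant"):
    """Score each term by the chi-square statistic of its 2x2 table
    (term present/absent versus positive/negative class)."""
    n = len(reports)
    n_pos = sum(1 for _, label in reports if label == positive)
    doc_freq = Counter()  # number of reports containing the term
    pos_freq = Counter()  # number of positive reports containing the term
    for text, label in reports:
        for term in set(unigrams(text)):
            doc_freq[term] += 1
            if label == positive:
                pos_freq[term] += 1
    scores = {}
    for term, df in doc_freq.items():
        a = pos_freq[term]   # term present, positive class
        b = df - a           # term present, negative class
        c = n_pos - a        # term absent, positive class
        d = n - n_pos - b    # term absent, negative class
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_k(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


scores = chi_square_scores(REPORTS)
selected = top_k(scores, 4)
print(selected)
```

The selected terms would then form the vocabulary of the feature vectors passed to a classifier such as RF or SVM; varying `k` corresponds to evaluating the classifiers across different feature sizes.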

Summary

Introduction

According to Stefan [2], adequate attention to cancer diagnoses is needed to improve the overall health of South Africans. Accurate diagnosis is a major concern in the health care system for optimal prognosis and treatment of cancer [3]. A cancer pathology report is a valuable medical document that provides information for the clinical management of the patient and the evaluation of health care [4]. The health care system needs to learn from past cancer pathology reports to improve cancer prognosis and the overall health of cancer patients. The overall aim of these studies was to evaluate the quality of pathology reports in order to improve conformity to an established international or national standard.
