Abstract

We explored various Machine Learning (ML) models to evaluate how well each performs at classifying histopathology reports. We trained, optimized, and performed classification with Stochastic Gradient Descent (SGD), Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbor (KNN), Adaptive Boosting (AB), Decision Trees (DT), Gaussian Naïve Bayes (GNB), Logistic Regression (LR), and a Dummy classifier. We started with 60,083 histopathology reports, which were reduced to 60,069 after pre-processing. The F1-scores for SVM, SGD, KNN, RF, DT, LR, AB, and GNB were 97%, 96%, 96%, 96%, 92%, 96%, 84%, and 88%, respectively, while the misclassification rates were 3.31%, 5.25%, 4.39%, 1.75%, 3.5%, 4.26%, 23.9%, and 19.94%, respectively. The approximate run times were 2 h, 20 min, 40 min, 8 h, 40 min, 10 min, 50 min, and 4 min, respectively. RF had the longest run time but the lowest misclassification rate on the labeled data. Our study demonstrated the feasibility of applying ML techniques to the processing of free-text pathology reports for cancer registries, in support of cancer incidence reporting in a sub-Saharan African setting. This matters for resource-constrained environments, which can leverage ML techniques to reduce workloads and improve the timeliness of cancer statistics reporting.
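To make the misclassification rate and the role of a Dummy classifier concrete, here is a minimal pure-Python sketch. It assumes the Dummy classifier acts as a majority-class baseline, which the abstract does not state. The function names, labels, and toy data are all hypothetical and are not taken from the study.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a predictor that always outputs the most frequent training label.

    Assumption: this mimics a 'most frequent' dummy strategy; the study's
    actual Dummy classifier configuration is not given in the abstract.
    """
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return lambda reports: [most_common_label for _ in reports]


def misclassification_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true label (1 - accuracy)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


# Toy data, invented purely for illustration.
train_labels = ["malignant", "malignant", "non-malignant", "malignant"]
test_reports = ["report a", "report b", "report c", "report d"]
test_labels = ["malignant", "non-malignant", "malignant", "malignant"]

predict = majority_baseline(train_labels)
baseline_rate = misclassification_rate(test_labels, predict(test_reports))
print(f"baseline misclassification rate: {baseline_rate:.2%}")
```

A trained model is only informative if its misclassification rate falls clearly below that of such a baseline on the same test data.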

Highlights

  • The South African National Cancer Registry (NCR) is responsible for the registration of all malignancies, including histopathologically diagnosed malignancies, and annual reporting of cancer statistics for South Africa (SA) [1,2]

  • We evaluated our models by calculating the accuracy, precision, recall, F1-score, misclassification rate, micro-average, and macro-average

  • A total of 60,083 histology reports were registered by the National Health Laboratory Service (NHLS) for the Western Cape Province in
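The evaluation metrics named in the highlights can be illustrated with a short pure-Python sketch using their standard textbook definitions. This is not the authors' evaluation code; the helper names and the two-class toy labels are hypothetical.

```python
def per_class_counts(y_true, y_pred):
    """Count true positives, false positives, and false negatives per label."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = {label: {"tp": 0, "fp": 0, "fn": 0} for label in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted label gains a false positive
            counts[t]["fn"] += 1  # true label gains a false negative
    return counts


def precision_recall_f1(tp, fp, fn):
    """Standard definitions; returns 0.0 where a denominator would be zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_micro_f1(y_true, y_pred):
    """Macro F1 averages per-class F1; micro F1 pools counts over classes."""
    counts = per_class_counts(y_true, y_pred)
    macro = sum(precision_recall_f1(**c)[2] for c in counts.values()) / len(counts)
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    micro = precision_recall_f1(tp, fp, fn)[2]
    return macro, micro


# Toy labels, invented purely for illustration.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
macro_f1, micro_f1 = macro_micro_f1(y_true, y_pred)
print(f"macro F1 = {macro_f1:.3f}, micro F1 = {micro_f1:.3f}")
```

In single-label classification the micro-averaged F1 equals the accuracy, while the macro average weights every class equally, so it is more sensitive to performance on rare classes.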


Summary

Introduction

The South African National Cancer Registry (NCR) is responsible for the registration of all malignancies, including histopathologically diagnosed malignancies, and annual reporting of cancer statistics for South Africa (SA) [1,2]. The NCR receives over 100,000 cancer pathology reports annually from pathology laboratories in SA [1,2]. All cancer pathology reports are coded according to the International Classification of Diseases for Oncology 3rd edition (ICD-O-3), reports are de-duplicated to identify index cancer cases, and the cancer statistics are calculated and reported annually [1,2]. The NCR receives pathology reports from both private and public laboratories throughout SA. Trained data coders perform medical data abstraction and code the malignant reports using the ICD-O-3 topography and morphology classification for downstream analysis [3]. The medical data abstraction process is labor-intensive,

Information 2020, 11, 455; doi:10.3390/info11090455 www.mdpi.com/journal/information

