Classification of imbalanced oral cancer image data from high-risk population.

Bofan Song, Keerthi Gurushanth, Vijay Pillai, Sanjana Patrick, Praveen Birur, Subhashini Raghavan, Amritha Suresh, Vivek Shetty, Alben Sigamani, Moni Abraham Kuriakose, Rongguang Liang, Nirza Mukhia, Sumsum Sunny, Trupti Kolur, Shubha Gurudath, Pramila Mendonca, Shaobai Li, Imchen Tsusennaro, Petra Wilder‐Smith, Shirley T Leivon, T.w Peterson, Rohan Michael Ramesh, Vidya Bhushan Rangappa

doi:10.1117/1.jbo.26.10.105001

Abstract

Significance: Early detection of oral cancer is vital for high-risk patients, and machine learning-based automatic classification is ideal for disease screening. However, current datasets collected from high-risk populations are unbalanced, which often has detrimental effects on classification performance.

Aim: To reduce the class bias caused by data imbalance.

Approach: We collected 3851 polarized white light cheek mucosa images using our customized oral cancer screening device. We used weight balancing, data augmentation, undersampling, focal loss, and ensemble methods to improve the neural network performance of oral cancer image classification with the imbalanced multi-class datasets captured from high-risk populations during oral cancer screening in low-resource settings.

Results: By applying both data-level and algorithm-level approaches to the deep learning training process, the performance on the minority classes, which were difficult to distinguish at the beginning, has been improved. The accuracy of the “premalignancy” class also increased, which is ideal for screening applications.

Conclusions: Experimental results show that the class bias induced by imbalanced oral cancer image datasets could be reduced using both data- and algorithm-level methods. Our study may provide an important basis for understanding the influence of unbalanced datasets on oral cancer deep learning classifiers and how to mitigate it.

Highlights

Oral cancer is a common disease in low- and middle-income countries
Automatic oral cancer image classification algorithms based on machine learning enable a system to learn from previous data and, based on the learning, predict and give results on new unseen data
We have investigated and compared the performance of different approaches for imbalanced oral cancer image classification

Summary

Introduction

Oral cancer is a common disease in low- and middle-income countries. Detection of oral cancer is believed to be the most effective way to prevent it. Automatic oral cancer image classification algorithms based on machine learning enable a system to learn from previous data and, based on that learning, predict results on new, unseen data. Oral cancer datasets captured from high-risk populations are often unbalanced because there are many more normal cases than benign, premalignant, and malignant cases. The discrimination of minority classes is of great clinical significance. A false positive result for benign lesions in cancer screening will result in unnecessary psychological stress and medical procedures for patients, as well as increased clinical workloads. Classifiers need high sensitivity because they are designed for cancer screening. One of the big challenges is handling class imbalance, especially for multi-class classification.
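To make the class-imbalance problem concrete, the following is a minimal sketch of weight balancing and focal loss, two of the algorithm-level methods named in the abstract. It is not the authors' implementation: the inverse-frequency weighting scheme, the function names, the focusing parameter `gamma=2.0`, and the toy label counts are all assumptions chosen for illustration, and the multi-class focal loss shown here is a commonly used form rather than one confirmed by the source.

```python
import numpy as np


def inverse_frequency_weights(labels, num_classes):
    """Weight class c by N / (K * count_c), so rarer classes get larger weights."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts = np.maximum(counts, 1.0)  # guard against empty classes
    return labels.size / (num_classes * counts)


def focal_loss(probs, labels, class_weights=None, gamma=2.0):
    """Mean multi-class focal loss: -w_c * (1 - p_t)^gamma * log(p_t).

    probs  : (N, K) array of predicted class probabilities (rows sum to 1)
    labels : (N,) array of integer class indices
    With gamma = 0 and no weights this reduces to ordinary cross-entropy.
    """
    eps = 1e-12  # numerical floor inside the log
    p_t = probs[np.arange(labels.size), labels]  # probability of the true class
    loss = -((1.0 - p_t) ** gamma) * np.log(p_t + eps)
    if class_weights is not None:
        loss = loss * class_weights[labels]
    return loss.mean()


# Hypothetical toy distribution (NOT the paper's data): four classes,
# standing in for normal, benign, premalignant, and malignant.
labels = np.array([0] * 80 + [1] * 10 + [2] * 6 + [3] * 4)
weights = inverse_frequency_weights(labels, num_classes=4)

# An uninformative classifier that predicts every class with probability 0.25.
uniform_probs = np.full((labels.size, 4), 0.25)
plain_ce = focal_loss(uniform_probs, labels, gamma=0.0)
weighted_focal = focal_loss(uniform_probs, labels, class_weights=weights)
```

In this sketch the majority class receives a weight below 1 and the rarest class the largest weight, so errors on minority classes contribute more to the training loss. The `(1 - p_t)^gamma` factor additionally down-weights examples the classifier already predicts confidently, which are typically dominated by the majority class.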

Objectives

Methods

Results

Discussion

Conclusion