Abstract

Lung cancer causes more deaths globally than any other type of cancer. Detecting EGFR and KRAS mutations helps determine the best treatment, but non-invasive ways to obtain this information are not available. Furthermore, sufficiently large, relevant public datasets are often lacking, which limits the performance of single classifiers. In this paper, an ensemble approach is applied to improve EGFR and KRAS mutation prediction from a small dataset. A new voting scheme, Selective Class Average Voting (SCAV), is proposed, and its performance is assessed for both machine learning models and CNNs. For the EGFR mutation, the machine learning approach showed an increase in sensitivity from 0.66 to 0.75 and in AUC from 0.68 to 0.70. With the deep learning approach, an AUC of 0.846 was obtained, and SCAV increased the accuracy of the model from 0.80 to 0.857. For the KRAS mutation, a significant increase in performance was found in both the machine learning models (AUC from 0.65 to 0.71) and the deep learning models (AUC from 0.739 to 0.778). The results show how to learn effectively from small image datasets to predict EGFR and KRAS mutations, and that ensembles with SCAV increase the performance of machine learning classifiers and CNNs. They also provide confidence that, as large datasets become available, tools to augment clinical capabilities can be fielded.

Highlights

  • Lung cancer is the leading cause of cancer-related death in men and the second-leading cause in women

  • We analyzed the effectiveness of ensembles in predicting Epidermal Growth Factor Receptor (EGFR) and Kirsten Rat Sarcoma viral oncogene (KRAS) mutations from a small dataset; in particular, we assessed the performance of a novel voting scheme, Selective Class Average Voting (SCAV)

  • We tested this scheme with ensembles of machine learning models and ensembles of Convolutional Neural Networks (CNNs), and observed a significant improvement over the base classifiers
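
As a point of reference for the ensemble idea, the following is a minimal sketch of plain average (soft) voting and of the sensitivity metric reported above. It is not SCAV: this section does not describe SCAV's selection rule, so only the generic averaging baseline is shown. The function names, the 0.5 threshold, and the probability values are all hypothetical and chosen for illustration.

```python
from statistics import mean


def average_vote(probabilities, threshold=0.5):
    """Average each sample's positive-class probability across classifiers.

    probabilities: list of per-classifier lists, one probability per sample.
    Returns (averaged probabilities, binary predictions).
    NOTE: plain average voting, not the paper's SCAV scheme.
    """
    averaged = [mean(sample) for sample in zip(*probabilities)]
    predictions = [1 if p >= threshold else 0 for p in averaged]
    return averaged, predictions


def sensitivity(y_true, y_pred):
    """True positive rate: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else float("nan")


# Hypothetical outputs of three base classifiers on four patients
# (illustrative values only, not data from the paper).
probas = [
    [0.9, 0.4, 0.2, 0.6],
    [0.7, 0.6, 0.1, 0.3],
    [0.8, 0.7, 0.4, 0.5],
]
y_true = [1, 1, 0, 1]  # 1 = mutation present (assumed labels)

averaged, y_pred = average_vote(probas)
print(y_pred, sensitivity(y_true, y_pred))
```

In this toy example the second patient is misclassified by the first classifier alone but correctly classified after averaging, which is the kind of gain an ensemble aims for when individual classifiers are trained on little data.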


Summary

Introduction

Lung cancer is the leading cause of cancer-related death in men and the second-leading cause in women. The use of surrogate sources of DNA, such as blood, serum, and plasma samples, which often contain circulating free tumor (cft) DNA or circulating tumor cells (CTCs), is emerging as a new strategy for tumor genotyping [3]. This technique is fairly recent and still has some disadvantages. Even though recent versions of liquid biopsy techniques have been approved for clinical use, the sensitivity (true positive rate) of the test remains a weak point [3]. These concerns leave room for other non-invasive techniques that may be more effective in the early stages of cancer and may provide higher sensitivity.

