Abstract
Classifying pan-cancer samples from gene expression patterns is a crucial challenge for the accurate diagnosis and treatment of cancer patients. Machine learning algorithms are established tools for downstream analysis and for capturing deviations in gene expression patterns across diverse diseases. In the present work, we have developed PC-RMTL, a pan-cancer classification model that uses regularized multi-task learning (RMTL) to classify 21 cancer types and adjacent normal samples from RNA-Seq data obtained from TCGA. PC-RMTL outperforms five state-of-the-art classification algorithms: SVM with a linear kernel (SVM-Lin), SVM with a radial basis function kernel (SVM-RBF), random forest (RF), k-nearest neighbours (kNN), and decision trees (DT). PC-RMTL achieves 96.07% accuracy and a Matthews correlation coefficient (MCC) of 95.80% on a completely unseen independent test set. The only real competitor is SVM-Lin, which nearly matches the prediction accuracy of PC-RMTL, but only when the complete feature set is provided for training; otherwise, PC-RMTL outperforms all other classification models. To the best of our knowledge, this is a significant improvement over existing work in pan-cancer classification, which has failed to reliably distinguish many cancer types from one another. We have also compared the gene expression patterns of the top discriminating genes across the cancers and performed a functional enrichment analysis, which uncovers several interesting facts for distinguishing pan-cancer samples.
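The two metrics reported in the abstract, accuracy and MCC, can be illustrated with a minimal pure-Python sketch. It assumes the multi-class generalization of MCC (Gorodkin's R_K statistic); the abstract does not state which formulation the authors used. The labels in the toy example are hypothetical and are not taken from the study's data.

```python
from collections import Counter
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def multiclass_mcc(y_true, y_pred):
    """Multi-class Matthews correlation coefficient (R_K statistic).

    Assumption: this is the multi-class generalization of MCC; the
    source does not say which variant was used.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)                                   # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correct predictions
    t_k = Counter(y_true)   # number of times each class truly occurs
    p_k = Counter(y_pred)   # number of times each class is predicted
    labels = set(t_k) | set(p_k)
    cov_tp = c * s - sum(t_k[k] * p_k[k] for k in labels)
    cov_tt = s * s - sum(t_k[k] ** 2 for k in labels)
    cov_pp = s * s - sum(p_k[k] ** 2 for k in labels)
    denom = sqrt(cov_tt * cov_pp)
    # By convention, return 0.0 when the denominator vanishes
    # (e.g. every sample is assigned to a single class).
    return cov_tp / denom if denom else 0.0


# Hypothetical toy labels: one LUAD sample is misclassified as normal.
y_true = ["BRCA", "BRCA", "LUAD", "LUAD", "normal", "normal"]
y_pred = ["BRCA", "BRCA", "LUAD", "normal", "normal", "normal"]
print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"MCC      = {multiclass_mcc(y_true, y_pred):.4f}")
```

Unlike accuracy, MCC accounts for how predictions are distributed across all classes, which is presumably why the two are reported together for a task with many cancer types.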
Highlights
Our approach is the first to explicitly address how to learn the feature representation of multiple cancer types’ samples simultaneously
We demonstrate that PC-RMTL provides better prediction accuracy than the other competing methods, both with the differentially expressed (DE) genes and with smaller sets of features identified through the coefficients of the trained SVM-Lin model and the minimum redundancy maximal relevance (MRMR) feature selection algorithm
This provides sound evidence that PC-RMTL can be used for classification when expression values are available for only a small number of genes
Summary
We have identified the key discriminating DE genes in the pan-cancer classification task using the coefficients (weights) of the trained SVM-Lin model. We demonstrate that PC-RMTL provides better prediction accuracy than the other competing methods, both with the DE genes and with smaller sets of features (genes) identified through the coefficients (weights) of the trained SVM-Lin model and the MRMR feature selection algorithm.
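Selecting genes through the weights of a trained linear classifier can be sketched as follows. This is an illustrative sketch only, not the authors' procedure: the text does not say how per-class coefficients are combined, so the sketch assumes each gene is scored by the sum of its absolute weights across classes. The function name `rank_genes_by_weight`, the gene names, and the weight values are all hypothetical.

```python
def rank_genes_by_weight(weights, gene_names, top_k):
    """Return the top_k gene names ranked by summed absolute weight.

    weights    : list of rows, one row of per-gene coefficients per class
                 (e.g. taken from a trained linear SVM).
    gene_names : list of gene identifiers, aligned with the columns.
    Assumption: importance = sum over classes of |weight|.
    """
    if any(len(row) != len(gene_names) for row in weights):
        raise ValueError("each weight row needs one entry per gene")
    scores = [
        sum(abs(row[j]) for row in weights)
        for j in range(len(gene_names))
    ]
    # Sort gene indices from the largest score to the smallest.
    order = sorted(range(len(gene_names)), key=lambda j: scores[j], reverse=True)
    return [gene_names[j] for j in order[:top_k]]


# Hypothetical weights for 3 classes x 3 genes.
toy_weights = [
    [0.1, -2.0, 0.3],
    [0.5, 0.1, -0.2],
    [-0.4, 1.0, 0.0],
]
toy_genes = ["GENE_A", "GENE_B", "GENE_C"]
top_genes = rank_genes_by_weight(toy_weights, toy_genes, top_k=2)
print(top_genes)
```

The reduced gene list would then be used to retrain and compare the classifiers, which is the small-feature-set setting in which PC-RMTL is reported to hold its advantage.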