Abstract

Background

Learning accurate models from ’omics data poses many challenges owing to their inherent high dimensionality, e.g. the number of gene expression variables, and comparatively small sample sizes, which lead to ill-posed inverse problems. Furthermore, the presence of outliers, either experimental errors or interesting abnormal clinical cases, may severely hamper the correct classification of patients and the identification of reliable biomarkers for a particular disease. We propose to address this problem through an ensemble classification setting based on distinct feature selection and modeling strategies, including logistic regression with elastic net regularization, Sparse Partial Least Squares - Discriminant Analysis (SPLS-DA) and Sparse Generalized PLS (SGPLS), coupled with an evaluation of each individual’s outlierness based on Cook’s distance. The consensus is achieved with the Rank Product statistic corrected for multiple testing, which gives a final list of observations sorted by their outlierness level.

Results

We applied this strategy to the classification of Triple-Negative Breast Cancer (TNBC) RNA-Seq and clinical data from The Cancer Genome Atlas (TCGA). The 24 detected outliers were identified as putative mislabeled samples: individuals with discrepant clinical labels for the HER2 receptor, but also individuals whose expression values of ER, PR and HER2 contradict the corresponding clinical labels, which may invalidate the initial TNBC label. Moreover, the model consensus approach leads to the selection of a set of genes that may be linked to the disease. These results are robust to resampling, whether by selecting a subset of patients or a subset of genes, with a significant overlap among the outlier patients identified.

Conclusions

The proposed ensemble outlier detection approach constitutes a robust procedure to identify abnormal cases and consensus covariates, which may improve biomarker selection for precision medicine applications. The method can also be easily extended to other regression models and datasets.
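The consensus step described in the abstract can be illustrated with a minimal sketch. It assumes that each model has already produced one influence score per patient (for example a Cook's distance), ranks patients within each model so that the most influential one gets rank 1, and combines the ranks through their geometric mean, which is the usual form of the Rank Product. The function names and the toy scores are hypothetical and are not taken from the paper; the significance test and the multiple-testing correction mentioned above are omitted.

```python
import math


def ranks_descending(scores):
    """Rank scores so that the largest value gets rank 1 (ties keep input order)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def rank_product(score_table):
    """Geometric mean of the per-model ranks of each observation.

    score_table holds one list of influence scores per model,
    all of the same length (one entry per observation).
    """
    per_model_ranks = [ranks_descending(scores) for scores in score_table]
    n_models = len(per_model_ranks)
    n_obs = len(per_model_ranks[0])
    return [
        math.exp(sum(math.log(r[i]) for r in per_model_ranks) / n_models)
        for i in range(n_obs)
    ]


# Hypothetical influence scores: 3 models x 5 patients (toy values, not real data).
toy_scores = [
    [0.02, 0.90, 0.05, 0.40, 0.01],
    [0.03, 0.70, 0.10, 0.80, 0.02],
    [0.01, 0.95, 0.04, 0.30, 0.02],
]

rp = rank_product(toy_scores)
# A smaller rank product means the patient is ranked as influential by more models.
ordering = sorted(range(len(rp)), key=lambda i: rp[i])
print(ordering)
```

In this toy example the patients at indices 1 and 3 come first because all three models rank them near the top. A full analysis would attach a p-value to each rank product and correct it for multiple testing before declaring outliers.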

Highlights

  • Learning accurate models from ’omics data poses many challenges owing to their inherent high dimensionality, e.g. the number of gene expression variables, and comparatively small sample sizes, which lead to ill-posed inverse problems

  • In the exploratory analysis of the Triple-Negative Breast Cancer (TNBC) data, a first logistic regression model based on the 3 variables used clinically to classify patients yielded significance only for estrogen receptor (ER) and human epidermal growth factor receptor type 2 (HER2)

  • In a search for potential confounding variables, carried out before outlier detection based on gene expression data, univariate logistic regression and Fisher’s exact test identified race and age as significant (p < 0.05) for the outcome (TNBC vs. non-TNBC)


Summary

Introduction

Lopes et al. BMC Bioinformatics (2018) 19:168

Learning accurate models from ’omics data poses many challenges owing to their inherent high dimensionality, e.g. the number of gene expression variables, and comparatively small sample sizes, which lead to ill-posed inverse problems. Two challenges stand out when learning regression models from this flood of ’omics data. First, genomic datasets are high-dimensional: they hold measurements of thousands of genes (the p covariates) for each individual, and these covariates are often highly correlated and outnumber the N cases enrolled in the study. This crucial N ≪ p, or high-dimensional, problem occurs very frequently in patient ’omics data and may cause instability in the selected driver genes and poor performance of predictive models [4]. Second, genomic data usually contain abnormal variable measurements arising from many sources (e.g., experimental errors). These may be regarded as potential outliers that can lead to an incorrect labeling or classification of patients and, ultimately, to failure of the cancer treatment. Outlier patients must therefore be identified so that they can be investigated further.

