Abstract

Background: Feature extraction (FE) is difficult, particularly if there are more features than samples, as small sample numbers often result in biased outcomes or overfitting. Furthermore, multiple sample classes often complicate FE because evaluating performance, which is usual in supervised FE, is generally harder than in the two-class problem. Developing unsupervised methods that are independent of sample classification would solve many of these problems.

Results: Two principal component analysis (PCA)-based FE methods were tested as sample-classification-independent unsupervised FE methods: variational Bayes PCA (VBPCA), which was extended to perform unsupervised FE, and conventional PCA (CPCA)-based unsupervised FE. Both performed well when applied to simulated data and to a posttraumatic stress disorder (PTSD)-mediated heart disease data set that had multiple categorical class observations in mRNA/microRNA expression of stressed mouse heart. A critical set of PTSD miRNAs/mRNAs was identified that showed aberrant expression between treatment and control samples and significant negative correlation with one another. Moreover, greater stability and biological feasibility than conventional supervised FE were demonstrated. Based on the results obtained, in silico drug discovery was performed as translational validation of the methods.

Conclusions: Our two proposed unsupervised FE methods (CPCA- and VBPCA-based) worked well on simulated data and outperformed two conventional supervised FE methods on a real data set. The results thus suggest that the two methods are equivalent for FE on categorical multiclass data sets, with potential translational utility for in silico drug discovery.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0574-4) contains supplementary material, which is available to authorized users.

Highlights

  • Feature extraction (FE) is difficult if there are more features than samples, as small sample numbers often result in biased outcomes or overfitting

  • By applying conventional principal component analysis (PCA)-based unsupervised FE (CPCAFE) and variational Bayes PCA (VBPCA)-based unsupervised FE (VBPCAFE) to publicly available mRNA and microRNA expression data [8] from stressed mouse hearts, we identified aberrantly expressed miRNAs and mRNAs

  • The data set consisted of 100 simulated ensembles of 20 samples with 100 features, of which only 10 features were distinct between four classes, and with each class consisting of 5 samples
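
The simulated design in the last highlight can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' procedure: the Gaussian noise model, the size of the class shift (`shift`), and the rule of ranking features by their absolute score on the first principal component are all hypothetical choices made here so the example is runnable. Only the dimensions (100 ensembles, 20 samples, 100 features, 10 distinct features, four classes of 5 samples) come from the text.

```python
import numpy as np

# Dimensions taken from the highlight above.
N_CLASSES = 4
SAMPLES_PER_CLASS = 5
N_FEATURES = 100
N_DISTINCT = 10
N_ENSEMBLES = 100


def simulate_dataset(rng, shift=2.0):
    """Return (X, labels) with X of shape (samples, features).

    Assumption: unit-variance Gaussian noise everywhere, and the first
    N_DISTINCT features are shifted by `shift * class_index`.
    """
    labels = np.repeat(np.arange(N_CLASSES), SAMPLES_PER_CLASS)
    X = rng.standard_normal((labels.size, N_FEATURES))
    X[:, :N_DISTINCT] += shift * labels[:, None]
    return X, labels


def rank_features_by_pc1(X):
    """Embed features (not samples) with PCA and rank them by |PC1 score|.

    Assumption: this ranking rule is an illustrative stand-in for a
    PCA-based unsupervised selection step. Class labels are never used.
    """
    F = X.T                                 # rows are features
    F = F - F.mean(axis=0, keepdims=True)   # center each sample's profile
    U, s, _ = np.linalg.svd(F, full_matrices=False)
    pc1_scores = U[:, 0] * s[0]             # feature coordinates on PC1
    return np.argsort(-np.abs(pc1_scores))  # most extreme features first


def mean_recovery(seed=0):
    """Average fraction of truly distinct features found in the top 10."""
    rng = np.random.default_rng(seed)
    hits = []
    for _ in range(N_ENSEMBLES):
        X, _ = simulate_dataset(rng)
        top = rank_features_by_pc1(X)[:N_DISTINCT]
        hits.append(np.sum(top < N_DISTINCT) / N_DISTINCT)
    return float(np.mean(hits))


if __name__ == "__main__":
    print(f"mean fraction of distinct features recovered: {mean_recovery():.2f}")
```

Because the ranking never sees the class labels, the same call works unchanged for any number of classes, which is the practical appeal of an unsupervised approach in the multiclass setting.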


Summary

Introduction

Feature extraction (FE) is difficult if there are more features than samples, as small sample numbers often result in biased outcomes or overfitting. Multiple sample classes often complicate FE because evaluating performance, which is usual in supervised FE, is generally harder than in the two-class problem. Developing unsupervised methods that are independent of sample classification would solve many of these problems. FE is often regarded as dimensional reduction used only to eliminate redundancy, but this is far from true. FE can be used to eliminate redundancy, but it is also often performed to maximize the performance of targeted tasks (supervised FE), e.g., discrimination between samples or regression analysis. In that setting, having fewer samples than features often creates difficulties due to overfitting and/or bias. Multiple class samples commonly cause additional problems when supervised FE is used, because performance evaluation is more complicated than with two-class samples.
