Abstract
The large p small n problem remains a challenge for which no de facto standard method is available. In this study, we propose a tensor-decomposition (TD)-based unsupervised feature extraction (FE) formalism applied to multiomics datasets in which the number of features exceeds 100,000 whereas the number of samples is only about 100, hence constituting a typical large p small n problem. When applied to synthetic datasets, the proposed TD-based unsupervised FE outperformed conventional supervised feature selection methods, namely random forest, categorical regression (also known as analysis of variance, or ANOVA), and penalized linear discriminant analysis, as well as two unsupervised methods, multiple non-negative matrix factorization and principal component analysis (PCA)-based unsupervised FE; when applied to multiomics datasets, it outperformed the four methods other than PCA-based unsupervised FE. The genes selected by TD-based unsupervised FE were enriched in genes known to be related to the tissues and transcription factors measured. TD-based unsupervised FE was thus demonstrated to be not only a superior feature selection method but also one that can select biologically reliable genes. To our knowledge, this is the first study in which TD-based unsupervised FE has been successfully applied to the integration of this variety of multiomics measurements.
Highlights
The term “big data” often indicates a high number of instances as well as features [1,2]. Typical big data comprise a few million images, each composed of several million pixels. The number of features is as high as the number of instances, or often higher.
Taguchi has proposed a method that differs markedly from the typical machine-learning methods applicable to large p small n problems: tensor-decomposition (TD)-based unsupervised feature extraction (FE) [17]. An m-mode tensor is associated with more than two suffixes, whereas a matrix is associated with two suffixes, row and column.
As many as 1785 protein-coding genes can be counted in these regions, which is much higher than expected. This indicates that TD-based unsupervised FE can select genomic regions that include protein-coding genes, correctly considering the altered multiomics variables between normal and tumor tissues. Non-coding RNAs have a key role in regulating the behavior of cells, and their over- and underexpression is strongly correlated with cancer.
Summary
The term “big data” often indicates a high number of instances as well as features [1,2]. Taguchi has proposed a method that differs markedly from the typical machine-learning methods applicable to large p small n problems: tensor-decomposition (TD)-based unsupervised feature extraction (FE) [17]. An m-mode tensor is associated with more than two suffixes, whereas a matrix is associated with two suffixes, row and column. In this method, a smaller number of representative features, referred to as singular value vectors, are generated as linear combinations of the original large number of features, without considering labels. For data covering “multiple tissues”, a tensor is a more reasonable format than a matrix, so TD-based unsupervised FE is a more suitable method than PCA-based unsupervised FE.
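To make the idea of singular value vectors concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a higher-order singular value decomposition (HOSVD) as the tensor decomposition, which this summary does not specify, and it uses a small random tensor with hypothetical dimensions (200 features, 6 samples, 3 omics layers) in place of real multiomics data. The helper names `unfold`, `mode_product`, and `hosvd` are illustrative only.

```python
import numpy as np


def unfold(tensor, mode):
    """Matricize a tensor: rows index `mode`, columns index all other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    """Multiply `tensor` along `mode` by `matrix` (shape: new_dim x old_dim)."""
    return np.moveaxis(np.tensordot(matrix, tensor, axes=(1, mode)), 0, mode)


def hosvd(tensor):
    """Return the core tensor and one matrix of singular value vectors per mode."""
    factors = []
    for mode in range(tensor.ndim):
        # Left singular vectors of each unfolding are that mode's
        # singular value vectors (orthonormal columns).
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u)
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)
    return core, factors


# Hypothetical toy data: 200 features x 6 samples x 3 omics layers.
rng = np.random.default_rng(0)
tensor = rng.normal(size=(200, 6, 3))

core, factors = hosvd(tensor)

# Rebuild the tensor from the core and the singular value vectors.
reconstructed = core
for mode, u in enumerate(factors):
    reconstructed = mode_product(reconstructed, u, mode)

feature_vectors, sample_vectors, omics_vectors = factors
```

No labels enter the computation at any point, which is what makes the procedure unsupervised. In this sketch, each column of `sample_vectors` is a linear combination over the unfolded features and omics layers, and each column of `feature_vectors` assigns a weight to every feature; how those weights are turned into a selected gene set is not covered by this summary and is not shown here.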