Abstract
In the last decade, advances in high-throughput technologies such as DNA microarrays have made it possible to simultaneously measure the expression levels of tens of thousands of genes and proteins. This has resulted in large amounts of biological data requiring analysis and interpretation. Nonnegative matrix factorization (NMF) was introduced as an unsupervised, parts-based learning paradigm in which a nonnegative matrix V is approximated by the product of two nonnegative matrices, W and H, computed via a multiplicative update algorithm. In the context of a p×n gene expression matrix V consisting of observations on p genes from n samples, each column of W defines a metagene, and each column of H represents the metagene expression pattern of the corresponding sample. NMF has been primarily applied in an unsupervised setting in image and natural language processing. More recently, it has been successfully utilized in a variety of applications in computational biology. Examples include molecular pattern discovery, class comparison and prediction, cross-platform and cross-species analysis, functional characterization of genes, and biomedical informatics. In this paper, we review this method as a data analytical and interpretive tool in computational biology with an emphasis on these applications.
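To make the decomposition concrete, the following is a minimal sketch of multiplicative updates for the squared Euclidean (Frobenius) loss, assuming NumPy. The function name `nmf_multiplicative`, the rank `k`, the iteration count, the stabilizing constant `eps`, and the toy data are illustrative assumptions rather than details taken from the paper, and other loss functions lead to different update rules.

```python
import numpy as np

def nmf_multiplicative(V, k, n_iter=500, eps=1e-9, seed=0):
    """Approximate a nonnegative p x n matrix V by W @ H.

    W is p x k (columns are metagenes) and H is k x n (columns are the
    metagene expression patterns of the samples). Hypothetical helper,
    written for illustration only.
    """
    rng = np.random.default_rng(seed)
    p, n = V.shape
    # Random positive starting point; the result depends on this choice.
    W = rng.random((p, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Element-wise multiplicative updates keep W and H nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "expression" matrix: 50 genes, 12 samples, built from 3 hidden factors.
rng = np.random.default_rng(1)
V = rng.random((50, 3)) @ rng.random((3, 12))

W, H = nmf_multiplicative(V, k=3)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {rel_err:.4f}")
```

Because the starting point is random, different seeds generally give different factorizations of the same matrix.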
Highlights
The rapid development in high-throughput technologies in the past decade has given rise to large-scale biological data in the form of expression profiles of tens of thousands of genes and proteins, often with only a handful of tissue samples
In class comparison, the objective is to identify genes that are differentially expressed between the classes of interest; in class prediction, the emphasis is on building a predictive gene set from the class labels and expression profiles of known samples, and applying it to a new sample to predict its class
We review nonnegative matrix factorization (NMF) and its applications in computational biology, with an emphasis on the analysis and interpretation of high-throughput biological data such as those above
Summary
The rapid development in high-throughput technologies in the past decade has given rise to large-scale biological data in the form of expression profiles of tens of thousands of genes and proteins, often with only a handful of tissue samples. Dimensionality reduction and visualization are key aspects in effectively analyzing and interpreting the high-dimensional data in this setting. Such unsupervised approaches are useful and relevant when there is no a priori knowledge of the expected gene expression patterns for a given set of genes or for any phenotype (such as experimental condition, tissue type, or patient). In studies where such prior knowledge is available, the focus is on class comparison or class prediction. We also examine how the stochastic nature of NMF can be used to select an appropriate model for a given dataset and to implement the algorithm faster.
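One way the stochastic nature of the algorithm could inform model selection is sketched below, assuming NumPy: the factorization is rerun from several random starting points for each candidate rank, and the stability of the resulting sample clusters is compared. The consensus matrix and the dispersion score used here are assumptions chosen for illustration, not procedures stated in the text above, and the helper names and toy data are hypothetical.

```python
import numpy as np

def nmf(V, k, seed, n_iter=200, eps=1e-9):
    """Compact multiplicative-update NMF (Frobenius loss), illustration only."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], k)) + eps
    H = rng.random((k, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def consensus_matrix(V, k, n_runs=20):
    """Fraction of runs in which each pair of samples shares a cluster."""
    n = V.shape[1]
    C = np.zeros((n, n))
    for seed in range(n_runs):
        _, H = nmf(V, k, seed)
        # Assign each sample to the metagene with the largest coefficient.
        labels = H.argmax(axis=0)
        C += labels[:, None] == labels[None, :]
    return C / n_runs

def dispersion(C):
    """Equals 1 when every entry of C is 0 or 1; lower means less stable."""
    return float(np.mean(4.0 * (C - 0.5) ** 2))

# Toy data: 40 genes, 10 samples drawn from two groups plus small noise.
rng = np.random.default_rng(0)
profiles = rng.random((40, 2))
membership = np.repeat(np.eye(2), 5, axis=1)  # 2 x 10 group indicator
V = profiles @ membership + 0.05 * rng.random((40, 10))

scores = {k: dispersion(consensus_matrix(V, k)) for k in (2, 3, 4)}
for k, score in scores.items():
    print(f"k={k}: dispersion={score:.3f}")
```

Under this heuristic, a rank whose consensus matrix stays close to 0/1 across restarts would be preferred; other stability measures could be substituted for the dispersion score.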