Abstract
Background
Molecular profiling generates abundance measurements for thousands of gene transcripts in biological samples such as normal and tumor tissues (data points). Given such two-class high-dimensional data, many methods have been proposed for classifying data points into one of the two classes. However, finding very small sets of features able to correctly classify the data is problematic as the fundamental mathematical proposition is hard. Existing methods can find "small" feature sets, but give no indication of how close these are to the true minimum size. Without fundamental mathematical advances, finding true minimum-size sets will remain elusive, and more importantly for the microarray community there will be no methods for finding them.
Results
We use the brute force approach of exhaustive search through all genes, gene pairs (and for some data sets gene triples). Each unique gene combination is analyzed with a few-parameter linear-hyperplane classification method looking for those combinations that form training error-free classifiers. All 10 published data sets studied are found to contain predictive small feature sets. Four contain thousands of gene pairs and 6 have single genes that perfectly discriminate.
Conclusion
This technique discovered small sets of genes (3 or fewer) in published data that form accurate classifiers, yet were not reported in the prior publications. This could be a common characteristic of microarray data, thus making looking for them worth the computational cost. Such small gene sets could indicate biomarkers and portend simple medical diagnostic tests. We recommend checking for small gene sets routinely. We find 4 gene pairs and many gene triples in the large hepatocellular carcinoma (HCC, liver cancer) data set of Chen et al. The key component of these is the "placental gene of unknown function", PLAC8. Our HMM modeling indicates PLAC8 might have a domain like part of 1P59's crystal structure (a Non-Covalent Endonuclease III-DNA Complex).
The previously identified HCC biomarker gene, glypican 3 (GPC3), is part of an accurate gene triple involving MT1E and ARHE. We also find small gene sets that distinguish leukemia subtypes in the large pediatric acute lymphoblastic leukemia cancer set of Yeoh et al.
Highlights
Molecular profiling generates abundance measurements for thousands of gene transcripts in biological samples such as normal and tumor tissues
Small sets of genes able to accurately classify two-class microarray data occur in many real-world data sets
Many members of the pairs discovered here have known associations with cancer and indicate the possibility of simple, accurate medical diagnostic tests based on such results
Summary
Molecular profiling generates abundance measurements for thousands of gene transcripts in biological samples such as normal and tumor tissues (data points). A recent exemplar employed cDNA microarrays to assay 6605 clones from normal liver and liver cancer (hepatocellular carcinoma) tissues [1]. Given such two-class high-dimensional data, one analytical task is identifying a "small" subset of features. Existing classification and feature selection techniques can be employed to ascertain the cardinality of a feature subset yielding a classifier that generalizes well, i.e., one which makes zero (or few) errors in assigning the class of an unseen data point. Application of these approaches to a data set results in the definition of one discriminatory subset with tens to hundreds of features and requiring similar numbers of free parameters. A multiplicity of error-free linear classifiers constructed from few features could facilitate the creation of cost-effective clinical tests and guide further basic research.
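
The exhaustive search described in the abstract (every single gene, then every gene pair, each tested for a training error-free linear classifier) can be illustrated with a minimal sketch. The source does not name the linear-hyperplane method used, so a plain perceptron with a capped number of epochs stands in for it here as an assumption. The function names, the gene names, and the toy expression values are all hypothetical and are not taken from any of the data sets studied.

```python
from itertools import combinations


def perceptron_separates(points, labels, max_epochs=1000):
    """Return True if a perceptron reaches zero training errors.

    points: list of feature tuples, labels: list of +1 / -1.
    Assumption: the perceptron is a stand-in for the unspecified
    few-parameter linear classifier mentioned in the abstract.
    """
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(points, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # misclassified or on the boundary
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                errors += 1
        if errors == 0:  # one full pass with no mistakes
            return True
    return False


def error_free_gene_sets(expression, labels, max_size=2, max_epochs=1000):
    """Exhaustively test every gene combination up to max_size.

    expression: dict mapping gene name -> list of values, one per sample.
    Returns the combinations that admit an error-free linear classifier.
    """
    genes = sorted(expression)
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(genes, size):
            points = [
                tuple(expression[g][i] for g in combo)
                for i in range(len(labels))
            ]
            if perceptron_separates(points, labels, max_epochs):
                found.append(combo)
    return found


# Hypothetical toy data: 3 samples of class +1, then 3 samples of class -1.
labels = [1, 1, 1, -1, -1, -1]
expression = {
    "geneA": [2.0, 3.0, 4.0, -2.0, -3.0, -4.0],  # separates on its own
    "geneB": [-1.0, 0.0, 1.0, 0.0, 1.0, 2.0],    # overlaps on its own
    "geneC": [0.0, 1.0, 2.0, -1.0, 0.0, 1.0],    # overlaps on its own
}
found = error_free_gene_sets(expression, labels)
print(found)
```

In this toy example neither geneB nor geneC discriminates alone, but the pair does, because the difference geneC - geneB has opposite signs in the two classes. A capped perceptron can only err in one direction: it may miss a separable combination whose margin is very small, but it never reports a combination that is not separable, since it returns True only after a full pass with no mistakes. The cost grows with the number of combinations (n genes give n(n-1)/2 pairs), which is the computational expense the abstract refers to.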