Data Mining in Genomics and Proteomics

Halima Bensmail, Abdelali Haoudi

doi:10.1155/jbb.2005.63

Abstract

There is no doubt that both computational biology and bioinformatics, and the interface of computer science and biology in general, are central to the future of biological research. The disciplines span a process that begins with data collection, analysis, classification, and integration, and ends with interpretation, modeling, visualization, and prediction. Data mining plays a role in the middle of this process. Overall, the focus is on identifying opportunities and developing computational solutions (including algorithms, models, tools, and databases) that can be used for experimental design, data analysis and interpretation, and hypothesis generation. Data mining is the search for hidden trends within large sets of data. Data mining approaches are needed at all levels of genomics and proteomics analyses. These studies can provide a wealth of information and rapidly generate large quantities of data from the analysis of biological specimens from healthy and diseased tissues. The high dimensionality of the data generated from these studies will require the development of improved bioinformatics and computational biology tools for efficient and accurate data analyses. This issue of the Journal of Biomedicine and Biotechnology consists of seventeen papers that describe different applications of data mining to both genomics and proteomics studies in yeast and in plant and human cells and tissues. Papers by Bensmail et al, Ghosh and Chinnaiyan, and Mao et al present different classification and clustering approaches for disease biomarker discovery. Genomics and proteomics studies have shown great promise and have been used to generate expression profiles and elucidate expression networks in different organisms, as shown in the papers by Samsa et al, Mungur et al, Liu et al, Baldwin et al, and Joy et al.
Data mining in genomics and proteomics studies reveals new regulatory pathways and mechanisms in different health and disease conditions, as presented by Wren and Garner, and provides comparative sequence analysis approaches, as presented by Gambin and Otto and by Gao et al. Those studies have also provided approaches for subcellular localization of proteins, suggesting that such approaches can produce an objective systematics for protein location and provide an important starting point for discovering sequence motifs that determine localization, as presented by Chen and Murphy. Chen et al studied the performance of five nonparametric tests for gene selection and showed that the popular F test does not perform well on gene expression data, because heterogeneity is the dominant behavior in such data. Corder et al explored a statistical approach called grade of membership (GOM) and found that brain hypoperfusion contributes to dementia, possibly to Alzheimer's disease (AD) pathogenesis, which raises the possibility that the APOE ϵ4 allele contributes directly to heart valve and myocardial damage. Hand and Heard present in their review article various tools for finding relevant subgroups in gene expression data. Alkharouf et al use an online analytical processing (OLAP) cube to mine a time series experiment designed to identify genes associated with resistance of soybean to the soybean cyst nematode, a devastating pest of soybean. Brylinski et al created a sequence-to-structure library based on the complete PDB database. Early-stage folding conformation and information entropy were then used for structure analysis and classification. Whilst postgenomic science is producing vast data torrents, it is well known that data do not equal knowledge, so extracting the most meaningful parts of these data is key to generating useful new knowledge.
More sophisticated data mining strategies are needed for mining such high-dimensional data to generate useful relationships, rules, and predictions.
