Towards cognitive analysis of DNA

Witold Kinsner

doi:10.1109/coginf.2010.5599728

Abstract

Deoxyribonucleic acid (DNA) has become one of the most examined molecules on the planet. Scientists around the world have been trying to unravel its secrets for many purposes. For example, genetic information is currently used to raise better plants and animals, to create enhanced pharmaceuticals for humans, and to support gene therapy in medicine. Science as a whole has benefited from the study of genetics because of the increased understanding of the biological processes that all organisms share. In recent decades, a significant amount of research has been directed towards sequencing and understanding the entire human genome through the Human Genome Project (HGP), launched in 1986. The goal of the HGP was to locate the approximately 1×10⁵ human genes and to read the entire sequence of the human genome (about 3×10⁹ base pairs, bp). The exponential growth rate of that research resulted in the goal being reached by 2003. Similarly, the speed of finding genes and their locations is also increasing rapidly. On the other hand, the traditional methods of finding genes and their locations on chromosomes by testing their biological function have been inherently slow. Although numerous faster techniques have been developed, there is still a need to augment them with new approaches. Therefore, robust computational solutions to the gene-finding problem could provide a valuable resource for the HGP and for the molecular-biology community. Most current research on deciphering the meaning of DNA sequences approaches the problem from the lowest, base-pair level. Its main objective is to search for patterns or correlations in the DNA sequence related to codons, amino acids, and proteins. A number of gene-finding systems have been developed in recent decades. 
These systems use a variety of sophisticated computational data-mining techniques, including neural networks, dynamic programming, rule-based methods, decision trees, probabilistic reasoning, hidden Markov chains, genetic programming, and support vector machines. Most of these approaches are based on local measures only. In addition, many of the techniques rely on the statistical qualities of exons in the gene, thus using only the known gene pool as a training set for their classification. Because these techniques have demonstrated only limited success, better ones should be developed. One approach to finding such improved techniques is to consider long-range relations (in addition to short-range relations) in the DNA sequence, spanning 10⁴ nucleotides. If we had a good technique to measure such long-range relations, we would be able to estimate any existing self-affinity (fractality) in the DNA sequence, without any a priori assumptions about its structure. This would be a data-driven approach, rather than the common model-driven approach. Along those lines, preliminary results have already been reported in the literature on a local self-similarity with a 180 bp periodicity in mammalian nuclear DNA sequences. Other publications have provided evidence that long-range fractal correlations appear in DNA sequences, with different values in different regions of the sequence. This paper describes such a multiscale approach, together with an algorithm based on multifractal analysis, and demonstrates that multifractal estimates can be used to characterize DNA sequences [1], [2], [3]. This multifractal approach appears to be new, and may provide a key to cognitive analysis of DNA sequences. It should be clear that DNA sequencing and gene-finding techniques constitute a subset of bioinformatics, the science of using information to understand biology, with its numerous tools. 
In turn, bioinformatics is a subset of computational biology, which is the application of quantitative analytical techniques to modelling biological systems. For structural biologists, DNA is very often not just a sequence of symbols, but implies 3D structures, molecular shapes and conformational changes, active sites, chemical reactions, and intermolecular interactions, each with a hierarchy of importance. What might be missing from this web of approaches to solving life-related problems is cognitive informatics with its new approaches [4].
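
The idea of measuring long-range relations in a symbol sequence without a model can be illustrated with a small sketch. The code below is not the multifractal algorithm of this paper, which the abstract does not specify; it is a simpler, single-exponent "DNA walk" fluctuation analysis, swapped in only for illustration. The purine/pyrimidine mapping, the function names, and the chosen lags are all assumptions. An exponent near 0.5 would indicate no long-range correlation, while a persistent departure from 0.5 over many scales would suggest long-range structure.

```python
import math
import random

def dna_walk(seq):
    """Cumulative walk: +1 for purines (A, G), -1 for pyrimidines (C, T).

    Assumed mapping for illustration; any other symbol raises KeyError.
    """
    steps = {"A": 1, "G": 1, "C": -1, "T": -1}
    walk = [0]
    for base in seq.upper():
        walk.append(walk[-1] + steps[base])
    return walk

def fluctuation(walk, lag):
    """Standard deviation of the walk displacement over `lag` bases."""
    diffs = [walk[i + lag] - walk[i] for i in range(len(walk) - lag)]
    mean = sum(diffs) / len(diffs)
    var = sum((d - mean) ** 2 for d in diffs) / len(diffs)
    return math.sqrt(var)

def scaling_exponent(seq, lags):
    """Least-squares slope of log F(lag) against log lag.

    Assumes F(lag) > 0 for every lag, which fails for strictly
    periodic or constant sequences.
    """
    walk = dna_walk(seq)
    xs = [math.log(lag) for lag in lags]
    ys = [math.log(fluctuation(walk, lag)) for lag in lags]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

if __name__ == "__main__":
    # Synthetic, uncorrelated sequence used as a baseline (not real DNA).
    rng = random.Random(0)
    seq = "".join(rng.choice("ACGT") for _ in range(20000))
    alpha = scaling_exponent(seq, [4, 8, 16, 32, 64, 128])
    print(f"estimated scaling exponent: {alpha:.2f}")
```

A multifractal analysis would go further by estimating a spectrum of exponents rather than this single slope, so that different regions of a sequence can be characterized by different values.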
