Histogram-based DNA analysis for the visualization of chromosome, genome and species information

António M Costa, José T Machado, Maria D Quelhas

doi:10.1093/bioinformatics/btr131

Abstract

We describe a novel approach to exploring DNA nucleotide sequence data, aiming to produce high-level categorical and structural information about the underlying chromosomes, genomes and species. The article first analyzes chromosomal data through histograms of fixed-length DNA sequences. A correlation is then computed between each pair of histograms, producing a global correlation matrix. These data serve as input to several data processing methods for information extraction and tabular/graphical output generation. A set of 18 species is processed, and the extensive results reveal that the proposed method generates significant and diversified outputs, in good accordance with current scientific knowledge in domains such as genomics and phylogenetics.

The source code, implemented in Free Pascal and UNIX scripting tools, is freely available for download at http://www4.dei.isep.ipp.pt/etc/dnapaper2010. The study input data are available online for download from University of California at Santa Cruz Genome Bioinformatics, http://hgdownload.cse.ucsc.edu/downloads.html.
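The histogram and correlation steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Free Pascal implementation. Several details are assumptions made only for illustration, since the abstract does not specify them: the word length `k`, the use of an overlapping sliding window, the skipping of words containing symbols other than A, C, G and T, and the choice of the Pearson coefficient as the correlation measure. The function names `kmer_histogram`, `pearson` and `correlation_matrix` are likewise hypothetical.

```python
from itertools import product
from math import sqrt


def kmer_histogram(seq, k=3):
    """Count every length-k word over A, C, G, T (assumed: sliding window)."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # assumed: skip words containing N or other symbols
            counts[word] += 1
    # Fixed word order so histograms from different sequences are comparable.
    return [counts[w] for w in words]


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (assumed measure)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # convention chosen here: flat histograms carry no signal
    return cov / sqrt(vx * vy)


def correlation_matrix(sequences, k=3):
    """Pairwise correlation between the histograms of all input sequences."""
    hists = [kmer_histogram(s, k) for s in sequences]
    return [[pearson(a, b) for b in hists] for a in hists]


if __name__ == "__main__":
    # Toy sequences standing in for chromosomes; real inputs would be far longer.
    toy = ["AACGTTACGGA", "AACGTTACGGT", "TTTTTTTTTTT"]
    for row in correlation_matrix(toy, k=2):
        print(" ".join(f"{v:6.3f}" for v in row))
```

In this sketch each row and column of the resulting matrix corresponds to one input sequence, so a matrix built from whole chromosomes could be passed to clustering or visualization methods of the kind the abstract mentions.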

Full Text