Computational DNA sequence analysis.

Samuel Karlin, Lon R Cardon

doi:10.1146/annurev.mi.48.100194.003155

Abstract

This paper reviews several new developments in computer and statistical analysis of DNA and protein sequences. We present criteria and describe means for assessing and interpreting genomic inhomogeneities within and between sequences. These include: (a) characterizations of short oligonucleotide biases and general compositional tendencies; (b) molecular evolutionary reconstructions based on dinucleotide relative abundance distance measures and partial orderings; and (c) the application of r-scan statistics, quantile distributions, and score-based analyses to identify clustering, overdispersion, and excessive evenness in the distribution of a marker array along a sequence. These apply, for example, to restriction sites, microsatellite runs, regulatory motifs, and nucleosome placements. Furthermore, (d) the definition and determination of rare and frequent oligonucleotides and peptides provide another perspective on sequence heterogeneity, and (e) score methods are also applied to locating exons and genes. Most of the ideas and methods are illustrated with respect to bacteriophage genomes, to megabase amounts of several eukaryotic sequences, to a diverse collection of bacterial sets, to mitochondrial chromosomes, and to a broad assembly of viral genomes.
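
As an illustration of item (b), the following is a minimal sketch of a dinucleotide relative abundance profile and a distance between two such profiles. It assumes the commonly used odds-ratio form, rho_XY = f_XY / (f_X * f_Y), and an average absolute difference over the 16 dinucleotides; the abstract itself does not spell out these formulas. The function names are hypothetical, and the sketch works on a single strand only, without the reverse-complement symmetrization that is often applied to double-stranded DNA.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def relative_abundance(seq):
    """Return rho[XY] = f_XY / (f_X * f_Y) for all 16 dinucleotides.

    Assumption: seq contains only A, C, G, T and every base occurs
    at least once; a missing base yields NaN for its dinucleotides.
    """
    seq = seq.upper()
    n = len(seq)
    mono = Counter(seq)
    # Overlapping dinucleotides: positions 0..n-2.
    di = Counter(seq[i:i + 2] for i in range(n - 1))
    freq = {b: mono[b] / n for b in BASES}
    rho = {}
    for x, y in product(BASES, repeat=2):
        expected = freq[x] * freq[y]
        observed = di[x + y] / (n - 1)
        rho[x + y] = observed / expected if expected > 0 else float("nan")
    return rho


def delta_distance(seq_a, seq_b):
    """Average absolute difference between two relative abundance profiles."""
    rho_a = relative_abundance(seq_a)
    rho_b = relative_abundance(seq_b)
    return sum(abs(rho_a[k] - rho_b[k]) for k in rho_a) / 16
```

A value of rho_XY well above 1 suggests that the dinucleotide XY is over-represented relative to its base composition, and a value well below 1 suggests under-representation. Calling delta_distance on two sequences gives a single number that can be compared across pairs; meaningful estimates require sequences far longer than toy examples.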