Comparative Analysis of DNA Word Abundances in Four Yeast Genomes Using a Novel Statistical Background Model

Ramkumar Hariharan, M Radhakrishna Pillai, Reji Simon, Todd D Taylor

doi:10.1371/journal.pone.0058038

Open Access

Journal: PLoS ONE
Publication Date: Mar 5, 2013
Citations: 27
License type: CC BY 4.0

Affiliation: Rajiv Gandhi Centre for Biotechnology

Abstract

Previous studies have shown that the identification and analysis of both abundant and rare k-mers or “DNA words of length k” in genomic sequences using suitable statistical background models can reveal biologically significant sequence elements. Other studies have investigated the unimodal or multimodal distribution of k-mer abundances, or “k-mer spectra”, in different DNA sequences. However, the existing background models are affected to varying extents by compositional bias. Moreover, the distribution of k-mer abundances in the context of related genomes has not been studied previously. Here, we present a novel statistical background model for calculating k-mer enrichment in DNA sequences based on the average of the frequencies of the two (k-1)-mers for each k-mer. Comparison of our null model with the commonly used ones, including Markov models of different orders and the single mismatch model, shows that our method is more robust to AT-rich compositional bias and detects many additional, repeat-poor over-abundant k-mers that are biologically meaningful. Analysis of overrepresented genomic k-mers (4≤k≤16) from four yeast species using this model showed that the fraction of overrepresented DNA words falls linearly as k increases; however, a significant number of overabundant k-mers exists at higher values of k. Finally, comparative analysis of k-mer abundance scores across four yeast species revealed a mixture of unimodal and multimodal spectra for the various genomic sub-regions analyzed.

Highlights

The availability of completely sequenced genomes has made possible empirical, as opposed to the earlier theoretical, studies of the distributions of “DNA words” or “k-mers of length k” in genomic DNA sequences [1,2,3,4,5].
Because the detection of overrepresented k-mers with commonly used background models is biased toward AT-richness and repeat-rich motifs, we developed a novel statistical background model to calculate k-mer fold enrichment scores.
We carried out the following steps: (a) we first determined the number of occurrences of the two (k-1)-mers corresponding to each k-mer; (b) we computed the expected frequency of each k-mer by multiplying each (k-1)-mer frequency by that of the remaining nucleotide that makes up the k-mer; (c) the fold enrichment scores, F1 and F2 (one based on each (k-1)-mer), were calculated as the ratio of the observed number of occurrences of a k-mer to its expected number of occurrences; (d) the average of the two fold enrichment scores was taken as the fold enrichment score for each k-mer; and (e) the Z-score was calculated on the fold enrichment score.
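
Steps (a) to (e) above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it counts overlapping k-mers on a single strand, treats "frequency" as a relative frequency, scales the expected frequency by the number of k-mer windows to get an expected count, and standardizes the fold enrichment scores across all observed k-mers to obtain Z-scores. The function names (count_kmers, fold_enrichment, z_scores) are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def count_kmers(seq, k):
    """Count overlapping k-mers on one strand (assumption: no reverse complement)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def fold_enrichment(seq, k):
    """Return {kmer: F}, where F is the mean of the two (k-1)-mer based scores.

    Requires k >= 2. Only k-mers that occur in seq are scored.
    """
    seq = seq.upper()
    n = len(seq)
    n_k = n - k + 1        # number of k-mer windows
    n_sub = n - k + 2      # number of (k-1)-mer windows
    kmers = count_kmers(seq, k)
    sub = count_kmers(seq, k - 1)   # step (a): (k-1)-mer occurrences
    base = count_kmers(seq, 1)      # single-nucleotide occurrences

    scores = {}
    for kmer, observed in kmers.items():
        # Step (b): expected count from the prefix (k-1)-mer and the last base,
        # and from the first base and the suffix (k-1)-mer.
        exp1 = (sub[kmer[:-1]] / n_sub) * (base[kmer[-1]] / n) * n_k
        exp2 = (base[kmer[0]] / n) * (sub[kmer[1:]] / n_sub) * n_k
        # Step (c): fold enrichment relative to each expectation.
        f1 = observed / exp1
        f2 = observed / exp2
        # Step (d): average of the two scores.
        scores[kmer] = (f1 + f2) / 2
    return scores


def z_scores(scores):
    """Step (e), assumed here to be a standard score across all k-mers."""
    mu = mean(scores.values())
    sigma = pstdev(scores.values())
    if sigma == 0:
        return {kmer: 0.0 for kmer in scores}
    return {kmer: (f - mu) / sigma for kmer, f in scores.items()}
```

For example, z_scores(fold_enrichment(genome_sequence, 6)) would rank the observed 6-mers of a hypothetical genome_sequence string by enrichment. Whether the Z-score in the source is standardized across all k-mers of a given length, as assumed here, is not stated in this excerpt.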

Summary

Introduction

The availability of completely sequenced genomes has made possible empirical, as opposed to the earlier theoretical, studies of the distributions of “DNA words” or “k-mers of length k” in genomic DNA sequences [1,2,3,4,5]. Apart from a few recent studies [4,5], the vast majority of investigations in this area have attempted to analyze over- or underrepresented k-mers in different genomic regions. While a few of these studies have attempted to identify and catalog the set of missing elements (dubbed “nullomers”) in genomes [6,7,8], others have focused on detecting over-represented k-mers in select genomic regions for the identification of functional elements [9,10,11,12,13,14,15]. Different background models have been proposed for calculating k-mer distributions in random sequences. It has been noted that the existing background models have varying degrees of AT-rich compositional bias, i.e., the list of over-represented k-mers identified by each model is likely to contain significantly more AT-rich elements if the input genomic sequences are AT-rich, and vice versa.
