Abstract

Background

Each genome has a stable distribution of the combined frequency of each k-mer and its reverse complement, measured in sequence fragments as short as 1000 bp across the whole genome, for 1 < k < 6. The collection of these k-mer frequency distributions is unique to each genome and is termed the genome's barcode.

Results

We found that, for each genome, the majority of its short sequence fragments have highly similar barcodes, while sequence fragments with different barcodes typically correspond to genes that are horizontally transferred or highly expressed. This observation has led to new and more effective ways of addressing two challenging problems: metagenome binning and the identification of horizontally transferred genes. Our barcode-based metagenome binning algorithm substantially improves on the state of the art in both binning accuracy and scope of applicability. Other attractive properties of genome barcodes include (a) barcodes have different and identifiable characteristics for different classes of genomes, such as prokaryotes, eukaryotes, mitochondria and plastids, and (b) barcode similarities are generally proportional to the genomes' phylogenetic closeness.

Conclusion

These and other properties of genome barcodes make them a new and effective tool for studying numerous genome and metagenome analysis problems.
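The barcode idea described in the abstract can be illustrated with a minimal Python sketch. It assumes non-overlapping fragments, an A/C/G/T alphabet (k-mers containing other symbols are skipped), and frequencies normalized by the number of valid k-mers in each fragment. The function names (`canonical_kmers`, `fragment_frequencies`, `barcode`) and the defaults `k=4` and `fragment_len=1000` are hypothetical choices for illustration; the paper's exact windowing and normalization may differ.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Map a k-mer and its reverse complement to one shared key."""
    return min(kmer, reverse_complement(kmer))


def canonical_kmers(k):
    """Sorted list of one representative per k-mer/reverse-complement pair."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})


def fragment_frequencies(fragment, k):
    """Combined frequency of each k-mer and its reverse complement in a fragment."""
    counts = dict.fromkeys(canonical_kmers(k), 0)
    total = 0
    for i in range(len(fragment) - k + 1):
        kmer = fragment[i:i + k]
        if all(base in "ACGT" for base in kmer):  # skip N and other symbols
            counts[canonical(kmer)] += 1
            total += 1
    return {key: (n / total if total else 0.0) for key, n in counts.items()}


def barcode(sequence, k=4, fragment_len=1000):
    """One row per non-overlapping fragment, one column per combined k-mer."""
    sequence = sequence.upper()
    columns = canonical_kmers(k)
    rows = []
    for start in range(0, len(sequence) - fragment_len + 1, fragment_len):
        freqs = fragment_frequencies(sequence[start:start + fragment_len], k)
        rows.append([freqs[c] for c in columns])
    return rows
```

Each row of the returned matrix is one fragment's frequency profile. Under these assumptions, fragments whose rows deviate strongly from the bulk of the genome would be the candidates the abstract associates with horizontally transferred or highly expressed genes.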

Highlights

  • Each genome has a stable distribution of the combined frequency of each k-mer and its reverse complement, measured in sequence fragments as short as 1000 bp across the whole genome, for 1 < k < 6

  • The challenges of sorting out short genomic fragments generated by metagenome sequencing projects [1] pose a fundamental question: "Does each genome have a unique signature imprinted on its short sequence fragments, so that fragments from the same genome in a metagenome can be identified accurately?" A positive answer could have significant implications for many important genome and metagenome analysis problems, such as identifying genetic material transferred from other organisms [2] or through virus invasions [3,4], separating short sequence fragments generated by metagenome sequencing into individual genomes [5], and phylogenetic analysis of genomes [6]

  • ... distributions, and have observed that the di-nucleotide relative abundance, a di-mer frequency normalized by the mono-mer frequencies, is generally stable across a genome when measured on 50 kbp fragments [11,12,13] (BMC Bioinformatics 2008, 9:546, http://www.biomedcentral.com/1471-2105/9/546)
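The di-nucleotide relative abundance mentioned in the last highlight is a di-mer frequency divided by the product of its two mono-mer frequencies. The sketch below computes it on a single strand for simplicity; this is an assumption, since the cited studies may instead use frequencies symmetrized over both strands. The function name `dinucleotide_relative_abundance` is hypothetical.

```python
from collections import Counter


def dinucleotide_relative_abundance(seq):
    """Return rho[xy] = f_xy / (f_x * f_y) for all 16 di-nucleotides.

    Single-strand sketch: frequencies are not symmetrized with the
    reverse complement, which is an assumption of this example.
    """
    seq = seq.upper()
    mono = Counter(base for base in seq if base in "ACGT")
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in "ACGT" and seq[i + 1] in "ACGT"
    )
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    rho = {}
    for x in "ACGT":
        for y in "ACGT":
            if n_mono == 0 or n_di == 0 or mono[x] == 0 or mono[y] == 0:
                rho[x + y] = float("nan")  # undefined without both bases
                continue
            f_x = mono[x] / n_mono
            f_y = mono[y] / n_mono
            f_xy = di[x + y] / n_di
            rho[x + y] = f_xy / (f_x * f_y)
    return rho
```

A value near 1 means the di-nucleotide occurs about as often as its base composition alone would predict; values well above or below 1 indicate over- or under-representation.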

