Abstract

Most of the DNA viruses in the gastrointestinal tract are phages, which infect bacterial hosts. Despite phages being the most abundant organisms on Earth, as well as extremely active players in the global ecosystem, much remains unknown about how they function in their natural environments. Advances in whole genome sequencing technologies have generated a large collection of hundreds of phage genomes, allowing deep insight into the genetic evolution of phages, and metagenomics technologies seem to promise more rewarding glimpses into their life cycles and community structures. Recently, we developed an automated approach to assemble a collection of orthologous gene clusters of double-stranded DNA phages (phage orthologous groups, or POGs). This approach follows the well-known clusters of orthologous groups (COGs) framework to identify sets of orthologs by examining top-ranked sequence similarities between proteins in complete genomes without the use of arbitrary similarity cutoffs, and it thus represents a natural system for examining fast-evolving and slow-evolving proteins alike. This automated approach was designed to keep pace with the rapid and accelerating growth of whole genome information from sequencing projects. In particular, we employ a faster graph-theoretical COG-building algorithm that vastly improves our ability to deal with larger numbers of genomes (N) by reducing the worst-case complexity from O(N^6) to O(N^3 × log N). This system encompasses more than 2,000 groups from the almost 600 known phage genomes deposited at the National Center for Biotechnology Information and is in the process of being expanded to include single-stranded DNA phages and single- and double-stranded RNA phages.
Using this approach, we found that more than half of the POGs have no or very few evolutionary connections to their cellular hosts, indicating that these phages combine the ability to share and transduce the host genes with the ability to maintain a large fraction of unique, phage-specific, genes. Such genes are useful for targeted research strategies: for example, as diagnostic indicators and fundamental units of systems biology studies. We employed this set of phage-specific genes to probe the composition of several oceanic metagenomic samples. Although virus-enriched samples indeed contain more homologous matches to phage-specific POGs than a full metagenomic sample also containing cellular DNA, the total gene repertoire of the marine DNA virome is dramatically different from that of known phages. In particular, it is dominated by rare genes, many of which might be contained within viruslike entities such as cellular gene transfer agents rather than true viruses. This result might suggest the necessity of radically rethinking what constitutes the ‘virus world’, because the major component of (marine) viromes could be gene transfer agents that encapsidate bacterial and archaeal genes.
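
The ortholog-detection idea described in the abstract (top-ranked similarities between proteins of complete genomes, with no score cutoff) can be illustrated with a minimal sketch of the classic COG triangle procedure. This is not the authors' implementation, and it does not reproduce the faster graph-theoretical algorithm mentioned above. The function names, the `scores` dictionary of directed similarity scores, and the `genome_of` mapping are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def symmetric_best_hits(scores, genome_of):
    """Return protein pairs that are each other's top-ranked hit.

    scores: dict mapping (query, subject) -> similarity score (assumed input).
    genome_of: dict mapping protein id -> genome id (assumed input).
    No absolute score cutoff is applied; only the ranking matters.
    """
    best = {}  # (query, target genome) -> (subject, score)
    for (query, subject), score in scores.items():
        if genome_of[query] == genome_of[subject]:
            continue  # only hits between different genomes are considered
        key = (query, genome_of[subject])
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)

    edges = set()
    for (query, _), (subject, _) in best.items():
        reverse = best.get((subject, genome_of[query]))
        if reverse is not None and reverse[0] == query:
            edges.add(frozenset((query, subject)))
    return edges


def build_clusters(scores, genome_of):
    """Merge triangles of symmetric best hits that share a side."""
    edges = symmetric_best_hits(scores, genome_of)
    neighbors = defaultdict(set)
    for edge in edges:
        a, b = tuple(edge)
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Each edge joins two genomes, so every triangle spans three genomes.
    triangles = set()
    for edge in edges:
        a, b = tuple(edge)
        for c in neighbors[a] & neighbors[b]:
            triangles.add(frozenset((a, b, c)))

    # Union-find over triangles; two triangles merge when they share a side.
    parent = {t: t for t in triangles}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    side_owner = {}
    for t in triangles:
        for pair in combinations(sorted(t), 2):
            side = frozenset(pair)
            if side in side_owner:
                parent[find(t)] = find(side_owner[side])
            else:
                side_owner[side] = t

    clusters = defaultdict(set)
    for t in triangles:
        clusters[find(t)] |= t
    return sorted(sorted(members) for members in clusters.values())
```

Because membership depends only on which hit ranks first in each genome, weakly and strongly conserved proteins are treated alike, which is the property the abstract highlights. A pair of mutual best hits that never closes a triangle with a third genome is left out of every cluster in this sketch.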

Highlights

  • Despite a decrease in the rate of mortality due to diarrhea in the past few decades, diarrhea remains one of the leading causes of childhood deaths worldwide, especially in developing countries

  • Our simulation shows the following: first, a single-end 454 Jr Titanium run combined with a paired-end 454 Jr Titanium run may assemble about 90% of 100 genomes into

  • We evaluated the performance of ScaffViz on seven datasets of varying size and complexity

Introduction

Despite a decrease in the rate of mortality due to diarrhea in the past few decades, diarrhea remains one of the leading causes of childhood deaths worldwide, especially in developing countries. Recent genome-wide association studies (GWAS) have identified allele T of a single nucleotide polymorphism (SNP), rs2294008, in the prostate stem cell antigen (PSCA) gene as a risk factor for bladder cancer [1,2]. A recent GWAS of bladder cancer identified an SNP, rs11892031, within the UGT1A gene cluster on chromosome 2q37.1, as a novel risk factor. GWAS of human complex disease have identified a large number of disease-associated genetic loci, which are distinguished by distinctive frequencies of specific SNPs in individuals with a particular disease. These data do not provide direct information on the biological basis of a disease or on the underlying mechanisms. There may be multiple paths in the de Bruijn graph that can yield sequences with optical maps that match the genome’s optical map; in most cases, these paths all yield very similar sequences.

http://genomebiology.com/supplements/12/S1