Abstract

Statistical validation of gene clusters is imperative for many important applications in comparative genomics which depend on the identification of genomic regions that are historically and/or functionally related. We develop the first rigorous statistical treatment of max-gap clusters, a cluster definition frequently used in empirical studies. We present exact expressions for the probability of observing an individual cluster of a set of marked genes in one genome, as well as upper and lower bounds on the probability of observing a cluster of h homologs in a pairwise whole-genome comparison. We demonstrate the utility of our approach by applying it to a whole-genome comparison of E. coli and B. subtilis. Code for statistical tests is available at.

Highlights

  • There are many important applications in genomic comparison that require the identification of homologous regions

  • Our goal is to provide formal statistical models to test the hypothesis that two genomic regions in distantly related genomes share a common ancestor, against a null hypothesis of random gene order

  • We have presented the first rigorous statistical treatment of max-gap clusters, a definition that is frequently used in empirical studies (Blanc et al, 2003; Bourque et al, 2005; Chen et al, 2004; Friedman and Hughes, 2001; Luc et al, 2003; McLysaght et al, 2002; Overbeek et al, 1999; Simillion et al, 2002; Tamames, 2001; Vandepoele et al, 2002; Vision et al, 2000)


Introduction

There are many important applications in genomic comparison that require the identification of homologous regions. Researchers are interested in finding conserved groups of genes for identification of large-scale duplications (surveyed by Wolfe [2001]), reconstructing chromosomal rearrangements (surveyed by Sankoff [2003], and by Sankoff and Nadeau [2003]), and phylogenetic reconstruction (Blanchette et al, 1999; Cosner et al, 2000; Hannenhalli et al, 1995; Sankoff et al, 2000a, b; Tamames et al, 2001), as well as detecting operons, horizontal transfer, and functional selection in bacteria (surveyed by Chen et al [2004], Lawrence and Roth [1996], and Tamames [2001]). While a number of definitions have been proposed, we focus in the current work on the max-gap cluster, a definition that has emerged as perhaps the most popular in empirical studies.
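To make the max-gap idea concrete, the following is a minimal sketch, not the authors' code. It assumes the usual reading of the definition: marked genes are given as integer positions in gene order, and two consecutive marked genes belong to the same cluster when at most g unmarked genes lie between them. The function names, the parameter names (n, m, g), and the Monte Carlo estimator are illustrative assumptions; the paper derives exact expressions and bounds, which this simulation does not reproduce.

```python
import random


def max_gap_clusters(marked_positions, g):
    """Split marked gene positions into maximal max-gap clusters.

    Assumption: the gap between two marked genes is the number of
    unmarked genes strictly between them, and it must be <= g.
    """
    clusters = []
    for pos in sorted(set(marked_positions)):
        if clusters and pos - clusters[-1][-1] - 1 <= g:
            clusters[-1].append(pos)   # extends the current cluster
        else:
            clusters.append([pos])     # gap too large: start a new cluster
    return clusters


def estimate_single_cluster_probability(n, m, g, trials=10_000, seed=0):
    """Monte Carlo estimate of the chance that m marked genes, placed
    uniformly at random among n positions (random gene order), form one
    max-gap cluster. A simulation stand-in, not an exact formula.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        positions = rng.sample(range(n), m)
        if len(max_gap_clusters(positions, g)) == 1:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Positions 2, 3, 6 chain together with g = 2; the jump from 6 to 12 does not.
    print(max_gap_clusters([2, 3, 6, 12, 13], g=2))
    print(estimate_single_cluster_probability(n=1000, m=5, g=10))
```

A small estimated probability under random gene order would suggest that an observed cluster is unlikely to have arisen by chance, which is the kind of question the statistical tests described here address.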
