Abstract

Background: With rapid advances in genome sequencing and bioinformatics, it is now possible to generate phylogenetic trees containing thousands of operational taxonomic units (OTUs) from a wide range of organisms. However, use of rigorous tree-building methods on such large datasets is prohibitive and manual ‘pruning’ of sequence alignments is time consuming and raises concerns over reproducibility. There is a need for bioinformatic tools with which to objectively carry out such pruning procedures.

Findings: Here we present ‘TreeTrimmer’, a bioinformatics procedure that removes unnecessary redundancy in large phylogenetic datasets, alleviating the size effect on more rigorous downstream analyses. The method identifies and removes user-defined ‘redundant’ sequences, e.g., orthologous sequences from closely related organisms and ‘recently’ evolved lineage-specific paralogs. Representative OTUs are retained for more rigorous re-analysis.

Conclusions: TreeTrimmer reduces the OTU density of phylogenetic trees without sacrificing taxonomic diversity while retaining the original tree topology, thereby speeding up downstream computer-intensive analyses, e.g., Bayesian and maximum likelihood tree reconstructions, in a reproducible fashion.

Highlights

  • With rapid advances in genome sequencing and bioinformatics, it is possible to generate phylogenetic trees containing thousands of operational taxonomic units (OTUs) from a wide range of organisms

  • The parameter input file is used to up-weight or down-weight different taxonomic categories by specifying how many OTUs should be retained after the dereplication procedure

  • We developed a tree-based dereplication method for pruning redundant OTUs from phylogenetic datasets based on support values, branch lengths and taxonomic information linked to each sequence
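
The idea behind the highlights above can be illustrated with a small sketch. This is not TreeTrimmer's implementation: the `Node` class, the `retained` function, the `min_support` threshold, the `keep` quota table and the rule of keeping the OTUs closest to the clade root are all assumptions made for illustration. The sketch collapses a clade only when it is well supported and all of its OTUs share one taxonomic category, then keeps as many OTUs as the quota for that category allows.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """Hypothetical tree node; leaves have a name and a taxon, clades have children."""
    name: str = ""        # OTU label (empty for internal nodes)
    taxon: str = ""       # taxonomic category of a leaf
    length: float = 0.0   # length of the branch leading to this node
    support: float = 0.0  # support value of an internal node
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node, depth: float = 0.0) -> List[Tuple[Node, float]]:
    """Return (leaf, path length from the starting node) pairs below `node`."""
    if not node.children:
        return [(node, depth)]
    found = []
    for child in node.children:
        found.extend(leaves(child, depth + child.length))
    return found


def retained(node: Node, keep: Dict[str, int], min_support: float,
             default_keep: int = 1) -> List[str]:
    """Names of the OTUs kept after dereplication (assumed rule, see lead-in)."""
    if not node.children:
        return [node.name]
    members = leaves(node)
    taxa = {leaf.taxon for leaf, _ in members}
    # Collapse only well-supported clades drawn from a single category.
    if len(taxa) == 1 and node.support >= min_support:
        quota = keep.get(next(iter(taxa)), default_keep)
        members.sort(key=lambda pair: pair[1])  # shortest path to clade root first
        return [leaf.name for leaf, _ in members[:quota]]
    # Otherwise leave this node alone and examine each subtree.
    kept = []
    for child in node.children:
        kept.extend(retained(child, keep, min_support, default_keep))
    return kept


# Toy tree: one well-supported animal clade, one weakly supported fungal clade.
tree = Node(children=[
    Node(support=0.95, children=[
        Node("human", "Metazoa", 0.10),
        Node("chimp", "Metazoa", 0.02),
        Node("mouse", "Metazoa", 0.30),
    ]),
    Node(support=0.60, children=[
        Node("yeast1", "Fungi", 0.20),
        Node("yeast2", "Fungi", 0.10),
    ]),
    Node("arabidopsis", "Viridiplantae", 0.50),
])

# Quotas play the role of the parameter file: OTUs to retain per category.
kept = retained(tree, keep={"Metazoa": 2, "Fungi": 1}, min_support=0.9)
print(kept)  # ['chimp', 'human', 'yeast1', 'yeast2', 'arabidopsis']
```

In this toy run the animal clade is reduced from three OTUs to two, while the fungal clade is left untouched because its support falls below the assumed threshold. Raising a category's quota up-weights it and lowering the quota down-weights it.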

Introduction

With advances in high-throughput genome and transcriptome sequencing, it is possible to construct trees from nucleic acid and protein sequence alignments containing thousands of OTUs. Despite the obvious potential for improving our understanding of the history of modern-day organisms and their genomes, an important downside of this ‘embarrassment of riches’ is that many phylogenetic trees are produced using datasets that have been trimmed down to a ‘manageable’ size for methodological and/or presentation purposes [1,2]. A large number of similarity search hits are often retrieved iteratively and sorted, with the user manually retaining sequences from taxa of interest along with a few arbitrarily
