Abstract
Background: With well-assembled genomes available for a growing number of organisms, detecting whole genome duplication (WGD) with bioinformatic methods has become an active area of genomics. Most extant software for detecting the footprints of WGDs is restricted to well-assembled genomes. The large number of poor-quality genomes and the more accessible transcriptomes have been largely ignored, although in theory they could also be used to detect WGDs with the dS-based method. To address this, we designed WGDdetector, a simple and universal tool that detects WGDs from either genome or transcriptome annotations in different organisms, based on the widely used dS-based method.

Results: We constructed the WGDdetector pipeline, which integrates all analyses, including gene family construction and dS estimation and phasing, and outputs the dS values of each paralog pair with a single command. We chose four species (Arabidopsis thaliana, Juglans regia, Populus trichocarpa and Xenopus laevis), representing herbaceous plants, woody plants and animals, to test its practicability. For both genome and transcriptome data, our results agreed closely with those of previous studies.

Conclusion: WGDdetector is reliable and stable for genome data, and it also offers a new way to use transcriptome data to obtain the correct dS distribution for detecting WGDs. The source code is freely available and runs on Windows and Linux operating systems.
Highlights
With well-assembled genomes available for a growing number of organisms, detecting whole genome duplication (WGD) with bioinformatic methods has become an active area of genomics.
Genome and/or transcriptome datasets from four organisms were selected to evaluate the performance of WGDdetector: three plants (Arabidopsis thaliana, Juglans regia and Populus trichocarpa) and one frog (Xenopus laevis) (Table 1 and Additional file 1: Table S1).
After retaining the longest coding sequence (CDS) for each gene and removing CDSs with premature stop codons and protein sequences shorter than 50 amino acids (AA), a total of 27,301, 32,436, 39,410 and 41,073 genes remained in A. thaliana, J. regia, P. trichocarpa and X. laevis, respectively.
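The gene filtering criteria above (longest CDS per gene, no premature stop codons, protein length of at least 50 AA) can be illustrated with a minimal Python sketch. This is not WGDdetector's own code: the function names `translate` and `filter_cds`, the `(gene_id, cds)` input format, the use of the standard genetic code, and the choice to select the longest CDS before applying the other two filters are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# the organisms' nuclear genes use the standard table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a CDS in frame 1; unknown codons become 'X', stops become '*'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def filter_cds(records, min_aa=50):
    """Hypothetical helper: apply the three criteria to (gene_id, cds) pairs.

    Returns a dict mapping gene_id to the retained CDS.
    """
    # 1. Keep only the longest CDS for each gene.
    longest = {}
    for gene_id, cds in records:
        if gene_id not in longest or len(cds) > len(longest[gene_id]):
            longest[gene_id] = cds

    kept = {}
    for gene_id, cds in longest.items():
        protein = translate(cds)
        # 2. Drop CDSs with a stop codon anywhere before the final codon.
        if "*" in protein[:-1]:
            continue
        # 3. Drop proteins shorter than min_aa (terminal stop not counted).
        if len(protein.rstrip("*")) < min_aa:
            continue
        kept[gene_id] = cds
    return kept


# Toy input: g1 has two isoforms, g2 has a premature stop, g3 is too short.
records = [
    ("g1", "ATG" + "GCT" * 60 + "TAA"),
    ("g1", "ATG" + "GCT" * 10 + "TAA"),
    ("g2", "ATG" + "TAA" + "GCT" * 60 + "TAA"),
    ("g3", "ATG" + "GCT" * 20 + "TAA"),
]
kept = filter_cds(records)
```

In this toy input only `g1` survives, represented by its longer isoform. A real pipeline would read the CDSs from annotation files and might apply the filters in a different order.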
Summary
With well-assembled genomes available for a growing number of organisms, detecting whole genome duplication (WGD) with bioinformatic methods has become an active area of genomics. The large number of poor-quality genomes and the more accessible transcriptomes have been largely ignored, although in theory they could also be used to detect WGDs with the dS-based method. With a growing number of published draft genomes, two other methods are more suitable: one based on phylogenetics [4, 16] and one based on the distribution of synonymous substitutions per synonymous site (dS) between paralog pairs [17, 18]. For the former, WGDs are estimated from gene count data, where the number of gene copies in various gene families across a