Abstract

Background

Comparative studies using hundreds of sequences can give a detailed picture of the evolution of a given gene family. Nevertheless, retrieving only the sequences of interest from public databases can be difficult, particularly when working with highly divergent sequences. The difficulty increases substantially when the study is meant to include sequences from many (or less well studied) species whose genomes are unannotated or incompletely annotated.

Methodology/Principal Findings

In this work we evaluate the usefulness of different approaches to gene retrieval and classification, using the distal-less (DLX) gene family as a test case. Furthermore, we evaluate whether three measures together are enough to recover the accepted DLX evolutionary history: using a large number of gene sequences from a wide range of animal species, using multiple alternative alignments, and using only amino acids aligned with high confidence.

Conclusions/Significance

The canonical DLX homeobox gene sequence derived here, together with the characteristic amino acid variants identified here in the DLX homeodomain region, can be used to retrieve and classify DLX genes in a simple and efficient way. A program is made available that allows the easy retrieval of synteny information that can be used to classify gene sequences. Maximum likelihood trees using hundreds of sequences can be used for gene identification. Nevertheless, for the DLX case, the proposed DLX evolutionary history is not recovered even when multiple alignment algorithms are used.

Highlights

  • The first step is often to collect the sequences of interest from very divergent species from public databases using BLAST

  • The vast majority of DLX sequences compiled by ENSEMBL show the TQTQV amino acid motif, including one sequence from the nonbilaterian species Trichoplax adhaerens

  • Retrieving only the sequences of interest by BLAST from public databases can be a challenging problem when working with highly divergent species
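As a rough illustration of how a short characteristic motif such as TQTQV could be used as a first-pass screen on BLAST hits, the sketch below splits candidate protein sequences by whether they contain the motif. The function names and the toy sequences are hypothetical and are not taken from the paper. The sketch also assumes that an exact substring match is an adequate test, which may not hold for divergent sequences.

```python
def has_motif(sequence, motif="TQTQV"):
    """Return True if the ungapped, upper-cased sequence contains the motif."""
    cleaned = sequence.upper().replace("-", "")
    return motif in cleaned


def split_by_motif(records, motif="TQTQV"):
    """Split a {name: sequence} mapping into names with and without the motif."""
    with_motif, without_motif = [], []
    for name, sequence in records.items():
        if has_motif(sequence, motif):
            with_motif.append(name)
        else:
            without_motif.append(name)
    return with_motif, without_motif


# Toy, made-up sequences (NOT real DLX proteins), used only to show the call.
candidates = {
    "hit_1": "MKRTQTQVKIW",
    "hit_2": "MKRTQSQVKIW",
}
matching, non_matching = split_by_motif(candidates)
print(matching, non_matching)
```

A screen like this can only flag candidates. Sequences without the exact motif are not necessarily outside the family, so classification would still need phylogenetic or synteny evidence.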


Summary

Introduction

The first step is often to collect the sequences of interest from very divergent species from public databases using BLAST. When genes belong to large gene families and show homology with many different genes, retrieving only the sequences of interest can be difficult. To confirm the identity of the retrieved sequences, a phylogenetic approach should always be used. Different multiple sequence alignment (MSA) algorithms can produce different alignments, which in turn can influence the inferred phylogenetic reconstruction and lead to different conclusions. Golubchik et al. [3] showed that the absence of amino acid residues often leads to an incorrect placement of gaps in the alignments, even when the sequences were otherwise identical. Moreover, for a given alignment, not all amino acid positions will be aligned with equal confidence [4]. The difficulty increases substantially when the study is meant to include sequences from many (or less well studied) species whose genomes are unannotated or incompletely annotated.
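One simple way to make the idea of positions "aligned with equal confidence" concrete is to compare two alternative alignments of the same sequences and keep only the columns that both reproduce exactly. The sketch below does this with plain Python. It is an illustrative assumption about how such a filter could work, not the procedure used in the paper, and the function names and toy alignments are hypothetical.

```python
def column_signatures(alignment):
    """Describe each column by the residue index it holds for every sequence.

    `alignment` maps sequence names to equal-length gapped strings ('-' = gap).
    A gap is recorded as None, so a signature identifies exactly which
    residues a column places together.
    """
    names = sorted(alignment)
    length = len(alignment[names[0]])
    next_index = {name: 0 for name in names}
    signatures = []
    for col in range(length):
        signature = []
        for name in names:
            if alignment[name][col] == "-":
                signature.append(None)
            else:
                signature.append(next_index[name])
                next_index[name] += 1
        signatures.append(tuple(signature))
    return signatures


def consistent_columns(aln_a, aln_b):
    """Return the columns of aln_a that appear identically in aln_b."""
    seen_in_b = set(column_signatures(aln_b))
    kept = []
    for col, signature in enumerate(column_signatures(aln_a)):
        all_gaps = all(index is None for index in signature)
        if not all_gaps and signature in seen_in_b:
            kept.append(col)
    return kept


# Two made-up alignments of the same three toy sequences.
aln_a = {"s1": "MKT-QV", "s2": "MKTAQV", "s3": "M-TAQV"}
aln_b = {"s1": "MKTQ-V", "s2": "MKTAQV", "s3": "MT-AQV"}
print(consistent_columns(aln_a, aln_b))  # [0, 5]
```

In this toy case only the first and last columns are placed identically by both alignments. The middle columns differ in gap placement and would be the ones treated as low confidence.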
