Abstract

Background: Gene families are sets of structurally and evolutionarily related genes – in one or multiple species – that typically share a conserved biological function. As such, the identification and subsequent analysis of entire gene families is widely employed in evolutionary and functional genomics of both well-established and newly sequenced plant genomes. Currently, plant gene families are typically identified in one of two ways: 1) HMM-profile-based searches using models built on Arabidopsis thaliana genes, or 2) coding sequence homology searches using curated databases. Integrated databases containing functionally annotated genes and gene families have been developed for model organisms and several important crops; however, a comprehensive methodology for gene family annotation is currently lacking, preventing automated annotation of newly sequenced genomes.

Results: This paper proposes a combined measure of homology identification, motif conservation, phylogenomic and integrated gene expression analyses to define gene family structures in multiple plant species. The MAP3K gene families in seven plant species, including two currently unexamined species, Gossypium hirsutum and Zostera marina, were characterized to reveal new insights into their collective function and evolution and to demonstrate the effectiveness of our novel methodology.

Conclusion: Compared with recent reports, this methodology performs significantly better for the identification and analysis of gene family members in several monocot and dicot, diploid as well as polyploid, plant species.

Highlights

  • Gene families are sets of structurally and evolutionarily related genes – in one or multiple species – that typically share a conserved biological function

  • Cluster database construction: The proteomes of Arabidopsis thaliana, Glycine max, Gossypium raimondii, G. hirsutum, Solanum lycopersicum, Zea mays and Zostera marina were gathered from Phytozome and clustered into orthogroups of orthologs and recent paralogs by OrthoMCL, as described in Methods. 382,192 proteins from the seven proteomes were clustered into 40,524 orthogroups, excluding singletons; 63,913 unclustered proteins were appended to this dataset to generate a final set of 104,437 orthogroups

  • Our gene family definition method, which integrates ortholog clustering with profile Hidden Markov Model (HMM) homology searches, was in very good agreement with previous large-scale studies on defining gene families in plants
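
The bookkeeping behind the cluster database highlight can be sketched in a few lines of Python. This is only an illustration of appending unclustered proteins as single-member groups, not the authors' pipeline: the function name `append_singletons`, the `SINGLE…` identifiers, and the toy protein IDs are all assumptions made for the example. The last line simply checks that the reported counts add up (40,524 + 63,913 = 104,437).

```python
def append_singletons(clusters, all_proteins):
    """Return a table of orthogroups in which every protein appears once.

    clusters     -- dict mapping orthogroup ID to a list of protein IDs
                    (multi-member groups, e.g. from a clustering tool)
    all_proteins -- iterable of every protein ID in the input proteomes
    """
    clustered = {p for members in clusters.values() for p in members}
    unclustered = sorted(set(all_proteins) - clustered)

    table = {gid: list(members) for gid, members in clusters.items()}
    # Hypothetical naming scheme for single-member groups.
    for i, protein in enumerate(unclustered, start=1):
        table[f"SINGLE{i:06d}"] = [protein]
    return table


# Toy input with made-up protein IDs (not real gene models).
toy_clusters = {"OG1": ["At_a", "Gm_a", "Zm_a"], "OG2": ["At_b", "Gm_b"]}
toy_proteins = ["At_a", "Gm_a", "Zm_a", "At_b", "Gm_b", "Zosma_x", "Gh_y"]
toy_table = append_singletons(toy_clusters, toy_proteins)
print(len(toy_table))  # 2 clustered groups + 2 singletons = 4

# Consistency check on the figures quoted in the highlight.
reported_total = 40_524 + 63_913
print(reported_total)  # 104437
```

Keeping singletons in the same table means a later homology search can return a family member even when that protein has no ortholog or recent paralog among the other proteomes.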


Summary

Introduction

Gene families are sets of structurally and evolutionarily related genes – in one or multiple species – that typically share a conserved biological function. Instead of relying on individual sequences to query a database, HMM-based searches build a single probabilistic model of an entire gene family using a collection of previously validated sequences. Although both methods work well at identifying complete gene families, they require extensive manual curation steps in which hits are filtered to remove sequences that lack conserved sequence motifs or functional domains. While online databases such as Phytozome, PLAZA, and GreenPhylDB have been described as the highest-performing gene family identification tools currently available [13], they often include erroneously identified sequence hits, lack the annotations necessary for accurate gene family identification, or exclude many newly sequenced species from analysis.
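
The curation step described above, in which hits lacking a conserved domain are discarded, can be illustrated with a minimal Python sketch. It assumes the domain annotations for each hit are already available as a plain dictionary; the function name `filter_hits`, the hit IDs, and the choice of required domain are hypothetical and are not taken from the paper.

```python
def filter_hits(hits, required_domains):
    """Split candidate hits by whether they carry every required domain.

    hits             -- dict mapping protein ID to a collection of domain
                        accessions annotated on that protein
    required_domains -- domain accessions a genuine family member must have
    Returns (kept, dropped) as two lists of protein IDs in input order.
    """
    required = set(required_domains)
    kept, dropped = [], []
    for protein_id, domains in hits.items():
        if required <= set(domains):  # all required domains are present
            kept.append(protein_id)
        else:
            dropped.append(protein_id)
    return kept, dropped


# Made-up candidate hits; the accessions are used only as example labels.
candidate_hits = {
    "hit_1": {"PF00069", "PF00564"},  # has the required domain
    "hit_2": {"PF00564"},             # lacks it, so it is dropped
    "hit_3": set(),                   # no annotated domains at all
}
kept, dropped = filter_hits(candidate_hits, required_domains={"PF00069"})
print(kept)     # ['hit_1']
print(dropped)  # ['hit_2', 'hit_3']
```

A filter of this kind automates only the mechanical part of curation; deciding which motifs or domains define a family still depends on prior knowledge of that family.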
