Abstract

The identification of genes associated with a given biological function in plants remains a challenge, although network-based gene prioritization algorithms have been developed for Arabidopsis thaliana and many non-model plant species. These network-based algorithms have encountered several problems, in particular unsatisfactory prediction accuracy due to limited network coverage, varying link quality, and/or uncertain network connectivity. A model that integrates complementary biological data may therefore be expected to increase the prediction accuracy of gene prioritization. Toward this goal, we developed a novel gene prioritization method named RafSee to rank candidate genes using a random forest algorithm that integrates sequence, evolutionary, and epigenetic features of plants. Subsequently, we proposed an integrative approach named RAP (Rank Aggregation-based data fusion for gene Prioritization), in which an order statistics-based meta-analysis was used to aggregate the ranks produced by the network-based gene prioritization method and by RafSee, in order to accurately prioritize candidate genes involved in a pre-specified biological function. Finally, we showcased the utility of RAP by prioritizing 380 flowering-time genes in Arabidopsis. The “leave-one-out” cross-validation experiment showed that RafSee could work as a complement to a current state-of-the-art network-based gene prioritization system (AraNet v2). Moreover, RAP ranked 53.68% (204/380) of the flowering-time genes higher than AraNet v2 did, resulting in a 39.46% improvement in terms of the first quartile rank. Further evaluations also showed that RAP was effective in prioritizing genes related to different abiotic stresses. To enhance the usability of RAP for Arabidopsis and non-model plant species, an R package implementing the method is freely available at http://bioinfo.nwafu.edu.cn/software.
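
The rank aggregation step described above can be illustrated with a small sketch. This is not the RAP implementation (which is distributed as an R package); it is a minimal Python example, assuming the common order-statistics formulation in which each gene's two ranks are converted to rank ratios in (0, 1] and scored by the joint probability that two uniform order statistics fall at or below them. For two sources with sorted ratios r1 <= r2, that probability is Q = 2*r1*r2 - r1^2, and smaller Q means a gene ranked consistently high. The function names, gene identifiers, and toy ranks are all hypothetical.

```python
def q_statistic(ratio_a, ratio_b):
    """Joint order-statistic probability for two rank ratios in (0, 1].

    Assumed formulation: P(U_(1) <= r1, U_(2) <= r2) = 2*r1*r2 - r1**2
    for two independent uniform variables, with r1 <= r2.
    """
    lo, hi = sorted((ratio_a, ratio_b))
    return 2.0 * lo * hi - lo * lo


def aggregate(network_ranks, feature_ranks):
    """Combine two rankings (gene -> 1-based rank) into one ordered gene list.

    Assumes both dictionaries rank the same genes and contain no ties.
    """
    n = len(network_ranks)
    scores = {
        gene: q_statistic(network_ranks[gene] / n, feature_ranks[gene] / n)
        for gene in network_ranks
    }
    # Smaller Q means stronger joint evidence, so sort ascending.
    return sorted(scores, key=scores.get)


# Hypothetical toy input: ranks from a network-based method and a
# feature-based method for four placeholder genes.
network = {"g1": 1, "g2": 2, "g3": 3, "g4": 4}
feature = {"g1": 4, "g2": 1, "g3": 2, "g4": 3}
ranking = aggregate(network, feature)
print(ranking)
```

In this toy input, g2 comes out on top because it ranks well in both lists, whereas g1 is penalized for its poor rank in the second list even though it is first in the first list.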

Highlights

  • A major challenge in plant biology is to identify the most promising genes from large lists of candidate genes to find those which play an important role in an agricultural trait or a complex biological process (Lee et al, 2010; Li et al, 2015; Sabaghian et al, 2015)

  • To identify candidate flowering-time genes, we used seed genes. These are a set of genes with a known function in flowering-time control, collected from four different sources: (1) 293 flowering-time genes annotated in WikiPathways, which is an open, collaborative platform for the curation of pathways by researchers in the entire biology community; (2) 293 flowering-time genes collected by Zhu et al (2011), according to the annotation related to flowering-related traits in The Arabidopsis Information Resource (TAIR) database (TAIR10; version 10; https://www.arabidopsis.org); (3) 406 flowering-time genes manually collected from the literature by Chen et al (2012); (4) 174 flowering-time genes collected co-expression, protein homology, etc. (Jensen et al, 2009)

  • We detected significant differences in the hydrophilicity and hydrophobicity patterns of protein sequences, corresponding to six amphiphilic pseudo amino acid composition (APAAC)-related features; these included the third-order factor in terms of hydrophilicity of amino acids, and the first-order correlation factor, the second-order correlation factor, and so on up to the fifth-order factor in terms of hydrophobicity of amino acids

Introduction

A major challenge in plant biology is to identify the most promising genes from large lists of candidate genes (e.g., all genes in the whole genome) to find those which play an important role in an agricultural trait or a complex biological process (Lee et al, 2010; Li et al, 2015; Sabaghian et al, 2015). Gene prioritization was first developed to identify disease-associated human genes within a multigene locus identified by a positional genetic study (Perez-Iratxeta et al, 2002). This application was subsequently expanded to studies that generate candidate genes from the whole genome using genome-wide association analyses and “-omics” experiments (Moreau and Tranchevent, 2012). A number of computational approaches and bioinformatics tools have been developed to prioritize disease-related human genes using various data sources such as scientific texts, protein-protein interactions, and functional annotations or pathways (Tranchevent et al, 2011; Moreau and Tranchevent, 2012). To the best of our knowledge, none of these approaches and tools designed for human studies can be directly applied to the gene prioritization problem in plants.
