Abstract

Orthology detection still requires algorithms that scale more effectively. Unsupervised algorithms have combined alignment, synteny, evolutionary distances and protein interactions to improve effectiveness, while many available databases concentrate on the scaling problem. In this paper, a set of gene pair features based on similarity measures, such as alignment scores, sequence length, gene membership to conserved regions and physicochemical profiles, is combined in a supervised Pairwise Ortholog Detection (POD) approach to improve effectiveness given the low ratio of orthologs to all possible pairwise comparisons between two genomes. In this POD scenario, big data supervised classifiers that manage the imbalance between the ortholog and non-ortholog pair classes allow for an effective scaling solution that is built from two genomes and extended to other genome pairs. The supervised approach for POD was compared with the Reciprocal Best Hits (RBH), Reciprocal Smallest Distance (RSD) and a Comprehensive, Automated Project for the Identification of Orthologs from Complete Genome Data (OMA) algorithms, using the (i) Saccharomyces cerevisiae - Kluyveromyces lactis, (ii) Saccharomyces cerevisiae - Candida glabrata and (iii) Saccharomyces cerevisiae - Schizosaccharomyces pombe yeast genome pairs as benchmark datasets. Four datasets, each built with different alignment settings, were derived from each genome pair comparison. Because of the large number of instances (gene pairs) and the data imbalance, building and testing the supervised model was only possible with big data supervised classifiers that manage imbalance. Evaluation metrics that take low ortholog ratios into account were applied.
In terms of effectiveness, MapReduce Random Oversampling combined with Spark Support Vector Machines outperformed RBH, RSD and OMA, probably because it considers gene pair features beyond alignment similarities and draws on advances in big data supervised classification.
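The random oversampling idea named above can be illustrated on a single machine. The following is a minimal sketch, not the MapReduce implementation used in the paper: the function name `random_oversample`, the toy feature vectors and the label coding (1 for ortholog pairs, 0 for non-ortholog pairs) are all assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(features, labels, minority_label=1, seed=0):
    """Duplicate randomly chosen minority instances until both classes have equal size.

    Single-machine illustration only; the label coding is an assumption.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    # Nothing to do if the minority class is absent or already large enough.
    if not minority or len(minority) >= len(majority):
        return list(features), list(labels)
    # Sample minority indices with replacement to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    indices = list(range(len(labels))) + extra
    rng.shuffle(indices)
    return [features[i] for i in indices], [labels[i] for i in indices]


# Hypothetical gene pair feature vectors: one ortholog pair among six.
pairs = [[0.9, 0.8], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.2, 0.4], [0.1, 0.1]]
labels = [1, 0, 0, 0, 0, 0]

balanced_x, balanced_y = random_oversample(pairs, labels)
print(Counter(balanced_y))
```

Oversampling would normally be applied to the training set only, so that the test set keeps the original low ortholog ratio.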

Highlights

  • Ortholog detection (OD) algorithms should distinguish orthologous genes from other types of homologs such as paralogs evolving from a [...]. [...] this increase in proteome data brings out the need to work out efficient but effective OD algorithms

  • The process separates the pairs into train and test sets and calculates [...]. [...] of each genome pair were built from combinations of alignment parameter settings

  • By applying evaluation metrics such as G-mean, AUC and the balance between TPRate and TNRate, our results show that gene pairwise feature combinations provide excellent pairwise OD (POD) in a big data supervised scenario that considers data imbalance
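To show why metrics of this kind suit low ortholog ratios, here is a minimal sketch that computes TPRate, TNRate and G-mean from predicted labels. It assumes the usual definition of G-mean as the geometric mean of TPRate and TNRate; the function name `imbalance_metrics` and the toy labels are hypothetical, and AUC is left out because it needs classifier scores rather than hard labels.

```python
import math


def imbalance_metrics(y_true, y_pred, positive=1):
    """Return TPRate, TNRate and G-mean for a binary labelling."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    tp_rate = tp / (tp + fn) if (tp + fn) else 0.0
    tn_rate = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "tp_rate": tp_rate,
        "tn_rate": tn_rate,
        "g_mean": math.sqrt(tp_rate * tn_rate),
    }


# Hypothetical test set: 2 ortholog pairs (1) among 10 gene pairs.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# A classifier that labels every pair as non-ortholog.
all_negative = imbalance_metrics(y_true, [0] * 10)

# A classifier that finds one ortholog and raises one false alarm.
mixed = imbalance_metrics(y_true, [1, 0, 0, 0, 0, 0, 0, 0, 0, 1])

print(all_negative)
print(mixed)
```

Both toy classifiers label 8 of the 10 pairs correctly, so plain accuracy cannot separate them, whereas G-mean drops to zero for the one that never predicts an ortholog.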

Summary

Introduction

Ortholog detection (OD) algorithms should distinguish orthologous genes from other types of homologs such as paralogs evolving from a [...]. [...] this increase in proteome data brings out the need to work out efficient but effective OD algorithms. [...] computational demands in sequence analyses is not met by an increase in computational capacities but rather calls for new approaches or algorithmic implementations [4]. Many unsupervised graph-based approaches have been developed to identify orthologs, resulting in corresponding repositories [...].
