Abstract

Methods for pairwise ortholog detection (POD) usually rely on alignment-based (AB) similarity measures. However, AB algorithms are still limited in POD, since they may fail in the presence of certain evolutionary and genetic events. In this sense, POD remains an open field in bioinformatics that demands either constant improvement of existing methods or new algorithms that scale effectively to Big Data. In a previous paper, we developed a Big Data supervised POD approach that considers several AB pairwise gene features and the low ortholog pair ratios found between two proteomes (Galpert, del Río et al. 2015). Although our supervised POD models achieved higher sensitivity than classical POD methodologies when comparatively evaluated on the Saccharomycete yeast benchmark dataset (Salichos and Rokas 2011), they were implemented in the MapReduce framework and tested on a single yeast genome pair. In (Galpert, Fernández et al. 2018) (https://doi.org/10.1186/s12859-018-2148-8), we propose some improvements to our supervised POD approach by i) surveying the incorporation of alignment-free pairwise similarity measures, ii) evaluating other classifiers under the Big Data Spark platform, and iii) extending the test set to other related Saccharomycete yeast proteomes.

Highlights

  • The development of new ortholog detection algorithms and the improvement of existing ones are of major importance in functional genomics

  • The same measures for Reciprocal Best Hits (RBH), Reciprocal Smallest Distance (RSD) and Orthologous MAtrix (OMA) are included in this table

  • This traditional ortholog detection method outperformed most of the supervised algorithms built with alignment-free features, except when Random Oversampling (ROS) pre-processing (100% resampling) was applied to Spark Decision Trees in ScerCgla (AUC = 0.9496)


Summary

Introduction

The development of new ortholog detection algorithms and the improvement of existing ones are of major importance in functional genomics. Although several pairwise protein features were previously combined in a supervised big data approach, all of them were, to some extent, alignment-based, and the proposed algorithms were evaluated on a single test set. We aim to evaluate the impact of alignment-free features on the performance of supervised models implemented in the Spark big data platform for pairwise ortholog detection in several related yeast proteomes. High sequence similarity might occur because of convergent evolution or the chance matching of unrelated short sequences. Such sequences are similar but not homologous [2]. These pitfalls of homology detection based on sequence similarity are the grounds of the methods known as “alignment-free methods” [4, 5].
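To make the idea of an alignment-free pairwise feature concrete, the following is a minimal sketch that compares two protein sequences by the cosine similarity of their k-mer count vectors, without computing any alignment. It is a generic illustration only: the choice of measure, the function names `kmer_counts` and `kmer_cosine_similarity`, the default `k = 2`, and the toy sequences are all assumptions, and this is not necessarily one of the measures evaluated in the paper.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq: str, k: int = 2) -> Counter:
    """Count overlapping k-mers in a sequence (hypothetical helper)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_cosine_similarity(a: str, b: str, k: int = 2) -> float:
    """Cosine similarity between the k-mer count vectors of two sequences.

    Illustrative alignment-free measure; assumed here, not taken from the paper.
    Returns 0.0 when either sequence is shorter than k.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    # Only k-mers present in both sequences contribute to the dot product.
    dot = sum(ca[m] * cb[m] for m in ca.keys() & cb.keys())
    norm_a = sqrt(sum(v * v for v in ca.values()))
    norm_b = sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy sequences, invented for illustration.
    print(kmer_cosine_similarity("MKTAYIAKQR", "MKTAYIGKQR"))
    print(kmer_cosine_similarity("MKTAYIAKQR", "GGGSSSGGGS"))
```

In a supervised setting such as the one described above, a value like this would be computed for each candidate gene pair and used as one feature alongside others; how the features are actually combined is not specified in this section.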

