An Effective Big Data Supervised Imbalanced Classification Approach for Ortholog Detection in Related Yeast Species.

Deborah Galpert, Agostinho Antunes, Sara Del Río, Guillermin Agüero-Chapin, Francisco Herrera, Evys Ancede-Gallardo

BioMed Research International. doi:10.1155/2015/748681

Open Access

https://doi.org/10.1155/2015/748681

Abstract

Orthology detection requires more effective scaling algorithms. In this paper, a set of gene pair features based on similarity measures (alignment scores, sequence length, gene membership in conserved regions, and physicochemical profiles) is combined in a supervised pairwise ortholog detection approach to improve effectiveness, given the low ratio of orthologs to all possible pairwise comparisons between two genomes. In this scenario, big data supervised classifiers that manage the imbalance between the ortholog and nonortholog pair classes allow for an effective scaling solution built from two genomes and extended to other genome pairs. The supervised approach was compared with the RBH, RSD, and OMA algorithms, using the yeast genome pairs Saccharomyces cerevisiae-Kluyveromyces lactis, Saccharomyces cerevisiae-Candida glabrata, and Saccharomyces cerevisiae-Schizosaccharomyces pombe as benchmark datasets. Because of the large amount of imbalanced data, building and testing the supervised model was possible only with big data supervised classifiers that manage imbalance. Evaluation metrics that take low ortholog ratios into account were applied. In terms of effectiveness, MapReduce Random Oversampling combined with Spark SVM outperformed RBH, RSD, and OMA, probably because it considers gene pair features beyond alignment similarities and benefits from advances in big data supervised classification.
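Random oversampling, named in the abstract, can be illustrated on a toy scale. The sketch below is a single-machine, plain-Python illustration of the general idea (duplicating randomly chosen minority-class rows until the classes are balanced); it is not the MapReduce implementation used in the paper, and the function name, feature values, and labels are hypothetical.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    majority_size = max(counts.values())
    out_x, out_y = list(features), list(labels)
    for cls, n in counts.items():
        # Indices of the original rows belonging to this class.
        idx = [i for i, y in enumerate(labels) if y == cls]
        for _ in range(majority_size - n):
            j = rng.choice(idx)
            out_x.append(features[j])
            out_y.append(labels[j])
    return out_x, out_y


# Hypothetical gene pair feature vectors: 1 = ortholog pair, 0 = nonortholog pair.
X = [[0.9, 0.8], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.15, 0.25]]
y = [1, 0, 0, 0, 0]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Oversampling should be applied to the training data only, so that the test set keeps the original low ortholog ratio.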

Highlights

Orthologs are defined as genes in different species that descend by speciation from the same gene in the last common ancestor [1].
Focusing on the graph-based approach, orthogroups are generally built from the comparison of genome pairs by using BLAST searches [16] and the application of some “nearest neighbor” heuristics such as Best BLAST Hit (BeT) [2], Bidirectional Best Hit (BBH) [17], Reciprocal Best Hits (RBH) [18], Reciprocal Smallest Distance (RSD) [19], or Best Unambiguous Subset (BUS) [20] to find potential pairwise orthology relationships.
For the evaluation of pairwise ortholog detection (POD) algorithms in related yeast genomes, we ran two experiments. In Experiment 1, we evaluated the algorithms inside a genome by randomly partitioning the complete set of pairs into 75% for training and 25% for testing. In Experiment 2, we built the model from one genome pair and tested it on two different pairs.
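The random 75%/25% partition described for Experiment 1 can be sketched as follows. This is a minimal illustration that assumes a plain shuffle-and-cut split; the text does not say whether the authors stratified by class, and the function name and gene identifiers are hypothetical.

```python
import random


def split_pairs(pairs, train_fraction=0.75, seed=0):
    """Shuffle the gene pairs and cut them into training and test subsets."""
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical complete set of pairs between two tiny "genomes" (4 x 5 genes).
all_pairs = [(f"geneA{i}", f"geneB{j}") for i in range(4) for j in range(5)]
train, test = split_pairs(all_pairs)
print(len(train), len(test))
```

With heavily imbalanced classes, a plain random split can leave very few ortholog pairs in the test set, which is one reason a stratified split is often preferred in practice.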

Summary

Introduction

Orthologs are defined as genes in different species that descend by speciation from the same gene in the last common ancestor [1]. Focusing on the graph-based approach, orthogroups are generally built from the comparison of genome pairs by using BLAST searches [16] and the application of some “nearest neighbor” heuristics such as Best BLAST Hit (BeT) [2], Bidirectional Best Hit (BBH) [17], Reciprocal Best Hits (RBH) [18], Reciprocal Smallest Distance (RSD) [19], or Best Unambiguous Subset (BUS) [20] to find potential pairwise orthology relationships. Algorithms can return pairwise relationships if they perform pairwise ortholog detection (POD), as do RBH [18], RSD [19], and Comprehensive, Automated Project for the Identification of Orthologs from Complete Genome Data (OMA) Pairwise [21], or they can apply clustering to predict orthogroups from the score of the alignment process.
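As an illustration of the Reciprocal Best Hits heuristic mentioned above, the sketch below keeps a gene pair only when each gene is the other's top-scoring hit. It assumes alignment scores are already available as dictionaries keyed by (query, subject); the function names, gene identifiers, and score values are hypothetical, and ties are resolved by keeping the first hit seen, which real tools may handle differently.

```python
def best_hits(scores):
    """Map each query gene to its top-scoring subject gene."""
    best = {}
    for (query, subject), score in scores.items():
        # Strict comparison: on a tie, the first hit seen is kept.
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return the (a, b) pairs in which a and b are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical alignment scores for genome A searched against genome B, and back.
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 90.0,
          ("a2", "b2"): 180.0, ("a3", "b2"): 200.0}
b_vs_a = {("b1", "a1"): 245.0, ("b2", "a3"): 198.0,
          ("b2", "a2"): 175.0, ("b3", "a3"): 60.0}

rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh_pairs)  # [('a1', 'b1'), ('a3', 'b2')]
```

Here a2 has b2 as its best hit, but b2's best hit is a3, so (a2, b2) is not reciprocal and is dropped.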
