Recombination spot identification based on gapped k-mers

Rong Wang, Bin Liu, Yong Xu

doi:10.1038/srep23934

Open Access

Abstract

Recombination is crucial for biological evolution because it provides many new combinations of genetic diversity. Accurate identification of recombination spots is useful for studying DNA function. To improve prediction accuracy, researchers have proposed several computational methods for recombination spot identification. The k-mer feature is one of the most useful features for modeling the properties and function of DNA sequences. However, it suffers from an inherent limitation: if the word length k is large, the occurrence count of each k-mer is close to a binary variable, with a few k-mers present once and most k-mers absent. This usually causes a sparsity problem and reduces classification accuracy. To solve this problem, we add gaps into k-mers and introduce a new feature called the gapped k-mer (GKM) for identification of recombination spots. Using this feature, we present a new predictor called SVM-GKM, which combines gapped k-mers and a Support Vector Machine (SVM) for recombination spot identification. Experimental results on a widely used benchmark dataset show that SVM-GKM outperforms other highly related predictors. Therefore, SVM-GKM would be a powerful predictor for computational genomics.
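The sparsity issue and the gapped k-mer idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes one common definition of a gapped k-mer (a window of length l in which k positions are kept and the remaining l - k positions are treated as gaps), and the function names, the parameters `l` and `k`, and the `-` gap symbol are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations


def kmer_counts(seq, k):
    """Count every contiguous k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_occupancy(seq, k, alphabet_size=4):
    """Fraction of all possible k-mers that occur at least once in seq.

    A value near 0 means the k-mer feature vector is almost entirely zeros.
    """
    return len(kmer_counts(seq, k)) / alphabet_size ** k


def gapped_kmer_counts(seq, l, k, gap="-"):
    """Count gapped k-mers under the assumed definition.

    Each window of length l contributes one pattern per way of choosing
    k informative positions; the other l - k positions become `gap`.
    """
    if not 0 < k <= l:
        raise ValueError("require 0 < k <= l")
    masks = [set(m) for m in combinations(range(l), k)]
    counts = Counter()
    for i in range(len(seq) - l + 1):
        window = seq[i:i + l]
        for keep in masks:
            pattern = "".join(
                base if j in keep else gap for j, base in enumerate(window)
            )
            counts[pattern] += 1
    return counts


if __name__ == "__main__":
    seq = "ACGTACGT"  # toy sequence, not data from the paper
    print(kmer_occupancy(seq, 2), kmer_occupancy(seq, 6))
    print(gapped_kmer_counts(seq, l=3, k=2).most_common(3))
```

Because several distinct l-mers map to the same gapped pattern, counts are pooled across them, which is one way to keep longer-range composition information without the near-binary counts of long contiguous k-mers. Feature vectors built this way could then be passed to any SVM implementation.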

Highlights

Recombination is crucial for biological evolution because it provides many new combinations of genetic diversity.
Recombination, the exchange of genetic information that takes place in each generation in diploid organisms, plays an important role in genetic evolution[1].
Recombination provides many new combinations of genetic variations and is an important source of biodiversity[2,3,4], which can accelerate the process of biological evolution.

Summary

Introduction

Recombination is crucial for biological evolution because it provides many new combinations of genetic diversity. To improve prediction accuracy, researchers have proposed several computational methods for recombination spot identification. Liu et al.[17] have exploited quadratic discriminant analysis to predict hot or cold spots. These methods only consider the local sequence composition information, and ignore all the long-range or global sequence-order effects. Some computational predictors employ these features, and achieve better performance. All of these computational methods yield encouraging results, and each has played a role in stimulating the development of recombination spot identification. To find a tradeoff between the sparse feature space problem and capturing more sequence composition information, the gapped k-mer has been proposed and successfully applied to enhancer identification[33,34].

Correspondence and requests for materials should be addressed to B.L. (email: bliu@insun.hit.edu.cn).

Methods

Results

Conclusion