String kernels for protein sequence comparisons: improved fold recognition

Saghi Nojoomi, Patrice Koehl

doi:10.1186/s12859-017-1560-9

Open Access

Abstract

Background: The amino acid sequence of a protein is the blueprint from which its structure and ultimately function can be derived. Therefore, sequence comparison methods remain essential for the determination of similarity between proteins. Traditional approaches for comparing two protein sequences begin with strings of letters (amino acids) that represent the sequences, before generating textual alignments between these strings and providing scores for each alignment. When the similarity between the two protein sequences to be compared is low, however, the quality of the corresponding sequence alignment is usually poor, leading to poor performance for the recognition of similarity.

Results: In this study, we develop an alignment-free alternative to these methods that is based on the concept of string kernels. Starting from recently proposed kernels on the discrete space of protein sequences (Shen et al., Found. Comput. Math., 2013, 14:951-984), we introduce our own version, SeqKernel. Its implementation depends on two parameters: a coefficient that tunes the substitution matrix and the maximum length of k-mers that it includes. We provide an exhaustive analysis of the impacts of these two parameters on the performance of SeqKernel for fold recognition. We show that with the right choice of parameters, use of the SeqKernel similarity measure improves fold recognition compared to the use of traditional alignment-based methods. We illustrate the application of SeqKernel to inferring phylogeny on RNA polymerases and show that it performs as well as methods based on multiple sequence alignments.

Conclusion: We have presented and characterized a new alignment-free method based on a mathematical kernel for scoring the similarity of protein sequences. We discuss possible improvements of this method, as well as an extension of its applications to other modeling methods that rely on sequence comparison.
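To make the two parameters mentioned in the abstract concrete, here is a minimal Python sketch of a generic k-mer string kernel. It is not the SeqKernel implementation: the function names, the parameter names `beta` (exponent applied to the substitution score) and `kmax` (maximum k-mer length), and the toy substitution score (1 for identical residues, 0.5 otherwise) are all assumptions made for illustration. A real application would presumably use an actual amino acid substitution matrix.

```python
import math

def residue_kernel(a, b, beta, mismatch=0.5):
    """Toy residue-level similarity raised to the power beta.

    Assumption: 1.0 for identical residues and `mismatch` otherwise stands in
    for a real substitution matrix, which this abstract does not specify.
    """
    base = 1.0 if a == b else mismatch
    return base ** beta

def string_kernel(f, g, beta=1.0, kmax=3):
    """Sum, over k = 1..kmax, of similarities between all k-mers of f and g.

    The similarity of two k-mers is the product of the residue-level
    similarities at matching positions, so no alignment is computed.
    """
    total = 0.0
    for k in range(1, kmax + 1):
        for i in range(len(f) - k + 1):
            for j in range(len(g) - k + 1):
                prod = 1.0
                for t in range(k):
                    prod *= residue_kernel(f[i + t], g[j + t], beta)
                total += prod
    return total

def normalized_kernel(f, g, beta=1.0, kmax=3):
    """Kernel scaled so that any sequence has similarity 1 with itself."""
    denom = math.sqrt(string_kernel(f, f, beta, kmax) * string_kernel(g, g, beta, kmax))
    return string_kernel(f, g, beta, kmax) / denom

def kernel_distance(f, g, beta=1.0, kmax=3):
    """Distance induced by the normalized kernel (clamped against rounding)."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * normalized_kernel(f, g, beta, kmax)))
```

In this sketch, raising `beta` penalizes mismatched residues more strongly, and raising `kmax` lets longer shared fragments contribute to the score. The resulting distances could, for instance, feed a distance-based clustering or tree-building step, although how SeqKernel is used for phylogeny is not detailed in the abstract.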

Highlights

The amino acid sequence of a protein is the blueprint from which its structure and function can be derived
We address the problem of protein sequence comparison in the context of protein fold recognition, and show that a new string kernel drastically improves the latter compared to traditional methods based on sequence alignment
We propose to use a string kernel that provides an alignment-free measure of the similarity of two protein sequences

Summary

Introduction

The amino acid sequence of a protein is the blueprint from which its structure and function can be derived. Amino acids are usually described using a one-letter code, and protein sequences are correspondingly represented as strings of letters. This representation has proved very useful, especially in the context of sequence alignment [7, 8], which is usually performed using string-matching algorithms [9]. These algorithms proceed in two steps: first the generation of the alignment between the two sequences, then the derivation of a statistical score for that alignment. They rely on a weighting scheme to measure the cost of matching pairs of amino acids. While those show improved sensitivity, they remain prone to the problems related to the construction and use of alignments.

Journal: BMC bioinformatics	Publication Date: Feb 28, 2017
Citations: 5	License type: open-access
