Abstract

BackgroundRemote homology detection is a hard computational problem. Most approaches have trained computational models by using either full protein sequences or multiple sequence alignments (MSA), including all positions. However, when we deal with proteins in the "twilight zone" we can observe that only some segments of sequences (motifs) are conserved. We introduce a novel logical representation that allows us to represent physico-chemical properties of sequences, conserved amino acid positions and conserved physico-chemical positions in the MSA. From this, Inductive Logic Programming (ILP) finds the most frequent patterns (motifs) and uses them to train propositional models, such as decision trees and support vector machines (SVM).ResultsWe use the SCOP database to perform our experiments by evaluating protein recognition within the same superfamily. Our results show that our methodology when using SVM performs significantly better than some of the state of the art methods, and comparable to other. However, our method provides a comprehensible set of logical rules that can help to understand what determines a protein function.ConclusionsThe strategy of selecting only the most frequent patterns is effective for the remote homology detection. This is possible through a suitable first-order logical representation of homologous properties, and through a set of frequent patterns, found by an ILP system, that summarizes essential features of protein functions.

Highlights

  • Remote homology detection is a hard computational problem

  • We have demonstrated that good performance can be achieved when we used first-order logical representations for the protein sequences based on conserved amino acid positions and based on conserved physicochemical positions in the multiple sequence alignments (MSA)

  • We called Seq those models that are trained from sequential properties only, and we named Alncons those models that are trained from conserved amino acid positions in a MSA

Read more

Summary

Introduction

Remote homology detection is a hard computational problem. Most approaches have trained computational models by using either full protein sequences or multiple sequence alignments (MSA), including all positions. An important problem in Computational Biology is the detection of remote homologous proteins, that is, proteins that have a common ancestor but that have diverged significantly in their primary sequence in evolutionary history. Remote homology detection is the problem of detecting homology in cases of low sequence identity, frequently below 30%. This is an important and hard problem, the development of methods to identify homologs between proteins is essential for functional and comparative genomics. Homology detection methods are today very important to help for sequence annotation and to guide laboratory experiments

Objectives
Methods
Results
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call