WildSpan: mining structured motifs from protein sequences

Chen-Ming Hsu, Chien-Yu Chen, Baw-Jhiune Liu

doi:10.1186/1748-7188-6-6

Open Access

https://doi.org/10.1186/1748-7188-6-6

Abstract

Background: Automatic extraction of motifs from biological sequences is an important research problem in the study of molecular biology. For proteins, it is desirable to discover sequence motifs containing a large number of wildcard symbols, as the residues associated with functional sites are usually widely separated in sequences. Discovering such patterns is time-consuming because abundant combinations exist when long gaps (a gap consists of one or more successive wildcards) are considered. Mining algorithms often employ constraints to narrow down the search space in order to increase efficiency. However, improper constraint models might degrade the sensitivity and specificity of the motifs discovered by computational methods. We previously proposed a new constraint model to handle large wildcard regions for discovering functional motifs of proteins. The patterns that satisfy the proposed constraint model are called W-patterns. A W-pattern is a structured motif that groups motif symbols into pattern blocks interleaved with large irregular gaps. Considering large gaps reflects the fact that functional residues are not always from a single region of protein sequences, and restricting motif symbols into clusters corresponds to the observation that short motifs are frequently present within protein families. To efficiently discover W-patterns for large-scale sequence annotation and function prediction, this paper first formally introduces the problem to solve and then proposes an algorithm named WildSpan (sequential pattern mining across large wildcard regions) that incorporates several pruning strategies to substantially reduce the mining cost.

Results: WildSpan is shown to efficiently find W-patterns containing conserved residues that are far separated in sequences. We conducted experiments with two mining strategies, protein-based and family-based mining, to evaluate the usefulness of W-patterns and the performance of WildSpan. The protein-based mining mode of WildSpan is developed for discovering functional regions of a single protein by referring to a set of related sequences (e.g. its homologues). The discovered W-patterns are used to characterize the protein sequence, and the results are compared with the conserved positions identified by multiple sequence alignment (MSA). The family-based mining mode of WildSpan is developed for extracting sequence signatures for a group of related proteins (e.g. a protein family) for protein function classification. In this situation, the discovered W-patterns are compared with PROSITE patterns as well as the patterns generated by three existing methods performing a similar task. Finally, analysis of the execution time of WildSpan reveals that the proposed pruning strategy is effective in improving the scalability of the proposed algorithm.

Conclusions: The mining results obtained in this study reveal that WildSpan is efficient and effective in discovering functional signatures of proteins directly from sequences. The proposed pruning strategy is effective in improving the scalability of WildSpan. It is demonstrated in this study that the W-patterns discovered by WildSpan provide useful information in characterizing protein sequences. The WildSpan executable and open-source code are available on the web (http://biominer.csie.cyu.edu.tw/wildspan).
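To make the idea of pattern blocks interleaved with large irregular gaps concrete, the following is a minimal Python sketch, not the WildSpan implementation. The representation is an assumption made for illustration: a pattern is a list of blocks plus one (min, max) gap range between consecutive blocks, and a lowercase `x` inside a block denotes a single wildcard position. The function names `w_pattern_to_regex` and `support` are hypothetical, and the actual constraint model used by WildSpan is not specified in this abstract.

```python
import re


def w_pattern_to_regex(blocks, gaps):
    """Compile a block-and-gap pattern into a regular expression.

    blocks: list of strings; 'x' marks one wildcard position inside a block
            (an assumed, PROSITE-like convention).
    gaps:   list of (min_len, max_len) tuples, one between each pair of blocks.
    """
    if len(gaps) != len(blocks) - 1:
        raise ValueError("need exactly one gap range between consecutive blocks")
    parts = []
    for i, block in enumerate(blocks):
        # Conserved residues are matched literally, 'x' matches any residue.
        parts.append("".join("." if ch == "x" else re.escape(ch) for ch in block))
        if i < len(gaps):
            lo, hi = gaps[i]
            # A large irregular gap: any run of residues within the length range.
            parts.append(".{%d,%d}" % (lo, hi))
    return re.compile("".join(parts))


def support(blocks, gaps, sequences):
    """Count how many sequences contain at least one occurrence of the pattern."""
    regex = w_pattern_to_regex(blocks, gaps)
    return sum(1 for seq in sequences if regex.search(seq))


# Toy sequences, invented for illustration only.
toy_sequences = [
    "MKCAAC" + "T" * 8 + "HGLL",  # both blocks present, gap of 8 residues
    "MKCAAC" + "T" * 2 + "HGLL",  # gap too short for the range below
    "AAAAHG",                     # first block missing
]
print(support(["CxxC", "HG"], [(5, 20)], toy_sequences))
```

In this toy setting the cost of wide gap ranges is visible directly: every extra allowed gap length adds candidate alignments, which is the kind of combinatorial growth the pruning strategies described above are meant to contain.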

Highlights

Automatic extraction of motifs from biological sequences is an important research problem in the study of molecular biology.
We conduct experiments on a protein-protein docking benchmark [27] to evaluate the performance of the protein-based mining mode of WildSpan in identifying functionally important regions of proteins. With this dataset we demonstrate that WildSpan is capable of identifying sequence motifs that usually contribute to forming local structures of proteins and are related to functional interfaces.
Discovering W-patterns is important in analyzing protein sequences because protein functional motifs are usually composed of many conserved blocks that are separated in primary sequences but are often close to each other in 3-D structures.

Summary

Introduction

Automatic extraction of motifs from biological sequences is an important research problem in the study of molecular biology. We previously employed motif finding in a hybrid way: detecting functional regions of a novel sequence directly by mining its sequence along with a set of homologues found in sequence databases (MAGIIC-PRO, [8]). Similar to multiple sequence alignment (MSA), MAGIIC-PRO can be invoked as long as sufficient homologues of the query protein can be found in databases, which has become feasible with the completion of many sequencing projects. In this way, functional residues of the query protein can be predicted even when the function of the collected homologues is still unknown.
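As a rough illustration of predicting residues of a query protein from a mined pattern, the sketch below maps the conserved symbols of one pattern back onto positions in the query sequence. It is an assumption-laden example rather than the MAGIIC-PRO or WildSpan procedure: the pattern format (blocks, `x` as an in-block wildcard, one (min, max) gap range between blocks) and the function name `conserved_positions` are hypothetical, and only the first occurrence found by the regular expression engine is reported.

```python
import re


def conserved_positions(query, blocks, gaps):
    """Return 0-based query indices aligned to non-wildcard pattern symbols.

    blocks: list of strings; 'x' marks one wildcard position inside a block.
    gaps:   list of (min_len, max_len) tuples, one between each pair of blocks.
    Returns an empty list if the pattern does not occur in the query.
    """
    parts = []
    for i, block in enumerate(blocks):
        body = "".join("." if ch == "x" else re.escape(ch) for ch in block)
        parts.append("(" + body + ")")  # one capture group per block
        if i < len(gaps):
            lo, hi = gaps[i]
            parts.append(".{%d,%d}" % (lo, hi))
    match = re.search("".join(parts), query)
    if match is None:
        return []
    positions = []
    for k, block in enumerate(blocks, start=1):
        start = match.start(k)  # where block k begins in the query
        positions.extend(start + off for off, ch in enumerate(block) if ch != "x")
    return positions


# Toy query sequence, invented for illustration only.
query = "MKCAAC" + "T" * 8 + "HGLL"
print(conserved_positions(query, ["CxxC", "HG"], [(5, 20)]))
```

Positions obtained this way could then be set against, for example, conserved columns from an alignment, though how the original tools report and score such positions is not described in this text.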

Journal: Algorithms for Molecular Biology
Publication Date: Mar 31, 2011
Citations: 42
License type: CC BY
