Protein sequence-similarity search acceleration using a heuristic algorithm with a sensitive matrix.

Kyungtaek Lim, Martin C Frith, Kentaro Tomii, Kazunori D Yamada

doi:10.1007/s10969-016-9210-4

Abstract

Searching public protein databases is a fundamental step in the target selection of proteins in structural and functional genomics, and in inferring protein structure, function, and evolution. Most database search methods employ amino acid substitution matrices to score amino acid pairs, and the choice of substitution matrix strongly affects homology detection performance. We earlier proposed a substitution matrix named MIQS that was optimized for distant protein homology search. Herein we further evaluate MIQS in combination with LAST, a fast heuristic database search tool with a tunable sensitivity parameter m, where larger m denotes higher sensitivity. Results show that MIQS substantially improves the homology detection performance and alignment quality of LAST across diverse values of m. Against a protein database consisting of approximately 15 million sequences, LAST with m = 10^5 achieves better homology detection performance than BLASTP, and completes the search 20 times faster. Compared to the most sensitive existing methods in use today, CS-BLAST and SSEARCH, LAST with MIQS and m = 10^6 shows comparable homology detection performance at 2.0 and 3.9 times greater speed, respectively. These results demonstrate that MIQS-powered LAST is a time-efficient method for sensitive and accurate homology search.

Electronic supplementary material: The online version of this article (doi:10.1007/s10969-016-9210-4) contains supplementary material, which is available to authorized users.

Highlights

Protein homologs are likely to have similar structures and to perform similar functions.
We demonstrated that the application of MIQS to SSEARCH achieved the highest level of homology detection performance among pairwise aligners [13].
In this study, as a first trial, we apply MIQS to LAST while varying the m parameter, and demonstrate that it can achieve faster searching than rigorous dynamic programming methods while maintaining comparable sensitivity.

Summary

Introduction

Protein homologs are likely to have similar structures and to perform similar functions. Searching for protein homologs with known structures and functions is generally the first and most important step in selecting proteins for study and sample production, and in target selection in the field of structural and functional genomics. It is also a necessary task for biological and functional annotation in modern biology. Database search methods such as BLASTP [1] and SSEARCH [2] have been widely used for this purpose. Amino acids can be classified by the chemical properties of their side chains, suggesting that substitutions between amino acid pairs occur at distinct

Methods

Results

Conclusion