Optimal neighborhood indexing for protein similarity search

Pierre Peterlongo, Laurent Noé, Dominique Lavenier, Van Hoa Nguyen, Mathieu Giraud, Gregory Kucherov

doi:10.1186/1471-2105-9-534

Abstract

Background: Similarity inference, one of the main bioinformatics tasks, must cope with the exponential growth of biological data. A classical approach to this data flow relies on heuristics with large seed indexes. To speed up this technique, the index can be enhanced by storing additional information that limits the number of random memory accesses. However, this improvement leads to a larger index that may itself become a bottleneck. In the case of protein similarity search, we propose to decrease the index size by reducing the amino acid alphabet.

Results: The paper presents two main contributions. First, we show that an optimal neighborhood indexing, combining an alphabet reduction with a longer neighborhood, reduces the memory involved in the process by 35% without sacrificing either the quality of the results or the computation time. Second, our approach led us to develop a new kind of substitution score matrix, together with the associated e-value parameters. In contrast to the usual matrices, these matrices are rectangular, since they compare amino acid groups from different alphabets. We describe the method used to compute these matrices and provide some typical examples that can be used in such comparisons. Supplementary data can be found on the website.

Conclusion: We propose a practical reduction of the index size of the neighborhood data that does not negatively affect the performance of large-scale search in protein sequences. Such an index can be used in any study involving large protein data. Moreover, rectangular substitution score matrices and their associated statistical parameters can have applications in any study involving an alphabet reduction.
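To make the idea of a rectangular substitution matrix concrete, here is a minimal sketch in Python. It assumes the standard log-odds construction, with pair probabilities summed over each group of the reduced alphabet; the abstract does not spell out the authors' actual method, so this is only one plausible reading. The three-letter alphabet, the joint probabilities, the grouping, and the function name `rectangular_matrix` are all hypothetical and chosen only to keep the example small.

```python
import math

# Toy 3-letter "original" alphabet with made-up joint pair probabilities.
# These numbers are purely illustrative: they are NOT taken from BLOSUM
# or from the paper.
LETTERS = ["A", "B", "C"]
JOINT = [
    [0.30, 0.05, 0.05],
    [0.05, 0.20, 0.05],
    [0.05, 0.05, 0.20],
]
# Hypothetical reduced alphabet: two groups over the same letters.
GROUPS = [["A"], ["B", "C"]]


def rectangular_matrix(letters, joint, groups):
    """Log-odds scores (in bits) of each original letter against each group.

    Assumption: score(a, G) = log2( sum_{b in G} q(a, b) / (p(a) * p(G)) ),
    where q is the joint pair probability and p the background frequency.
    """
    pos = {x: k for k, x in enumerate(letters)}
    background = [sum(row) for row in joint]
    matrix = []
    for a in range(len(letters)):
        row = []
        for group in groups:
            members = [pos[x] for x in group]
            observed = sum(joint[a][b] for b in members)
            expected = background[a] * sum(background[b] for b in members)
            row.append(math.log2(observed / expected))
        matrix.append(row)
    return matrix


scores = rectangular_matrix(LETTERS, JOINT, GROUPS)
for letter, row in zip(LETTERS, scores):
    print(letter, [round(s, 2) for s in row])
```

The result has one row per letter of the original alphabet and one column per group, hence its rectangular shape. A practical matrix would additionally be scaled and rounded to integers, and would need its own e-value parameters, as the abstract notes.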

Highlights

Similarity inference, one of the main bioinformatics tasks, must cope with the exponential growth of biological data.
We focus on massive protein sequence comparisons, in which a large database is iteratively compared with relatively short queries.
The main result of our work is an effective reduction of the index size that does not degrade the quality of the similarity search results.

Summary

Introduction

Similarity inference, one of the main bioinformatics tasks, must cope with the exponential growth of biological data. A classical approach to this data flow relies on heuristics with large seed indexes. To speed up this technique, the index can be enhanced by storing additional information that limits the number of random memory accesses. However, this improvement leads to a larger index that may itself become a bottleneck. One fundamental task in bioinformatics is the large-scale comparison of proteins or families of proteins, which often constitutes the first step before further investigation. We focus on massive protein sequence comparisons, in which a large database is iteratively compared with relatively short queries (such as newly sequenced data).
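The indexing idea described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the eight-group alphabet in `GROUPS` is hypothetical, seeds are matched exactly, and candidate hits are filtered by counting group mismatches in the neighborhood instead of scoring them with a substitution matrix. The function names and parameters (`build_index`, `candidates`, `seed_len`, `nb_len`, `max_mismatch`) are invented for this example.

```python
from collections import defaultdict

# Hypothetical grouping of the 20 amino acids into 8 classes (3 bits each).
# Illustrative only; it is not the reduced alphabet selected in the paper.
GROUPS = ["AST", "C", "DENQ", "FWY", "HKR", "ILMV", "G", "P"]
REDUCE = {aa: code for code, group in enumerate(GROUPS) for aa in group}


def bits_per_residue(alphabet_size):
    """Bits needed to store one symbol of an alphabet of the given size."""
    return (alphabet_size - 1).bit_length()


def reduce_seq(s):
    """Map a string of amino acids to a tuple of group codes."""
    return tuple(REDUCE[aa] for aa in s)


def build_index(db, seed_len=3, nb_len=2):
    """Index every seed of db with its position and its reduced neighborhood."""
    index = defaultdict(list)
    for i in range(nb_len, len(db) - seed_len - nb_len + 1):
        left = db[i - nb_len:i]
        right = db[i + seed_len:i + seed_len + nb_len]
        index[db[i:i + seed_len]].append((i, reduce_seq(left + right)))
    return index


def candidates(index, query, seed_len=3, nb_len=2, max_mismatch=0):
    """Return (query_pos, db_pos) pairs whose neighborhoods are compatible.

    Only the index is read here: the database sequence itself is never
    accessed, which is the point of storing neighborhoods alongside seeds.
    """
    hits = []
    for j in range(nb_len, len(query) - seed_len - nb_len + 1):
        q_nb = reduce_seq(query[j - nb_len:j]
                          + query[j + seed_len:j + seed_len + nb_len])
        for i, d_nb in index.get(query[j:j + seed_len], ()):
            mismatches = sum(a != b for a, b in zip(q_nb, d_nb))
            if mismatches <= max_mismatch:
                hits.append((j, i))
    return hits


db = "ACDEFGHIKLMNPQRSTVWY"
index = build_index(db)
print(candidates(index, "EWGHIKL"))
print(bits_per_residue(20), bits_per_residue(len(GROUPS)))
```

In this toy run the query differs from the database in one neighborhood residue (W instead of F), but both fall into the same group, so the hit at database position 5 is kept. With 8 groups, each stored neighborhood residue takes 3 bits instead of the 5 needed for the full 20-letter alphabet, which is the kind of trade-off that lets a longer neighborhood fit in less memory. The specific 35% figure reported in the abstract depends on the authors' own alphabet and neighborhood length, and is not reproduced by this sketch.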

Journal: BMC Bioinformatics
Publication Date: Dec 1, 2008
Citations: 24
License type: CC BY 2.0
