SANS: high-throughput retrieval of protein sequences allowing 50% mismatches

J P Koskinen, L Holm

doi:10.1093/bioinformatics/bts417

Open Access

Abstract

Motivation: The genomic era in molecular biology has brought on a rapidly widening gap between the amount of sequence data and first-hand experimental characterization of proteins. Fortunately, the theory of evolution provides a simple solution: functional and structural information can be transferred between homologous proteins. Sequence similarity searching followed by k-nearest neighbor classification is the most widely used tool to predict the function or structure of anonymous gene products that come out of genome sequencing projects.

Results: We present a novel word filter, suffix array neighborhood search (SANS), to identify protein sequence similarities in the range of 50–100% identity with sensitivity comparable to BLAST and 10 times the speed of USEARCH. In contrast to these previous approaches, the complexity of the search is proportional only to the length of the query sequence and independent of database size, enabling fast searching and functional annotation into the future despite rapidly expanding databases.

Availability and implementation: The software is freely available to non-commercial users from our website http://ekhidna.biocenter.helsinki.fi/downloads/sans.

Contact: liisa.holm@helsinki.fi
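The idea behind a suffix array neighborhood search can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the `window` parameter and the voting scheme are assumptions made for illustration. The sketch also locates each query suffix by binary search, so its lookup cost grows logarithmically with database size, unlike the database-independent complexity stated in the abstract. It only shows how database suffixes that sort next to a query suffix can be used to nominate candidate proteins.

```python
from bisect import bisect_left
from collections import Counter


def build_suffix_array(proteins):
    """Return (keys, owners): all database suffixes in sorted order and,
    for each suffix, the index of the protein it came from.

    Storing whole suffix strings is quadratic in memory; acceptable for a
    toy example, not for a real database.
    """
    entries = []
    for pid, seq in enumerate(proteins):
        for start in range(len(seq)):
            entries.append((seq[start:], pid))
    entries.sort()
    keys = [suffix for suffix, _ in entries]
    owners = [pid for _, pid in entries]
    return keys, owners


def neighborhood_votes(query, keys, owners, window=2):
    """For every suffix of the query, find where it would be inserted in the
    sorted database suffixes and give one vote to each database suffix
    within `window` positions on either side (hypothetical scoring)."""
    votes = Counter()
    for start in range(len(query)):
        pos = bisect_left(keys, query[start:])
        lo = max(0, pos - window)
        hi = min(len(keys), pos + window)
        for pid in owners[lo:hi]:
            votes[pid] += 1
    return votes


if __name__ == "__main__":
    database = ["MKVLAAGIVGLLLAQ", "GSHMTTPSRDEWQKC"]  # made-up sequences
    keys, owners = build_suffix_array(database)
    # The query differs from database[0] at a single position.
    votes = neighborhood_votes("MKVLAAGIVALLLAQ", keys, owners)
    print(votes.most_common())
```

In this toy setting the protein that collects the most votes is the one sharing long suffix prefixes with the query; a real tool would pass the top candidates on to an alignment or scoring step.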

Highlights

The performance of suffix array neighborhood search (SANS) is close to that of USEARCH; SANS is 10 times faster, and USEARCH has the advantage of multiple testing.
We have investigated the use of word filters to speed up protein sequence database searches.
The principal conclusions from our extensive benchmarking can be summarized as follows:
1. word filters are as sensitive as BLAST in the feasible regime of 50–100% sequence identity;
2. many variants of word filters perform about equally well in the feasible regime, but SANS is the most robust to parameter variation;
3. suffix arrays support the fastest known word filter algorithm;
4. methods incorporating explicit alignment are necessary
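A generic word filter of the kind discussed above can be sketched as follows. This is a minimal illustration assuming an inverted index of fixed-length words (k-mers); it is not the algorithm of SANS, BLAST or USEARCH, and the names `build_word_index` and `word_filter` and the parameters `k` and `min_shared` are hypothetical.

```python
from collections import Counter, defaultdict


def build_word_index(proteins, k=3):
    """Map every word of length k to the set of proteins containing it."""
    index = defaultdict(set)
    for pid, seq in enumerate(proteins):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(pid)
    return index


def word_filter(query, index, k=3, min_shared=2):
    """Return protein indices sharing at least `min_shared` distinct words
    with the query, ordered by decreasing number of shared words."""
    words = {query[i:i + k] for i in range(len(query) - k + 1)}
    shared = Counter()
    for word in words:
        for pid in index.get(word, ()):
            shared[pid] += 1
    return [pid for pid, n in shared.most_common() if n >= min_shared]


if __name__ == "__main__":
    database = ["MKVLAAGIVGLLLAQ", "GSHMTTPSRDEWQKC"]  # made-up sequences
    index = build_word_index(database)
    # Only candidates that pass the filter would be aligned explicitly.
    print(word_filter("MKVLAAGIVALLLAQ", index))
```

The filter only nominates candidates cheaply; consistent with the fourth conclusion, a step that aligns the surviving candidates explicitly would follow in practice.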

Summary

SYSTEM AND METHODS

2.1 Protein datasets

We selected real datasets to obtain a realistic distribution of protein lengths, composition and protein family sizes (Table 2). UniProt is the major collection of protein sequences. It consists of two parts, Swiss-Prot and TrEMBL. TrEMBL contains protein sequences that are translated from nucleotide sequences and automatically annotated. The metagenome dataset is a collection of proteins detected in environmental samples and was downloaded from NCBI (env_nr). Metagenomic sequences typically come from uncultured organisms that are not present in the protein databases.

Evaluation

Database search programs

Database indexing

Sequence comparison

Complexity analysis

Greedy approximate alignment

IMPLEMENTATION

RESULTS AND DISCUSSION

Metagenome benchmark

Genome benchmark

Memory

Conclusions

Journal: Bioinformatics	Publication Date: Sep 3, 2012
Citations: 23	License type: CC BY 3.0
