FASTA Herder: a web application to trim protein sequence sets

Caroline Louis-Jeune, Carol Perez-Iratxeta, Miguel A Andrade-Navarro

doi:10.14293/s2199-1006.1.sor-life.a67837.v2

Open Access

https://doi.org/10.14293/s2199-1006.1.sor-life.a67837.v2

Abstract

The ever-increasing number of sequences in protein databases means that sequence similarity searches usually return large numbers of homologs. While information from homology can be very useful for functional prediction based on amino acid conservation, many of these homologs share high levels of identity among themselves, which hinders the computation and, especially, the visualization of multiple sequence alignments. More generally, high redundancy reduces the usability of a protein set in machine learning applications and biases statistical analyses. We developed an algorithm to identify redundant sequence homologs that can be culled to produce a streamlined FASTA file. Unlike other automatic approaches, which only aggregate sequences with high identity, our method clusters near-full-length homologs, allowing for lower sequence identity thresholds. Our method was fully tested and implemented in a web application called FASTA Herder, publicly available at http://fh.ogic.ca/.

Highlights

Multiple sequence alignment (MSA) remains the most important analytic tool to assess evolutionary relations between proteins and to determine the conserved regions of the sequence that usually harbor structural and functional properties
We developed an algorithm to identify redundant sequence homologs that can be culled to produce a streamlined FASTA file
Removing highly redundant sequences from a set of homologs helps MSA and its interpretation, and reduces biases when the protein set is taken as a sample for statistical analysis or for machine learning applications

Summary

Introduction

Multiple sequence alignment (MSA) remains the most important analytic tool to assess evolutionary relations between proteins and to determine the conserved regions of the sequence that usually harbor structural and functional properties. While information from homology can be very useful for functional prediction based on amino acid conservation, many of the homologs found in sequence similarity searches share high levels of identity among themselves, which hinders MSA computation and, especially, visualization. Unlike other automatic approaches, which only aggregate sequences with high identity, our method clusters near-full-length homologs, allowing for lower sequence identity thresholds.
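To make the idea of culling redundant homologs concrete, the following is a minimal Python sketch of a generic greedy redundancy filter. It is not the FASTA Herder algorithm, whose details are not given in this summary. The function names, the default thresholds (0.6 identity, 0.9 length ratio), and the use of `difflib.SequenceMatcher` as a rough stand-in for alignment-based identity are all assumptions made for illustration. The sketch treats two sequences as redundant only if they have similar lengths (a crude proxy for "near-full-length") and share enough matching residues.

```python
from difflib import SequenceMatcher


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_redundant(seq_a, seq_b, min_identity, min_length_ratio):
    """Hypothetical redundancy test: similar length and enough shared residues."""
    shorter, longer = sorted((len(seq_a), len(seq_b)))
    if longer == 0 or shorter / longer < min_length_ratio:
        return False  # lengths too different to count as near-full-length
    # Assumption: matching blocks approximate identity; a real tool would
    # use a proper pairwise alignment instead.
    matcher = SequenceMatcher(None, seq_a, seq_b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / longer >= min_identity


def cull(records, min_identity=0.6, min_length_ratio=0.9):
    """Greedily keep the first sequence of each group of redundant sequences."""
    kept = []
    for header, seq in records:
        if not any(
            is_redundant(seq, rep, min_identity, min_length_ratio)
            for _, rep in kept
        ):
            kept.append((header, seq))
    return kept


if __name__ == "__main__":
    example = """>seq1
MKTAYIAKQRQISFVKSHFSRQ
>seq2
MKTAYIAKQRQISFVKSHFSRK
>seq3
MKT
"""
    for header, seq in cull(parse_fasta(example)):
        print(f">{header}\n{seq}")
```

In this toy input, `seq2` differs from `seq1` at a single position and is dropped, while the short fragment `seq3` fails the length-ratio test and is therefore kept as a separate entry. Because the filter is greedy, the result depends on input order; that is a property of this sketch, not a statement about the published method.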

Results

Conclusion