Abstract

We present three clustered protein sequence databases, Uniclust90, Uniclust50 and Uniclust30, and three databases of multiple sequence alignments (MSAs), Uniboost10, Uniboost20 and Uniboost30, as a resource for protein sequence analysis, function prediction and sequence searches. The Uniclust databases cluster UniProtKB sequences at the level of 90%, 50% and 30% pairwise sequence identity. Uniclust90 and Uniclust50 clusters showed better consistency of functional annotation than those of UniRef90 and UniRef50, owing to an optimised clustering pipeline that runs with our MMseqs2 software for fast and sensitive protein sequence searching and clustering. Uniclust sequences are annotated with matches to Pfam, SCOP domains, and proteins in the PDB, using our HHblits homology detection tool. Owing to the high sensitivity of HHblits, Uniclust contains 17% more Pfam domain annotations than UniProt. Uniboost MSAs of three diversities are built by enriching the Uniclust30 MSAs with local sequence matches from MMseqs2 profile searches through Uniclust30. All databases can be downloaded from the Uniclust server at uniclust.mmseqs.com. Users can search clusters by keywords and explore their MSAs, taxonomic representation, and annotations. Uniclust is updated every two months with the new UniProt release.

Highlights

  • The number of protein sequences in public databases such as UniProt [1] or GenBank [2] is growing fast, in part due to various large-scale genomics projects [3,4,5].

  • The popular UniProt Reference Clusters (UniRef) [9] consist of three databases that are generated by clustering the UniProtKB sequences in three steps using the CDHIT software [10]: UniRef100 combines identical UniProtKB sequences and fragments with 100% sequence identity into common entries.

  • UniRef90 sequences are obtained by clustering together UniRef100 sequences that have at least 90% sequence identity and 80% sequence length overlap, and UniRef50 clusters together UniRef90 sequences with at least 50% sequence identity and 80% sequence length overlap.

Introduction

The number of protein sequences in public databases such as UniProt [1] or GenBank [2] is growing fast, in part due to various large-scale genomics projects [3,4,5]. Apart from saving computational resources, clustered databases offer a more even coverage of sequence space, which can improve the sensitivity of sequence similarity searches [6,7,8]. The popular UniProt Reference Clusters (UniRef) [9] consist of three databases that are generated by clustering the UniProtKB sequences in three steps using the CDHIT software [10]: UniRef100 combines identical UniProtKB sequences and fragments with 100% sequence identity into common entries. UniRef90 sequences are obtained by clustering together UniRef100 sequences that have at least 90% sequence identity and 80% sequence length overlap, and UniRef50 clusters together UniRef90 sequences with at least 50% sequence identity and 80% sequence length overlap.

We introduce the Uniclust sequence databases which, like UniRef, are clustered, representative sets of UniProtKB sequences at three different clustering levels. The following characteristics make Uniclust databases unique and useful. First, the sensitivity of MMseqs2 for distantly homologous sequences allows us to cluster the UniProtKB down to 30% sequence identity. Uniclust90 and Uniclust50 clusters show higher functional consistency scores than UniRef90 and UniRef50, respectively, at similar clustering depths. We provide deep annotation of Uniclust sequences with Pfam [11] and SCOP [12] domains, and matches to PDB sequences [13].
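To make the clustering criteria mentioned above concrete (for example, at least 90% sequence identity and 80% sequence length overlap for UniRef90), the following minimal Python sketch checks whether two already-aligned sequences would satisfy such thresholds. It is an illustration only, not the CDHIT or MMseqs2 implementation. The function names `identity_and_overlap` and `same_cluster` are hypothetical, and the exact definitions used here (identity counted over aligned residue pairs, overlap measured as the aligned fraction of the longer sequence) are assumptions, since tools differ in how they define both quantities.

```python
def identity_and_overlap(aligned_a: str, aligned_b: str, gap: str = "-"):
    """Return (identity, overlap) for the two rows of a pairwise alignment.

    Both rows must have the same length, with `gap` marking alignment gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have equal length")

    # Columns in which both sequences carry a residue (no gap on either side).
    paired = [(a, b) for a, b in zip(aligned_a, aligned_b)
              if a != gap and b != gap]
    if not paired:
        return 0.0, 0.0

    # Assumption: identity = identical residue pairs / aligned residue pairs.
    matches = sum(1 for a, b in paired if a == b)
    identity = matches / len(paired)

    # Assumption: overlap = aligned residue pairs / length of the longer sequence.
    len_a = len(aligned_a) - aligned_a.count(gap)
    len_b = len(aligned_b) - aligned_b.count(gap)
    overlap = len(paired) / max(len_a, len_b)

    return identity, overlap


def same_cluster(aligned_a: str, aligned_b: str,
                 min_identity: float = 0.9, min_overlap: float = 0.8) -> bool:
    """True if the pair meets both the identity and the overlap threshold."""
    identity, overlap = identity_and_overlap(aligned_a, aligned_b)
    return identity >= min_identity and overlap >= min_overlap
```

With these definitions, two 10-residue sequences that differ at one position pass the 90%/80% thresholds, whereas a fragment that is identical over only half the length of the longer sequence fails the overlap requirement despite its 100% identity. Lowering `min_identity` to 0.5 or 0.3 corresponds to the deeper clustering levels discussed in the text.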
