Abstract

Motivation: Classifying proteins into functional families can improve our understanding of protein function and can allow transferring annotations within one family. For this, functional families need to be ‘pure’, i.e., contain only proteins with identical function. Functional Families (FunFams) cluster proteins within CATH superfamilies into such groups of proteins sharing function. Of all FunFams, 11% (22 830 of 203 639) contain EC annotations, and of those, 7% (1526 of 22 830) have inconsistent functional annotations.

Results: We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings. These embeddings originate from language models transferring knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker). Using distances between embeddings and DBSCAN to cluster FunFams and identify outliers doubled the number of pure clusters per FunFam compared to random clustering. Our approach was not limited to FunFams but also succeeded on families created using sequence similarity alone. Complementing EC annotations, we observed similar results for binding annotations. Thus, we expect an increased purity also for other aspects of function. Our results can help in generating FunFams; the resulting clusters with improved functional consistency allow more reliable inference of annotations. We expect this approach to succeed equally for any other grouping of proteins by their phenotypes.

Availability and implementation: Code and embeddings are available via GitHub: https://github.com/Rostlab/FunFamsClustering.

Supplementary information: Supplementary data are available at Bioinformatics online.
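The clustering step named in the abstract (DBSCAN over distances between embeddings, with unassigned points treated as outliers) can be illustrated with a minimal sketch. This is not the authors' implementation, which is in the GitHub repository above. The 2-D toy vectors stand in for real embeddings, and the values of `eps` and `min_samples` are illustrative assumptions rather than the parameters used in the paper.

```python
import math

NOISE = -1  # label used for outliers


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: returns one cluster label per point, -1 for outliers."""
    n = len(points)
    # Neighbourhood of each point (a point counts as its own neighbour).
    neighbours = [
        [j for j in range(n) if euclidean(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep expanding
        cluster += 1
    return labels


# Toy "embeddings": two tight groups and one isolated point (hypothetical data).
toy_embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0),
]
labels = dbscan(toy_embeddings, eps=0.5, min_samples=3)
print(labels)
```

In this toy case the two tight groups become two clusters and the isolated point is labelled as an outlier. For real embeddings, an established implementation such as scikit-learn's `DBSCAN` would be the more practical choice.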

Highlights

  • Knowledge about the function of a protein is crucial for a wide array of …

  • CATH Functional Families (FunFams) (Orengo et al., 1997; Sillitoe et al., 2020) provide a functional sub-classification of CATH superfamilies

  • We propose an approach to further cluster FunFams into functionally more consistent sub-families by encoding their sequences through embeddings

  • These embeddings originate from language models transferring knowledge gained from predicting missing amino acids in a sequence (ProtBERT) and have been further optimized to distinguish between proteins belonging to the same or a different CATH superfamily (PB-Tucker)
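The idea of optimizing embeddings so that proteins from the same CATH superfamily end up close and proteins from different superfamilies end up far apart can be sketched with a triplet margin loss. This is one common contrastive objective and is used here only as an assumption for illustration; the text above does not state which loss PB-Tucker was trained with. The vectors and the `margin` value are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther away than the positive.

    anchor/positive: embeddings assumed to come from the same superfamily.
    negative: embedding assumed to come from a different superfamily.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = (0.0, 0.0)
same_superfamily = (0.0, 1.0)
far_other_superfamily = (3.0, 4.0)   # already well separated from the anchor
near_other_superfamily = (1.0, 0.0)  # as close to the anchor as the positive

loss_separated = triplet_margin_loss(anchor, same_superfamily, far_other_superfamily)
loss_violating = triplet_margin_loss(anchor, same_superfamily, near_other_superfamily)
print(loss_separated, loss_violating)
```

A training procedure would minimize this quantity over many such triplets, so only triplets that violate the margin contribute a gradient.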

Summary

Results

The function of two proteins is more similar, the more levels of their two EC numbers …

… reduced the number of different EC annotations in a FunFam allowed validating our new approach. Consider the following two FunFams: FF1 has several proteins all annotated with the same two EC numbers, EC1 and EC2, while FF2 has …

… that this orthogonal perspective, using embedding rather than sequence space, might help to find functionally consistent sub-groups within protein families built using sequence similarity.

We applied contrastive learning (Becker and Hinton, 1992; Bromley et al., 1993; Le-Khac et al., 2020) on the ProtBERT embeddings to learn a new embedding space, which was optimized to increase the distance between CATH superfamilies: it brings members of one superfamily closer while pushing members of different superfamilies …

… computed the median over those distances for all FunFams in one superfamily and used this value for each FunFam. The resulting value still reflects the expected distance between pairs, but the effect of large distances due to impurity should be averaged out by considering all FunFams in a superfamily.

Pure FunFams as the percentage of FunFams with a purity of 100 …
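The quantities mentioned in this summary (shared EC levels, purity, the percentage of pure FunFams, and a per-superfamily median distance) can be made concrete with a small sketch. The purity definition used here, the share of proteins carrying the most common set of EC numbers, is an assumption, since the excerpt does not define purity. All function names and example values are hypothetical.

```python
from collections import Counter
from statistics import median


def shared_ec_levels(ec_a, ec_b):
    """Count leading EC levels two EC numbers share (0 to 4); '-' never matches."""
    levels = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        if a != b or a == "-":
            break
        levels += 1
    return levels


def purity(annotations):
    """Percentage of proteins carrying the most common EC annotation set.

    `annotations` holds one collection of EC numbers per protein
    (assumed definition of purity, not taken from the paper).
    """
    counts = Counter(frozenset(ecs) for ecs in annotations)
    return 100.0 * counts.most_common(1)[0][1] / len(annotations)


def percent_pure(families):
    """Percentage of families whose purity is exactly 100."""
    pure = sum(1 for annotations in families if purity(annotations) == 100.0)
    return 100.0 * pure / len(families)


def superfamily_distance(per_funfam_distances):
    """One value per superfamily: the median over its FunFams' distances."""
    return median(per_funfam_distances)


# FF1: every protein has the same two EC numbers; FF2: one deviating protein.
ff1 = [{"EC1", "EC2"}, {"EC1", "EC2"}, {"EC1", "EC2"}]
ff2 = [{"1.1.1.1"}, {"1.1.1.1"}, {"1.1.1.1"}, {"2.7.11.1"}]

levels = shared_ec_levels("3.4.21.4", "3.4.21.5")
purities = (purity(ff1), purity(ff2))
pure_share = percent_pure([ff1, ff2])
threshold = superfamily_distance([0.8, 1.0, 3.5])  # 3.5 mimics an impure FunFam
print(levels, purities, pure_share, threshold)
```

Under this assumed definition, FF1 counts as pure even though each protein has two EC numbers, because all proteins share the same annotation set. The median also shows why a single unusually large distance, as might arise from an impure FunFam, has little effect on the value used for the whole superfamily.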

Clustering
Embedding clusters increased EC purity
Similar improvement for single domain proteins
Details of parameter choices mattered
Slightly worse results for experimentally verified
Conclusion