CATH: increased structural coverage of functional space.

Ian Sillitoe, Karel Berka, Harry M Scholes, Ivana Hutařová Varekova, Natalie Dawson, Camilla S M Pang, Laurel Woodridge, Jon Lees, Mahnaz Abbasian, Su Datt Lam, Vaishali P Waman, Nicola Bordin, Radka Svobodova, Clemens Rauer, Paul Ashford, Christine A Orengo, Sean Le Cornu, Neeladri Sen

doi:10.1093/nar/gkaa1079

Abstract

CATH (https://www.cathdb.info) identifies domains in protein structures from wwPDB and classifies these into evolutionary superfamilies, thereby providing structural and functional annotations. There are two levels: CATH-B, a daily snapshot of the latest domain structures and superfamily assignments, and CATH+, with additional derived data such as predicted sequence domains and functionally coherent sequence subsets (Functional Families or FunFams). The latest CATH+ release, version 4.3, significantly increases coverage of structural and sequence data, adding 65 351 fully classified domain structures (+15%) to give 500 238 structural domains, and 151 million predicted sequence domains (+59%), assigned to 5481 superfamilies. The FunFam generation pipeline has been re-engineered to cope with the increased influx of data. Three times more sequences are captured in FunFams, with a concomitant increase in functional purity, information content and structural coverage. FunFam expansion increases the structural annotations provided for experimental GO terms (+59%). We also present CATH-FunVar web pages displaying variations in protein sequences and their proximity to known or predicted functional sites. We present two case studies: (1) putative cancer drivers and (2) SARS-CoV-2 proteins. Finally, we have improved links to and from CATH, including SCOP, InterPro, Aquaria and 2DProt.
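The growth figures quoted above can be sanity-checked with a few lines of arithmetic. The sketch below assumes that each percentage is measured relative to the previous release total, which the abstract does not state explicitly; the function names are illustrative and not part of any CATH tooling.

```python
def percent_increase(added: float, previous: float) -> float:
    """Growth of `added` relative to the `previous` total, in percent."""
    return 100.0 * added / previous


def implied_previous(current: float, pct: float) -> float:
    """Previous total implied by a current total and a percent increase."""
    return current / (1.0 + pct / 100.0)


# Structural domains: figures taken from the abstract.
current_domains = 500_238
added_domains = 65_351
previous_domains = current_domains - added_domains  # implied previous total

structure_growth = percent_increase(added_domains, previous_domains)
print(f"previous structural domains: {previous_domains}")
print(f"structural growth: {structure_growth:.1f}%")

# Predicted sequence domains: 151 million after a 59% increase
# (assumption: the 59% is relative to the previous release).
previous_seq_millions = implied_previous(151.0, 59.0)
print(f"implied previous sequence domains: {previous_seq_millions:.1f} million")
```

Under that assumption, the structural figures are internally consistent: the added domains amount to roughly 15% of the implied previous total.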

Highlights

The CATH database, originally developed in 1997 [1], provides an up-to-date and systematic structural classification of protein 3D structures and is one of the Core Data Resources within ELIXIR, a major European distributed infrastructure for life-science information.
To obtain the predicted sequence domain data, we use a set of representative structural domains to ‘seed’ a set of protein sequence alignments, which are converted into hidden Markov models (HMMs).
Our new release of CATH represents a significant expansion in both structure (15% increase) and sequence (59% increase) data.
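The seed-alignment-to-HMM step mentioned in the highlights can be sketched as two command lines. The example assumes the HMMER suite (`hmmbuild`, `hmmsearch`) as the HMM toolkit, which the text above does not name; the file paths and the helper name `seed_to_hmm_commands` are hypothetical. The commands are only assembled, not executed.

```python
from pathlib import Path


def seed_to_hmm_commands(seed_alignment: str, sequence_db: str, workdir: str = "."):
    """Assemble HMMER commands for one seed alignment (assumed toolkit).

    Returns (build, search) as argument lists suitable for subprocess.run.
    Nothing is executed here.
    """
    stem = Path(seed_alignment).stem  # e.g. "1.10.8.10" from "1.10.8.10.sto"
    hmm_path = str(Path(workdir) / f"{stem}.hmm")
    hits_path = str(Path(workdir) / f"{stem}.domtblout")

    # Step 1: convert the seed alignment into a profile HMM.
    build = ["hmmbuild", hmm_path, seed_alignment]
    # Step 2: scan a sequence database with the HMM, writing per-domain hits.
    search = ["hmmsearch", "--domtblout", hits_path, hmm_path, sequence_db]
    return build, search


# Hypothetical inputs: a Stockholm seed alignment and a FASTA sequence database.
build_cmd, search_cmd = seed_to_hmm_commands(
    "seeds/1.10.8.10.sto", "uniprot.fasta", workdir="out"
)
print(" ".join(build_cmd))
print(" ".join(search_cmd))
```

Returning argument lists rather than running the tools keeps the sketch independent of a local HMMER installation; in practice each list could be passed to `subprocess.run`.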

Summary

Introduction

The CATH database, originally developed in 1997 [1], provides an up-to-date and systematic structural classification of protein 3D structures and is one of the Core Data Resources within ELIXIR, a major European distributed infrastructure for life-science information. Both CATH and its sister resource Gene3D provide comprehensive structural domain assignments and functional annotation for protein sequences from major protein sequence databases such as UniProt and Ensembl [5,6]. CATH provides structural superpositions of all representative protein domains using an in-house structure and sequence alignment program (SSAP) [7].

Results

Conclusion