Predicting Functional Surface Topographies Combining Topological Data Analysis and Deep Learning Across the Human Protein Universe.

Abstract

Characterizing geometric and topological properties of protein structures encompassing surface pockets, interior cavities, and cross channels is important for understanding their functions. Our knowledge of protein structures has been greatly advanced by AI-powered structure prediction tools, with AlphaFold2 (AF2) providing accurate 3D structure predictions for most protein sequences. Nonetheless, there is a substantial lack of function annotations and corresponding functional surface topographical information. We develop a method to predict functional pockets, along with their associated Gene Ontology (GO) terms and Enzyme Commission (EC) numbers, for a set of 65,013 AF2-predicted human non-singleton representative structures, which can be mapped to 186,095 "non-fragment" AF2-predicted human protein structures. The identification of functional pockets, along with their respective GO terms and EC numbers, is achieved by combining topological data analysis and the deep learning method of DeepFRI. All predicted functional pockets for these 65,013 AF2-predicted human representative structures are accessible at: https://cfold.bme.uic.edu/castpfold.

Similar Papers
  • Research Article
  • 10.1093/bioinformatics/btad402
LEGO-CSM: a tool for functional characterization of proteins
  • Jun 29, 2023
  • Bioinformatics
  • Thanh Binh Nguyen + 4 more

Motivation: With the development of sequencing techniques, the discovery of new proteins significantly exceeds the human capacity and resources for experimentally characterizing protein functions. Localization, EC numbers, and GO terms with the structure-based Cutoff Scanning Matrix (LEGO-CSM) is a comprehensive web-based resource that fills this gap by leveraging the well-established and robust graph-based signatures to supervised learning models using both protein sequence and structure information to accurately model protein function in terms of Subcellular Localization, Enzyme Commission (EC) numbers, and Gene Ontology (GO) terms.
Results: We show our models perform as well as or better than alternative approaches, achieving area under the receiver operating characteristic curve of up to 0.93 for subcellular localization, up to 0.93 for EC, and up to 0.81 for GO terms on independent blind tests.
Availability and implementation: LEGO-CSM’s web server is freely available at https://biosig.lab.uq.edu.au/lego_csm. In addition, all datasets used to train and test LEGO-CSM’s models can be downloaded at https://biosig.lab.uq.edu.au/lego_csm/data.

  • Research Article
  • 10.4028/www.scientific.net/amm.421.277
Identification and Analysis of Single- and Multiple-Region Mitotic Protein Complexes by Grouping Gene Ontology Terms
  • Sep 11, 2013
  • Applied Mechanics and Materials
  • Wen Lin Huang + 3 more

Many mitotic proteins are assembled into protein super complexes in three regions - midbody, centrosome and kinetochore (MCK) - with distinctive roles in modulating the mitosis process. However, more than 16% of the mitotic proteins are in multiple regions. Advance identification of mitotic proteins will help to elucidate the molecular regulatory mechanisms of this organelle. Few ensemble-classifier methods can solve this problem, but these methods often fuse various complementary features. Among these features, Gene Ontology (GO) terms play an important role, but the GO-term search space is massive and sparse. This motivates this work to present an easily implemented method, namely mMck-GO, which identifies a small number of GO terms with a support vector machine (SVM) and k-nearest neighbor (KNN) for predicting single- and multiple-region MCK proteins. The mMck-GO method, using a simple grouping scheme based on an SVM classifier, assembles the GO terms into several groups according to their numbers of annotated proteins in the training dataset, and then measures which top-grouped GO terms perform best. A new MCK protein dataset containing 701 proteins (611 single- and 90 multiple-region) is established in this work. None of the MCK proteins has a 25% pair-wise sequence identity with any other proteins in the same region. When applied to this dataset, we find that the GO term with the maximum annotation number annotates 49.2% of the training protein sequences; in contrast, 56.5% of the GO terms annotate only a single protein sequence. This shows the sparse character of GO terms and the effectiveness of top-grouped GO terms in distinguishing MCK proteins. Accordingly, a small group of top 134 GO terms is identified and mMck-GO fuses the GO terms with amino acid composition (AAC) as input features to yield and independent-testing accuracies of 71.66% and 69.18%, respectively.
The top 30 GO terms contain eight, eight, and 14 GO terms belonging to the molecular function, biological process and cellular component branches, respectively. The 14 GO terms in the cellular-component ontology, in addition to centrosome and kinetochore, are relevant to subcellular compartments, microtubule, membrane, and spindle, where GO:0005737 (cytoplasm) is ranked first. The eight GO terms in the molecular-function ontology include GO:0005515 (protein binding), GO:0000166 (nucleotide binding), and GO:0005524 (ATP binding). Most of the eight GO terms in the biological-process ontology are relevant to cell cycle, cell division and mitosis, but two GO terms, GO:0045449 and GO:0045449, are relevant to regulation of transcription and transport processes, which helps us to clarify the molecular regulatory mechanisms of this organelle. The top-grouped GO terms can serve as an indispensable feature set when combined with other feature types to solve multiple-class problems in the investigation of biological functions.

  • Research Article
  • Citations: 7
  • 10.14806/ej.18.b.540
Answering Gene Ontology terms to proteomics questions by supervised macro reading in Medline
  • Nov 9, 2012
  • EMBnet.journal
  • Julien Gobeill + 4 more

Motivation and Objectives: Biomedical professionals have at their disposal a huge amount of literature. But when they have a precise question, they often have to deal with too many documents to efficiently find the appropriate answers in a reasonable time. Faced with this literature overload, the need for automatic assistance has been widely pointed out, and PubMed is argued to be only the beginning of how scientists use the biomedical literature (Hunter and Cohen, 2006). Ontology-based search engines began to introduce semantics in search results. These systems still display documents, but the user visualizes clusters of PubMed results according to concepts which were extracted from the abstracts. GoPubMed (Doms and Schroeder, 2005) and EBIMed (Rebholz-Schuhmann et al, 2007) are popular examples of such ontology-based search engines in the biomedical domain. Question Answering (QA) systems are argued to be the next generation of semantic search engines (Wren, 2011). QA systems no longer display documents but directly display concepts which were extracted from the search results; these concepts are supposed to answer the user’s question formulated in natural language. EAGLi (Gobeill et al, 2009), our locally developed system, is an example of such QA search engines. Thus, both ontology-based and QA search engines share the crucial task of efficiently extracting concepts from the result set, i.e. a set of documents. This task is sometimes called macro reading, in contrast with micro reading – or classification, categorization – which is a traditional Natural Language Processing task that aims at extracting concepts from a single document (Mitchell et al, 2009). This paper focuses on macro reading of MEDLINE abstracts. Several experiments have been reported to find the best way to extract ontology terms out of a single MEDLINE abstract, i.e. micro reading.
In particular, Trieschnigg et al (2009) compared the performances of six classification systems for reproducing the manual Medical Subject Headings (MeSH) annotation of a MEDLINE abstract. The evaluated systems included two morphosyntactic classifiers (sometimes also called thesaurus-based), which aim at literally finding ontology terms in the abstract by alignment of words, and a machine learning (or supervised) classifier, which aims at inferring the annotation from a knowledge base containing already annotated abstracts. The authors concluded that the machine learning approach outperformed the morphosyntactic ones. But the macro reading task is fundamentally different, as we look for the best way to extract and then combine ontology terms from a set of MEDLINE abstracts. The issue investigated in this paper is: to what extent are the differences observed between two classifiers for a micro reading task also observed for a macro reading one? In particular, the redundancy hypothesis claims that the redundancy in large textual collections such as the Web or MEDLINE tends to smooth performance differences across classifiers (Lin, 2007). To address this question, we compared a morphosyntactic and a machine learning classifier for both tasks, focusing on the extraction of Gene Ontology (GO) terms, a controlled vocabulary for the characterization of protein functions. The micro reading task consisted of extracting GO terms from a single MEDLINE abstract, as in the work of Trieschnigg et al; the macro reading task consisted of extracting GO terms from a set of MEDLINE abstracts in order to answer proteomics questions asked to the EAGLi QA system.

  • Research Article
  • Citations: 31
  • 10.1186/1471-2105-9-52
The development of PIPA: an integrated and automated pipeline for genome-wide protein function annotation
  • Jan 25, 2008
  • BMC Bioinformatics
  • Chenggang Yu + 5 more

Background: Automated protein function prediction methods are needed to keep pace with high-throughput sequencing. With the existence of many programs and databases for inferring different protein functions, a pipeline that properly integrates these resources will benefit from the advantages of each method. However, integrated systems usually do not provide mechanisms to generate customized databases to predict particular protein functions. Here, we describe a tool termed PIPA (Pipeline for Protein Annotation) that has these capabilities.
Results: PIPA annotates protein functions by combining the results of multiple programs and databases, such as InterPro and the Conserved Domains Database, into common Gene Ontology (GO) terms. The major algorithms implemented in PIPA are: (1) a profile database generation algorithm, which generates customized profile databases to predict particular protein functions, (2) an automated ontology mapping generation algorithm, which maps various classification schemes into GO, and (3) a consensus algorithm to reconcile annotations from the integrated programs and databases.
PIPA's profile generation algorithm is employed to construct the enzyme profile database CatFam, which predicts catalytic functions described by Enzyme Commission (EC) numbers. Validation tests show that CatFam yields average recall and precision larger than 95.0%. CatFam is integrated with PIPA.
We use an association rule mining algorithm to automatically generate mappings between terms of two ontologies from annotated sample proteins. Incorporating the ontologies' hierarchical topology into the algorithm increases the number of generated mappings. In particular, it generates 40.0% additional mappings from the Clusters of Orthologous Groups (COG) to EC numbers and a six-fold increase in mappings from COG to GO terms. The mappings to EC numbers show a very high precision (99.8%) and recall (96.6%), while the mappings to GO terms show moderate precision (80.0%) and low recall (33.0%).
Our consensus algorithm for GO annotation is based on the computation and propagation of likelihood scores associated with GO terms. The test results suggest that, for a given recall, the application of the consensus algorithm yields higher precision than when consensus is not used.
Conclusion: The algorithms implemented in PIPA provide automated genome-wide protein function annotation based on reconciled predictions from multiple resources.

  • Book Chapter
  • Citations: 1
  • 10.1007/978-3-030-45385-5_24
Graph Based Automatic Protein Function Annotation Improved by Semantic Similarity
  • Jan 1, 2020
  • Bishnu Sarker + 3 more

Functional annotation of proteins is a very challenging task, primarily because manual annotation requires a great amount of human effort and it is still nearly impossible to keep pace with the exponentially growing number of protein sequences coming into the public databases, thanks to high-throughput sequencing technology. For example, the UniProt Knowledge-base (UniProtKB) is currently the largest and most comprehensive resource for protein sequence and annotation data. According to the November 2019 release of UniProtKB, some 561,000 sequences are manually reviewed but over 150 million sequences lack reviewed functional annotations. Moreover, manual annotation is expensive in terms of both the cost it incurs and the time it takes. Yet exploiting this huge quantity of data is important for understanding life at the molecular level, and is central to understanding human disease processes and drug discovery. To be useful, protein sequences need to be annotated with functional properties such as Enzyme Commission (EC) numbers and Gene Ontology (GO) terms. The ability to automatically annotate protein sequences in UniProtKB/TrEMBL, the non-reviewed UniProt sequence repository, would represent a major step towards bridging the gap between annotated and un-annotated protein sequences. In this paper, we extend a neighborhood-based network inference technique for automatic GO annotation using a protein similarity graph built on protein domain and family information. The underlying philosophy of our approach assumes that proteins can be linked through the domains, families, and superfamilies that they share. We propose an efficient pruning and post-processing technique by integrating semantic similarity of GO terms. We show by empirical results that the proposed hierarchical post-processing potentially improves the performance of other GO annotation tools as well.
Keywords: Graph mining; Bioinformatics; Knowledge discovery; Protein function annotation; Network inference; GrAPFI

  • Peer Review Report
  • 10.7554/elife.86504.sa2
Author response: A single-cell transcriptome atlas of pig skin characterizes anatomical positional heterogeneity
  • May 11, 2023
  • Rong Yuan + 17 more


  • Research Article
  • Citations: 163
  • 10.1038/msb4100043
Global analysis of gene function in yeast by quantitative phenotypic profiling
  • Jan 1, 2006
  • Molecular Systems Biology
  • James A Brown + 8 more

We present a method for the global analysis of the function of genes in budding yeast based on hierarchical clustering of the quantitative sensitivity profiles of the 4756 strains with individual homozygous deletion of nonessential genes to a broad range of cytotoxic or cytostatic agents. This method is superior to other global methods of identifying the function of genes involved in the various DNA repair and damage checkpoint pathways as well as other interrogated functions. Analysis of the phenotypic profiles of the 51 diverse treatments places a total of 860 genes of unknown function in clusters with genes of known function. We demonstrate that this can not only identify the function of unknown genes but can also suggest the mechanism of action of the agents used. This method will be useful when used alone and in conjunction with other global approaches to identify gene function in yeast.

  • Research Article
  • Citations: 29
  • 10.1093/nar/gkae415
CASTpFold: Computed Atlas of Surface Topography of the universe of protein Folds.
  • May 23, 2024
  • Nucleic acids research
  • Bowei Ye + 3 more

Geometric and topological properties of protein structures, including surface pockets, interior cavities, and cross channels, are of fundamental importance for proteins to carry out their functions. Computed Atlas of Surface Topography of proteins (CASTp) is a widely used web server for locating, delineating, and measuring these geometric and topological properties of protein structures. Recent developments in AI-based protein structure prediction such as AlphaFold2 (AF2) have significantly expanded our knowledge on protein structures. Here we present CASTpFold, a continuation of CASTp that provides accurate and comprehensive identifications and quantifications of protein topography. It now provides (i) results on an expanded database of proteins, including the Protein Data Bank (PDB) and non-singleton representative structures of AlphaFold2 structures, covering 183 million AF2 structures; (ii) functional pockets prediction with corresponding Gene Ontology (GO) terms or Enzyme Commission (EC) numbers for AF2-predicted structures; and (iii) pocket similarity search function for surface and protein-protein interface pockets. The CASTpFold web server is freely accessible at https://cfold.bme.uic.edu/castpfold/.

  • Research Article
  • 10.1016/j.bbapap.2023.140985
FunPredCATH: An ensemble method for predicting protein function using CATH
  • Dec 19, 2023
  • BBA - Proteins and Proteomics
  • Joseph Bonello + 1 more


  • Research Article
  • Citations: 17
  • 10.1186/1471-2105-11-215
PoGO: Prediction of Gene Ontology terms for fungal proteins
  • Apr 29, 2010
  • BMC Bioinformatics
  • Jaehee Jung + 3 more

Background: Automated protein function prediction methods are the only practical approach for assigning functions to genes obtained from model organisms. Many of the previously reported function annotation methods are of limited utility for fungal protein annotation. They are often trained only to one species, are not available for high-volume data processing, or require the use of data derived by experiments such as microarray analysis. To meet the increasing need for high throughput, automated annotation of fungal genomes, we have developed a tool for annotating fungal protein sequences with terms from the Gene Ontology.
Results: We describe a classifier called PoGO (Prediction of Gene Ontology terms) that uses statistical pattern recognition methods to assign Gene Ontology (GO) terms to proteins from filamentous fungi. PoGO is organized as a meta-classifier in which each evidence source (sequence similarity, protein domains, protein structure and biochemical properties) is used to train independent base-level classifiers. The outputs of the base classifiers are used to train a meta-classifier, which provides the final assignment of GO terms. An independent classifier is trained for each GO term, making the system amenable to updating, without having to re-train the whole system. The resulting system is robust. It provides better accuracy and can assign GO terms to a higher percentage of unannotated protein sequences than other methods that we tested.
Conclusions: Our annotation system overcomes many of the shortcomings that we found in other methods. We also provide a web server where users can submit protein sequences to be annotated.

  • Research Article
  • Citations: 1
  • 10.1016/j.ygeno.2022.110528
GOCompare: An R package to compare functional enrichment analysis between two species
  • Nov 30, 2022
  • Genomics
  • Chrystian C Sosa + 7 more


  • Research Article
  • Citations: 4
  • 10.1021/acs.jproteome.0c00482
Bioinformatic Prediction of Gene Ontology Terms of Uncharacterized Proteins from Chromosome 11
  • Oct 22, 2020
  • Journal of Proteome Research
  • Heeyoun Hwang + 7 more

In chromosome 11, 71 out of its 1254 proteins remain functionally uncharacterized on the basis of their existence evidence (uPE1s) following the latest version of neXtProt (release 2020-01-17). Because in vivo and in vitro experimental strategies are often time-consuming and labor-intensive, there is a need for a bioinformatics tool to predict the function annotation. Here, we used I-TASSER/COFACTOR provided on the neXtProt web site, which predicts gene ontology (GO) terms based on the 3D structure of the protein. I-TASSER/COFACTOR predicted 2413 GO terms with a benchmark dataset of the 22 proteins belonging to PE1 of chromosome 11. In this study, we developed a filtering algorithm in order to select specific GO terms using the GO map generated by I-TASSER/COFACTOR. As a result, 187 specific GO terms showed a higher average precision-recall score at the least cellular component term compared to 2413 predicted GO terms. Next, we applied 65 proteins belonging to uPE1s of chromosome 11, and then 409 out of 6684 GO terms survived, where 103 and 142 GO terms of molecular function and biological process, respectively, were included. Representatively, the cellular component GO terms of CCDC90B, C11orf52, and the SMAP were predicted and validated using the overexpression system into 293T cells and immunofluorescence staining. We will further study their biological and molecular functions toward the goal of the neXt-CP50 project as a part of C-HPP. We shared all results and programs in Github (https://github.com/heeyounh/I-TASSER-COFACTOR-filtering.git).

  • Research Article
  • Citations: 36
  • 10.1110/ps.062158406
New avenues in protein function prediction
  • Jun 1, 2006
  • Protein Science
  • Iddo Friedberg + 2 more


  • Research Article
  • Citations: 29
  • 10.1093/nar/gkn374
Associating transcription factor-binding site motifs with target GO terms and target genes
  • Jun 10, 2008
  • Nucleic Acids Research
  • Mikael Bodén + 1 more

The roles and target genes of many transcription factors (TFs) are still unknown. To predict the roles of TFs, we present a computational method for associating Gene Ontology (GO) terms with TF-binding motifs. The method works by ranking all genes as potential targets of the TF, and reporting GO terms that are significantly associated with highly ranked genes. We also present an approach, whereby these predicted GO terms can be used to improve predictions of TF target genes. This uses a novel gene-scoring function that reflects the insight that genes annotated with GO terms predicted to be associated with the TF are more likely to be its targets. We construct validation sets of GO terms highly associated with known targets of various yeast and human TF. On the yeast reference sets, our prediction method identifies at least one correct GO term for 73% of the TF, 49% of the correct GO terms are predicted and almost one-third of the predicted GO terms are correct. Results on human reference sets are similarly encouraging. Validation of our target gene prediction method shows that its accuracy exceeds that of simple motif scanning.

  • Research Article
  • Citations: 13
  • 10.1016/j.jtbi.2012.07.027
Ranking Gene Ontology terms for predicting non-classical secretory proteins in eukaryotes and prokaryotes
  • Aug 7, 2012
  • Journal of Theoretical Biology
  • Wen-Lin Huang

