Abstract

O-GlcNAcylation is a post-translational modification widely found in proteins across phyla. It consists of the addition of β-N-acetylglucosamine to the hydroxy group of serine or threonine residues. O-GlcNAcylation modulates a myriad of biological processes, with more than 5000 O-GlcNAcylated proteins reported in humans to date. Since the loss of the dbOGA database, O-GlcNAc-specific data have been dispersed across the literature and web databases not focused on O-GlcNAc, raising the need for a single integrated online resource. To this end, we recently compiled and published the O-GlcNAc Database (www.oglcnac.mcw.edu), which provides the scientific community with a place to browse all published O-GlcNAcylation data. For humans, this O-GlcNAcome catalogue contains more than 5000 proteins and 7000 O-GlcNAcylation sites, each entry being provided with the relevant subset of references. In addition, we provide entry-specific protein-digest tools, as well as an advanced search mode to match large experimental datasets with the O-GlcNAc Database content. We offer download options for entry-specific data (proteins and literature items), entire datasets and graphical representations in several formats for general (CSV, XLSX, PDF, BIB) and programming-oriented (JSON) use. To ensure data consistency, we match each O-GlcNAc site against UniProtKB protein sequences (including isoforms) prior to integration in the Database. The review and integration of new literature items are accelerated by Natural Language Processing methods based on Machine Learning. From our manually curated set of references, we generated a list of frequent expression patterns that allow us to describe each document with binary values, depending on the presence or absence of a given expression pattern. Then, we trained a Neural Network classifier which aims to predict, for each document, whether it reports the identification of O-GlcNAcylated proteins.
When the classifier is applied to newly published O-GlcNAc articles, detailed reports are generated that include prediction results, document-specific keywords for species, proteins, O-GlcNAcylation sites and methods, as well as the sentences containing those keywords. Overall, we developed a user-friendly interface within an integrated system administered mostly through automated pipelines, thus saving human time and limiting interventions to validation and revision steps. We hope the O-GlcNAc Database will be both a useful resource for the field and an inspiring framework for scientific developers.
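
As an illustration only, two steps described above can be sketched in a few lines of Python: encoding a document as binary values according to the presence or absence of expression patterns, and checking that a reported O-GlcNAc site agrees with a protein sequence. This is a minimal sketch under stated assumptions, not the Database's actual pipeline. The pattern list, the function names `binary_features` and `site_matches_sequence`, and the toy sequence are all hypothetical; in practice the sequence would come from UniProtKB and the patterns from the curated reference set.

```python
import re

# Hypothetical expression patterns, chosen for illustration. The real list
# was derived from the manually curated references and is not shown here.
PATTERNS = [
    r"O-GlcNAc(ylated|ylation)?",
    r"mass spectrometry",
    r"\b(Ser|Thr|[ST])\d+\b",
]


def binary_features(text, patterns=PATTERNS):
    """Return one 0/1 value per pattern: 1 if the pattern occurs in the text."""
    return [1 if re.search(p, text, flags=re.IGNORECASE) else 0 for p in patterns]


def site_matches_sequence(sequence, residue, position):
    """Check a reported site against a protein sequence.

    residue is a one-letter code and position is 1-based. O-GlcNAcylation
    occurs on serine or threonine, so any other residue is rejected.
    """
    if residue not in ("S", "T"):
        return False
    if not 1 <= position <= len(sequence):
        return False
    return sequence[position - 1] == residue


if __name__ == "__main__":
    doc = "We identified O-GlcNAcylated proteins by mass spectrometry."
    print(binary_features(doc))
    # Toy sequence (assumption); a real check would use a UniProtKB entry.
    print(site_matches_sequence("MASTK", "S", 3))
```

The resulting binary vectors could then serve as input to any standard classifier. The position check would be repeated for each isoform sequence when a site is not found on the canonical one.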
