Annotation of the Giardia proteome through structure-based homology and machine learning.

Brendan R E Ansell, Bernard J Pope, Peter Georgeson, Samantha J Emery-Corbin, Aaron R Jex

doi:10.1093/gigascience/giy150

Abstract

Background: Large-scale computational prediction of protein structures represents a cost-effective alternative to empirical structure determination with particular promise for non-model organisms and neglected pathogens. Conventional sequence-based tools are insufficient to annotate the genomes of such divergent biological systems. Conversely, protein structure tolerates substantial variation in primary amino acid sequence and is thus a robust indicator of biochemical function. Structural proteomics is poised to become a standard part of pathogen genomics research; however, informatic methods are now required to assign confidence in large volumes of predicted structures.

Aims: Our aim was to predict the proteome of a neglected human pathogen, Giardia duodenalis, and stratify predicted structures into high- and lower-confidence categories using a variety of metrics in isolation and combination.

Methods: We used the I-TASSER suite to predict structural models for ∼5,000 proteins encoded in G. duodenalis and identify their closest empirically-determined structural homologues in the Protein Data Bank. Models were assigned to high- or lower-confidence categories depending on the presence of matching protein family (Pfam) domains in query and reference peptides. Metrics output from the suite and derived metrics were assessed for their ability to predict the high-confidence category individually, and in combination through development of a random forest classifier.

Results: We identified 1,095 high-confidence models including 212 hypothetical proteins. Amino acid identity between query and reference peptides was the greatest individual predictor of high-confidence status; however, the random forest classifier outperformed any metric in isolation (area under the receiver operating characteristic curve = 0.976) and identified a subset of 305 high-confidence-like models, corresponding to false-positive predictions. High-confidence models exhibited greater transcriptional abundance, and the classifier generalized across species, indicating the broad utility of this approach for automatically stratifying predicted structures. Additional structure-based clustering was used to cross-check confidence predictions in an expanded family of Nek kinases. Several high-confidence-like proteins yielded substantial new insight into mechanisms of redox balance in G. duodenalis, a system central to the efficacy of limited anti-giardial drugs.

Conclusion: Structural proteomics combined with machine learning can aid genome annotation for genetically divergent organisms, including human pathogens, and stratify predicted structures to promote efficient allocation of limited resources for experimental investigation.

Highlights

Giardia duodenalis is a microaerophilic, parasitic protist that causes diarrheal disease in 200–300 million people annually.
With the aim of obviating the need for additional informatics analysis after large-scale structure prediction, we investigate the power of individual I-TASSER output metrics to correctly assign models as high-confidence (HC) or lower-confidence (LC), and develop a random forest (RF) classifier that successfully predicts these categories and provides a more sensitive, continuous confidence score.
We assigned high confidence in structural and functional information predicted for query peptides when at least one protein family (Pfam) code matched across query and reference peptides.
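
The Pfam-matching rule in the last highlight can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' pipeline: the function name `assign_confidence`, the model names, and the Pfam accessions in the example are all assumptions made up for demonstration.

```python
from typing import Dict, Iterable, List, Tuple


def assign_confidence(query_pfams: Iterable[str], reference_pfams: Iterable[str]) -> str:
    """Return "HC" if query and reference share at least one Pfam code, else "LC"."""
    shared = set(query_pfams) & set(reference_pfams)
    return "HC" if shared else "LC"


# Hypothetical records: (Pfam codes on the query peptide, Pfam codes on the
# closest structural homologue). Accessions are placeholders for illustration.
models: Dict[str, Tuple[List[str], List[str]]] = {
    "model_a": (["PF00069"], ["PF00069", "PF07714"]),  # one shared code
    "model_b": (["PF00085"], ["PF00070"]),             # no shared code
    "model_c": ([], ["PF00069"]),                      # query has no Pfam domain
}

labels = {name: assign_confidence(q, r) for name, (q, r) in models.items()}
print(labels)
```

Under this rule a query with no Pfam annotation at all can never be labelled HC, which is one reason a continuous score from a classifier is useful for such proteins.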

Summary

Introduction

Giardia duodenalis is a microaerophilic, parasitic protist that causes diarrheal disease in 200–300 million people annually. Research on G. duodenalis faces problems shared with other human pathogens, including protists in the genera Plasmodium, Trichomonas, and Entamoeba, and bacteria such as Mycobacterium tuberculosis. Because these pathogens encompass massive genetic diversity and are often incompatible with standard laboratory culture or reverse genetic technologies, insufficient functional gene annotation hampers basic research and therapeutic development. Our aim was to predict the proteome of G. duodenalis and stratify predicted structures into high- and lower-confidence categories using a variety of metrics in isolation and combination. Amino acid identity between query and reference peptides was the greatest individual predictor of high-confidence status; however, the random forest classifier outperformed any metric in isolation (area under the receiver operating characteristic curve = 0.976) and identified a subset of 305 high-confidence-like models, corresponding to false-positive predictions.
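
To make the evaluation metric concrete, the sketch below shows one way to compute the area under the receiver operating characteristic curve (AUC) for a single metric such as amino acid identity. It uses the rank-based definition: the probability that a randomly chosen high-confidence model scores higher than a randomly chosen lower-confidence model, with ties counted as half. The function name `roc_auc` and the toy values are assumptions for illustration; they are not data or code from the paper.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], is_hc: Sequence[bool]) -> float:
    """AUC as the fraction of (HC, LC) pairs in which the HC model scores higher.

    Ties contribute 0.5. Raises ValueError if either class is empty.
    """
    pos = [s for s, y in zip(scores, is_hc) if y]
    neg = [s for s, y in zip(scores, is_hc) if not y]
    if not pos or not neg:
        raise ValueError("need at least one HC and one LC model")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy values invented for illustration: fractional sequence identity per model
# and whether that model was labelled high-confidence.
identity = [0.62, 0.48, 0.35, 0.30, 0.22, 0.15]
hc_label = [True, True, False, True, False, False]

auc = roc_auc(identity, hc_label)
print(f"AUC for identity alone: {auc:.3f}")
```

An AUC of 0.5 corresponds to a metric that carries no information about the HC label, and 1.0 to perfect separation. The same calculation applies to the class probabilities output by a classifier, which is how a combined model can be compared against each metric in isolation.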

Journal: GigaScience
Publication Date: Dec 6, 2018
Citations: 21
License type: CC BY 4.0
