Large Protein Datasets Research Articles

Background: Determining a protein’s quaternary state, i.e. the number of monomers in a functional unit, is a critical step in protein characterization. Many proteins form multimers for their activity, and over 50% are estimated to naturally form homomultimers. Experimental quaternary state determination can be challenging and require extensive work. To complement these efforts, a number of computational tools have been developed for quaternary state prediction, often utilizing experimentally validated structural information. Recently, dramatic advances have been made in the field of deep learning for predicting protein structure and other characteristics. Protein language models, such as ESM-2, that apply computational natural-language models to proteins successfully capture secondary structure, protein cell localization and other characteristics, from a single sequence. Here we hypothesize that information about the protein quaternary state may be contained within protein sequences as well, allowing us to benefit from these novel approaches in the context of quaternary state prediction.

Results: We generated ESM-2 embeddings for a large dataset of proteins with quaternary state labels from the curated QSbio dataset. We trained a model for quaternary state classification and assessed it on a non-overlapping set of distinct folds (ECOD family level). Our model, named QUEEN (QUaternary state prediction using dEEp learNing), performs worse than approaches that include information from solved crystal structures. However, it successfully learned to distinguish multimers from monomers, and predicts the specific quaternary state with moderate success, better than simple sequence similarity-based annotation transfer. Our results demonstrate that complex, quaternary state related information is included in such embeddings.

Conclusions: QUEEN is the first to investigate the power of embeddings for the prediction of the quaternary state of proteins. As such, it lays out strengths as well as limitations of a sequence-based protein language model approach, compared to structure-based approaches. Since it does not require any structural information and is fast, we anticipate that it will be of wide use both for in-depth investigation of specific systems, as well as for studies of large sets of protein sequences. A simple colab implementation is available at: https://colab.research.google.com/github/Furman-Lab/QUEEN/blob/main/QUEEN_prediction_notebook.ipynb.
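The general idea of classifying quaternary state from sequence embeddings can be illustrated with a minimal sketch. This is not the QUEEN model, whose architecture the abstract does not describe. It assumes per-residue vectors are mean-pooled into one vector per protein and swaps in a simple nearest-centroid classifier; the helper names (`mean_pool`, `fit_centroids`, `predict`) and the 2-D toy vectors are hypothetical and made up purely for illustration.

```python
from collections import defaultdict
from math import dist


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def fit_centroids(embeddings, labels):
    """Compute one centroid per quaternary-state label."""
    groups = defaultdict(list)
    for embedding, label in zip(embeddings, labels):
        groups[label].append(embedding)
    return {label: mean_pool(vecs) for label, vecs in groups.items()}


def predict(centroids, embedding):
    """Return the label whose centroid is nearest (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(centroids[label], embedding))


# Made-up per-residue vectors (2-D for readability; real ESM-2 embeddings
# have hundreds to thousands of dimensions). Labels: 1 = monomer, 2 = dimer.
train = [
    ([[0.0, 0.0], [0.2, 0.0]], 1),
    ([[0.1, 0.1], [0.1, 0.3]], 1),
    ([[1.0, 1.0], [1.2, 0.8]], 2),
    ([[0.9, 1.1], [1.1, 1.1]], 2),
]
X = [mean_pool(residues) for residues, _ in train]
y = [label for _, label in train]

centroids = fit_centroids(X, y)
query = mean_pool([[1.0, 0.9], [1.0, 1.1]])
predicted_state = predict(centroids, query)
print(predicted_state)
```

In a realistic setting, the training and test proteins would be split so that they do not share folds, as the abstract describes for its ECOD-based evaluation; the toy data here make no attempt to model that.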

Nitrogen (N) fertilizer is essential to ensure grain yield and quality in bread wheat. Improving N use efficiency is therefore crucial for wheat grain protein quality. In the present work, we analysed the effects on the winter wheat grain proteome of biostimulants containing Glutacetine® or two derived formulations (VNT1 and VNT4) when mixed with urea-ammonium-nitrate fertilizer. A large-scale quantitative proteomics analysis of two wheat flour fractions produced a dataset of 4369 identified proteins. Quantitative analysis revealed 9, 39 and 96 proteins with a significant change in abundance after Glutacetine®, VNT1 and VNT4 treatments, respectively, with a common set of 11 proteins that were affected by two different biostimulants. The major effects impacted proteins involved in (i) protein synthesis regulation (mainly ribosomal and binding proteins), (ii) defence and responses to stresses (including chitin-binding protein, heat shock 70 kDa protein 1 and glutathione S-transferase proteins), (iii) storage functions related to gluten protein alpha-gliadins and starch synthase and (iv) seed development with proteins implicated in protease activity, energy machinery, and the C and N metabolism pathways. Altogether, our study showed that Glutacetine®, VNT1 and VNT4 biostimulants positively affected protein composition related to grain quality.

Data are available via ProteomeXchange with identifier PXD021513.

Significance: We performed a large-scale quantitative proteomics study of the total protein extracts from flour samples to determine the effect of Glutacetine®-based biostimulant treatment on the protein composition of bread wheat grain. To our knowledge, only a few studies in the literature have applied proteomic approaches to study bread wheat grains and in particular to investigate the effect of biostimulants on the grain proteome of this cereal crop. In addition, most approaches used fractional extraction of proteins to target reserve proteins followed by electrophoresis, which leads to a low protein identification rate. We identified and quantified a large dataset of 4369 proteins and determined the ontological classes of proteins affected by biostimulant treatments. Our proteomics investigation revealed the important role of these new biostimulants in achieving significant changes in protein synthesis regulation, storage functions, protease activity, energy machinery, C and N metabolism pathways and responses to biotic and abiotic stresses in grain.

Articles published on Large Protein Datasets

Application of artificial intelligence and machine learning techniques to the analysis of dynamic protein sequences.

INGNN-DTI: prediction of drug-target interaction with interpretable nested graph neural network and pretrained molecule models.

RedRibbon: A new rank-rank hypergeometric overlap for gene and transcript expression signatures.

Protein language models can capture protein quaternary state

Deep transfer learning for inter-chain contact predictions of transmembrane protein complexes

SHEPHARD: a modular and extensible software architecture for analyzing and annotating large protein datasets.

COLLAPSE: A representation learning framework for identification and characterization of protein structural sites.

Estimating amino acid substitution models for metazoan evolutionary studies.

Evaluating Mineral Lattices as Evolutionary Proxies for Metalloprotein Evolution.

Pathogenic variation types in human genes relate to diseases through Pfam and InterPro mapping.

GRaSP-web: a machine learning strategy to predict binding sites based on residue neighborhood graphs.

PolyX2: Fast Detection of Homorepeats in Large Protein Datasets.

Inter-paralog amino acid inversion events in large phylogenies of duplicated proteins.

Complementarity of the residue-level protein function and structure predictions in human proteins

DispHScan: A Multi-Sequence Web Tool for Predicting Protein Disorder as a Function of pH.

PARROT is a flexible recurrent neural network framework for analysis of large protein datasets.

Dynamics-Evolution Correspondence in Protein Structures.

Protein secondary structure prediction (PSSP) using different machine algorithms

Biostimulant impacts of Glutacetine® and derived formulations (VNT1 and VNT4) on the bread wheat grain proteome

Avoided motifs: short amino acid strings missing from protein datasets.
