Automated Alphabet Reduction for Protein Datasets

Jaume Bacardit, Natalio Krasnogor, Robert E Smith, Michael Stout, Alfonso Valencia, Jonathan D Hirst

doi:10.1186/1471-2105-10-6

Abstract

Background: We investigate automated and generic alphabet reduction techniques for protein structure prediction datasets. Reducing alphabet cardinality without losing key biochemical information opens the door to potentially faster machine learning, data mining and optimization applications in structural bioinformatics. Furthermore, reduced but informative alphabets often result in, e.g., more compact and human-friendly classification/clustering rules. In this paper we propose a robust and sophisticated alphabet reduction protocol based on mutual information and state-of-the-art optimization techniques.

Results: We applied this protocol to the prediction of two protein structural features: contact number and relative solvent accessibility. For both features we generated alphabets of two, three, four and five letters. The five-letter alphabets gave prediction accuracies statistically similar to those obtained using the full amino acid alphabet. Moreover, the automatically designed alphabets were compared against other reduced alphabets, either taken from the literature or human-designed, and outperformed them. The differences between our alphabets and the alphabets taken from the literature were quantitatively analyzed. All of the above was performed using a primary sequence representation of proteins. As a final experiment, we extrapolated the obtained five-letter alphabet to reduce a much richer protein representation, based on evolutionary information, for the prediction of the same two features. Again, the performance gap between the full representation and the reduced representation was small, showing that the results of our automated alphabet reduction protocol, even though they were obtained using a simple representation, also capture the crucial information needed for state-of-the-art protein representations.

Conclusion: Our automated alphabet reduction protocol generates competent reduced alphabets tailored specifically for a variety of protein datasets. This process is done without any domain knowledge, using information theory metrics instead. The reduced alphabets contain some unexpected (but sound) groups of amino acids, thus suggesting new ways of interpreting the data.
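To make the idea of a mutual-information objective concrete, the sketch below scores one candidate grouping of amino acids by the mutual information between the reduced letter and a class label. It is a simplified illustration, not the protocol from the paper: the abstract says only that the protocol is based on mutual information and an optimization technique, so the single-residue formulation, the grouping in `GROUPS`, the toy residues and labels, and the function names are all assumptions made here.

```python
from collections import Counter
from math import log2

# Hypothetical 3-letter grouping, for illustration only.
# It is NOT one of the alphabets reported in the paper.
GROUPS = {"a": "AVLIMFWC", "b": "GSTPYHNQ", "c": "DEKR"}
AA_TO_LETTER = {aa: letter for letter, members in GROUPS.items() for aa in members}


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired observations."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    # p(x,y) * log2(p(x,y) / (p(x) p(y))), written with raw counts.
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def score_grouping(residues, labels, aa_to_letter):
    """Map each residue to its reduced letter, then score against the labels."""
    reduced = [aa_to_letter[r] for r in residues]
    return mutual_information(reduced, labels)


# Toy data (assumed): one residue per example, with a binary class label
# standing in for, e.g., buried vs. exposed.
residues = list("AVLDEKGS")
labels = [1, 1, 1, 0, 0, 0, 1, 0]

full_score = mutual_information(residues, labels)
reduced_score = score_grouping(residues, labels, AA_TO_LETTER)
print(f"full alphabet:    {full_score:.3f} bits")
print(f"reduced alphabet: {reduced_score:.3f} bits")
```

Merging letters can never increase the mutual information with the label, so an optimizer searching over groupings would be looking for the partition that loses as little of it as possible at a given alphabet size.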

Highlights

We investigate automated and generic alphabet reduction techniques for protein structure prediction datasets
This paper develops the use of information theory based automated procedures for alphabet reduction in Protein Structure Prediction (PSP) datasets
Our investigations indicate that: (1) it is possible to find a reduced alphabet whose performance is statistically equivalent to that obtained with the full amino acid (AA) type representation, (2) this does not compromise accuracy and enhances interpretability, (3) different problems might require different reductions, and (4) the alphabets obtained from primary sequence data can be successfully adapted to richer representations using evolutionary information

Summary

Introduction

We investigate automated and generic alphabet reduction techniques for protein structure prediction datasets. Reducing alphabet cardinality without losing key biochemical information opens the door to potentially faster machine learning, data mining and optimization applications in structural bioinformatics. The prediction of the 3D structure of protein chains, known as Protein Structure Prediction (PSP), is a key challenge in structural bioinformatics. Rosetta@home [1], one of the top predictors in the CASP7 (Critical Assessment of techniques for protein Structure Prediction) experiment, used up to 10,000 computing days to model a single protein. One way in which PSP calculations might be accelerated is by using a divide-and-conquer approach, where the problem of predicting the tertiary structure of a given sequence is split into smaller challenges, such as predicting secondary structure, solvent accessibility, coordination number, etc. The alphabet by which the sequence of a protein is represented would be an obvious focus for any reduction mechanism.
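A rough way to see why cardinality matters is to count the distinct inputs a learner could face. The sketch below assumes, purely for illustration, that each example is a fixed-length window of residues around a target position; the window length of 9 is a hypothetical choice, not a value taken from the paper.

```python
def window_space(alphabet_size: int, window: int) -> int:
    """Number of distinct residue windows of a given length over an alphabet."""
    return alphabet_size ** window


WINDOW = 9  # hypothetical window length, chosen only for this illustration

# 20 is the full amino acid alphabet; 5, 4, 3 and 2 are the reduced sizes
# studied in the paper.
for k in (20, 5, 4, 3, 2):
    print(f"{k:>2} letters: {window_space(k, WINDOW):,} possible windows")
```

Under this assumption, going from 20 letters to 5 shrinks the space of possible windows by a factor of 4^9, which gives a sense of how much smaller the search space for rules or models can become.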

Journal: BMC Bioinformatics	Publication Date: Jan 6, 2009
Citations: 103	License type: CC BY 2.0
