Statistical inductive inference of protein structural alignments

James H Collier

doi:10.4225/03/58b79813d9110

Abstract

Proteins are complex biological molecules that perform a vast array of functions crucial to life. A small set of computational tasks underpins the study of proteins. One of these supports the comparison of proteins using the notion of alignment. An alignment between proteins allows biologists to understand their evolutionary relationship. Due to the functional constraints that exist on protein biomolecules, finding reliable alignments requires the comparison of their three-dimensional structures (rather than their sequences). The resulting alignments are called protein structural alignments (rather than sequence alignments). The quality of alignments has important consequences for research in protein biology, as they are the foundation for many aspects of protein research. The problem of finding reliable structural alignments is commonly posed as a combinatorial optimisation problem, which requires an optimisation strategy (a search method to find the best alignments) and an objective function (a measure of alignment quality). The objective function must arbitrate a trade-off between the structural fidelity of the proteins being aligned and the complexity of the alignment itself. The alignment search algorithm then finds the alignment that the objective function considers optimal. Over the past five decades, many alignment methods have been conceived to identify structural alignments between proteins. Concerningly, the alignments these methods produce differ substantially and are often contradictory. Many comparative studies of structural alignment methods have highlighted the absence of a clear consensus on what constitutes a good structural alignment and the lack of a statistically rigorous measure of alignment quality. This has been cited as a leading cause of the observed proliferation of new structural alignment methods, which tend to make small modifications to previous approaches.
This thesis proposes a fundamental shift in the way structural alignment quality is formalised and measured, and in the way biologically meaningful alignments are identified. It brings together ideas from the fields of information theory, data compression, and statistical inductive inference to develop a statistically rigorous framework to measure structural alignment quality. The resulting alignment quality measure, called I-value, is built on the Bayesian framework of minimum message length inference. Furthermore, this thesis develops a search algorithm that employs I-value to consistently identify high-quality and statistically significant structural alignments. This search method is also able to identify significant alternative structural alignments of comparable quality. The culmination of this work is an open-source pairwise structural alignment program called MMLigner (available from http://lcb.infotech.monash.edu.au/mmligner). The performance of MMLigner is benchmarked against popular alignment programs and alignment scoring functions. MMLigner was found to be highly competitive with other methods and to consistently outperform them in identifying alternative structural alignments, a challenging problem when aligning oligomeric proteins and protein complexes.

Awards: Vice-Chancellor's Commendation for Doctoral Thesis in Excellence in 2016.

Full Text