Motif Discovery in Protein Structure Databases

Janice Glasgow, Evan Steeg

doi:10.1093/oso/9780195119404.003.0011

Abstract

The field of knowledge discovery is concerned with the theory and processes involved in the representation and extraction of patterns or motifs from large databases. Discovered patterns can be used to group data into meaningful classes, to summarize data, or to reveal deviant entries. Motifs stored in a database can be brought to bear on difficult instances of structure prediction or determination from X-ray crystallography or nuclear magnetic resonance (NMR) experiments. Automated discovery techniques are central to understanding and analyzing the rapidly expanding repositories of protein sequence and structure data. This chapter deals with the discovery of protein structure motifs.

A motif is an abstraction over a set of recurring patterns observed in a dataset; it captures the essential features shared by a set of similar or related objects. In many domains, such as computer vision and speech recognition, there exist special regularities that permit such motif abstraction. In the protein science domain, the regularities derive from evolutionary and biophysical constraints on amino acid sequences and structures. The identification of a known pattern in a new protein sequence or structure permits the immediate retrieval and application of knowledge obtained from the analysis of other proteins. The discovery and manipulation of motifs (in DNA, RNA, and protein sequences and structures) is thus an important component of computational molecular biology and genome informatics.

In particular, identifying protein structure classifications at varying levels of abstraction allows us to organize and increase our understanding of the rapidly growing protein structure datasets. Discovered motifs are also useful for improving the efficiency and effectiveness of X-ray crystallographic studies of proteins, for drug design, for understanding protein evolution, and ultimately for predicting the structure of proteins from sequence data.

Motifs may be designed by hand, based on expert knowledge. For example, the Chou-Fasman protein secondary structure prediction program (Chou and Fasman, 1978), which dominated the field for many years, depended on the recognition of predefined, user-encoded sequence motifs for α-helices and β-sheets. Several hundred sequence motifs have been cataloged in PROSITE (Bairoch, 1992); the identification of one of these motifs in a novel protein often allows for immediate function interpretation.
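
To make the idea of recognizing a predefined sequence motif concrete, the following is a minimal sketch of how a PROSITE-style pattern might be matched against a protein sequence. It is not taken from the chapter or from any PROSITE tool: the function names `prosite_to_regex` and `find_motif` are hypothetical, the translation covers only a common subset of the pattern syntax, and the example sequence is invented for illustration. The pattern `N-{P}-[ST]-{P}` is the well-known N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles only a subset of the syntax (an assumption of this sketch):
    x = any residue, [ABC] = allowed set, {ABC} = forbidden set,
    e(n) or e(n,m) = repetition, '<' / '>' = sequence start / end.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        counted = re.fullmatch(r"(.+?)\((\d+(?:,\d+)?)\)", element)
        if counted:                          # e.g. x(2) or x(2,4)
            element = counted.group(1)
            repeat = "{" + counted.group(2) + "}"
        if element == "x":
            core = "."                       # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"  # forbidden residues
        else:
            core = element                   # single residue or [ABC] set
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return (start, matched_text) for every occurrence, overlaps included."""
    # A lookahead with a capturing group reports overlapping matches too.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    hits = find_motif("N-{P}-[ST]-{P}.", "MKNGSAANPTQ")  # invented sequence
    print(hits)
```

In this toy sequence the asparagine at zero-based position 2 satisfies the pattern, while the one at position 7 does not because it is followed by a proline. Hand-built motifs of this kind are easy to apply, which is part of why cataloged patterns allow quick functional annotation of a new sequence.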
