Classification and function prediction of proteins using data compression

Ken Sugawara, Toshinori Watanabe

doi:10.1007/bf02481265

Abstract

Proteins have complicated spatial structures, and their chemical and physical functions originate from these structures. Predicting the structure and function of a protein from its DNA sequence or amino acid sequence is important for biology, medical science, protein engineering, and related fields. However, to date there is no way to predict them accurately from these sequences. Instead, some approaches attempt to estimate function from approximate similarity found through sequence retrieval. We propose a new method for the similarity retrieval of amino acid sequences based on the concept of homology retrieval using data compression. Introducing dictionary-based compression enables us to describe text data as an n-dimensional vector using n dictionaries, which are generated by compressing n typical texts, and enables us to classify proteins based on their similarity. We examined the effectiveness of our proposal using real genome data.
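The idea of describing a sequence as an n-dimensional vector over n dictionaries can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes DEFLATE with a preset dictionary (Python's `zlib`) as a stand-in for the dictionary technique, uses the compression ratio as the vector component, and the names `compressed_size` and `compression_vector` as well as the example sequences are hypothetical.

```python
import zlib


def compressed_size(text: bytes, dictionary: bytes = b"") -> int:
    """Return the DEFLATE-compressed size of `text` in bytes.

    If `dictionary` is non-empty, it is used as a preset dictionary,
    so substrings shared with it compress to short back-references.
    """
    if dictionary:
        comp = zlib.compressobj(level=9, zdict=dictionary)
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(text) + comp.flush())


def compression_vector(sequence: str, references: list[str]) -> list[float]:
    """Describe `sequence` as one compression ratio per reference text.

    Assumption: each reference text itself plays the role of a
    dictionary. A smaller component means the sequence shares more
    substrings with that reference.
    """
    data = sequence.encode("ascii")
    return [
        compressed_size(data, ref.encode("ascii")) / len(data)
        for ref in references
    ]


if __name__ == "__main__":
    # Made-up amino acid strings for illustration only, not real genome data.
    ref_a = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV"
    ref_b = "GSSHHHHHHSSGLVPRGSHMASMTGGQQMGRGSEFELRRQACGRTRAPPPPPLRSGC"
    query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV"
    print(compression_vector(query, [ref_a, ref_b]))
```

Sequences whose vectors lie close together could then be grouped with any standard distance-based classifier; which distance and classifier suit protein data is left open here.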

Full Text