Toward Automatic Recognition of Field Description Syntax and Parser Generation for Genome Databases

Satoshi Takeo, Kenji Satou

doi:10.11234/gi1990.11.444

Abstract

Because of their flexibility, XML and similar approaches to managing genome databases have attracted increasing attention as alternatives to traditional schemes based on relational databases. However, we are convinced that, except for a few major databases such as GenBank and SWISSPROT, the primary format of genome databases will remain flat files (plain text) for the near future. The reason is that the typical user of a genome database is a researcher, so an entry is inevitably readable, semi-structured text rather than a collection of structured data with a rigid schema. In addition, the schema of a genome database may be complicated, frequently changed, or both. This situation has long been an obstacle to parsing entries in genome databases, decomposing them, and storing them in fully structured systems such as DBMSs and XML. To solve this problem, a methodology is needed for automatically recognizing the various description syntaxes of fields in human-readable, semi-structured text databases such as genome databases. For example, most “description” and “comment” fields are written in free format across successive lines, since they contain sentences in natural language. In contrast, “keyword” and “organism” fields have a simple syntax that enumerates short terms separated by delimiters such as semicolons. Between these extremes, genome databases contain a variety of field description syntaxes designed for human readability rather than machine readability. To accelerate XML-related research in bioinformatics and to make better use of data elements in semi-structured genome databases, we have begun research on the automatic recognition of field description syntax in genome databases. In this research, we focus on the pattern of occurrence of special characters (including newline characters) in a field. Since most semi-structured texts use such characters to express their description syntax, they may serve as a clue for recognizing it. More generally, this recognition problem can be regarded as the discovery of a pattern, such as a regular expression, from string data given as sample database entries. Using least generalization, we expect to find a pattern that covers most of the samples and approximates the result of human recognition. Furthermore, successful recognition leads directly to automatic parser generation.
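As a rough illustration of the idea, and not the authors' actual algorithm, the following Python sketch reduces each sample field to the sequence of its special characters, generalizes repeated "text plus delimiter" units, and keeps the candidate pattern that covers the most samples. The collapsing of repeats is only a crude stand-in for least generalization. All names (`ORDINARY`, `skeleton`, `to_regex`, `best_pattern`) and the sample keyword values are hypothetical, and the choice of which characters count as ordinary is an assumption.

```python
import re

# Assumption: letters, digits and spaces are "ordinary"; everything else,
# including the newline character, counts as a special character.
ORDINARY = r"[A-Za-z0-9 ]+"

def skeleton(text):
    """Replace each run of ordinary characters with 'w', keeping special characters."""
    return re.sub(ORDINARY, "w", text)

def to_regex(skel):
    """Turn a skeleton into a regex in which each 'w'+special unit may repeat."""
    parts = []
    for m in re.finditer(r"(w[^w])\1*|w|[^w]", skel):
        if m.group(1):                    # ordinary run followed by a delimiter
            delim = re.escape(m.group(1)[1])
            parts.append("(?:" + ORDINARY + delim + ")+")
        elif m.group(0) == "w":           # ordinary run with no delimiter after it
            parts.append(ORDINARY)
        else:                             # isolated special character
            parts.append(re.escape(m.group(0)))
    return "".join(parts)

def best_pattern(samples):
    """Return (regex, coverage) for the candidate that matches the most samples."""
    candidates = {to_regex(skeleton(s)) for s in samples}

    def coverage(rx):
        hits = sum(re.fullmatch(rx, s) is not None for s in samples)
        return hits / len(samples)

    best = max(sorted(candidates), key=coverage)
    return best, coverage(best)

# Hypothetical "keyword" field values: short terms separated by semicolons.
keyword_fields = [
    "Kinase; Transferase; Zinc.",
    "Hydrolase; Protease.",
    "Membrane.",
]
pattern, cov = best_pattern(keyword_fields)
print(pattern, cov)
```

In this sketch the selected pattern covers the two semicolon-delimited samples but not the single-term one, which mirrors the goal of a pattern that covers most, rather than all, of the samples. A pattern accepted in this way could then drive a simple parser, for example by splitting the field on the recognized delimiter.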
