A practical implementation of DCGs

Jukka Paakki

doi:10.1007/3-540-53669-8_91

Abstract

Definite clause grammars (DCGs) are a logic counterpart to context-free grammars, the most widely used formalism for defining the syntax of languages. The conventional implementation strategy for DCGs is translation into the logic programming language Prolog, giving rise to a parser for the defined language. This translation can be carried out in a rather straightforward way, which is why DCGs are provided as an enhancement in a number of Prolog systems.

Definite clause grammars were originally presented in [PEW80] as a formalism for describing natural languages. The implementation method presented there (and adopted as such in the Prolog systems) thus takes into account the most general structure of natural languages and produces a parser which is inherently nondeterministic.

The appealing combination of a definition formalism (context-free grammars) and a direct operational realization of that formalism (a parser) has been the inspiration for applying DCGs outside their original intended domain as well. The most notable example is compiler writing for programming languages. While DCGs have been commonly presented as an advanced compiler writing tool (e.g. [STS86], [Szp87]), a more careful analysis of their traditional implementation strategy reveals shortcomings with respect to practical parsing of programming languages. The problems become apparent immediately when looking at the way a DCG is transformed into ordinary Prolog code:

1. Nondeterministic parsing is simulated with Prolog's normal backtracking mechanism. The order of the alternative productions therefore has a great effect on the efficiency of the parser, since they are tried in the order in which they appear in the DCG.
2. The parser cannot in general deal with left-recursive productions (nor with regular expressions) and may loop infinitely when trying to apply such a production. This may happen both for syntactically legal inputs and for syntactically illegal ones.
3. No recognition of syntax errors is provided; instead the parser simply fails (or loops) on a syntactically erroneous input.
4. Scanning cannot be interleaved with parsing, since the source program is represented as a complete list of tokens. This necessarily leads to at least two passes over the source program just for parsing it.

Since DCGs as a notation are rather compact and elegant, they certainly are a powerful tool in experimental and prototyping programming language implementation. However, when aiming at practical applications, the standard implementation makes DCGs unusable.

We have produced an implementation of DCGs that aims at removing the trouble spots while retaining the general and powerful notation. Our system accepts DCGs in their standard form, but the implementation most notably circumvents problems 1 and 3 mentioned above: in our system a DCG gives rise to a deterministic error-recovering parser that never fails or loops.

The main objective of our DCG implementation is to provide an automatic error handling mechanism. Since syntactic error detection and recovery are most laborious in connection with nondeterministic parsing, and since ambiguous grammars are rarely needed in defining modern programming languages, we have decided to base our DCG facility on deterministic parsing. The particular grammar class is LL(1),
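
The shortcomings of the conventional translation listed in the abstract can be made concrete with a small sketch. The Python code below is an illustration only, not the paper's system and not actual Prolog output: it mimics the standard scheme in which each nonterminal becomes a procedure from an input token list to the remaining token list, with generators standing in for Prolog's backtracking. The toy grammar and the names `term`, `expr`, `left_expr`, and `accepts` are assumptions introduced here.

```python
# Hypothetical toy grammar (not taken from the paper):
#   expr --> term, [+], expr.
#   expr --> term.
#   term --> [x].
# Each function maps a token list to every possible remaining token list,
# mirroring the "input list / rest list" argument pair of the usual
# DCG-to-Prolog translation. Generators play the role of backtracking.

def term(tokens):
    # term --> [x].
    if tokens[:1] == ["x"]:
        yield tokens[1:]

def expr(tokens):
    # Alternatives are tried strictly in the order written.
    # expr --> term, [+], expr.
    for rest in term(tokens):
        if rest[:1] == ["+"]:
            yield from expr(rest[1:])
    # expr --> term.
    yield from term(tokens)

def left_expr(tokens):
    # left_expr --> left_expr, [+], term.   (left-recursive, tried first)
    # The recursive call is made before any token is consumed, so a
    # Prolog parser would loop; this sketch exhausts Python's recursion
    # limit instead.
    for rest in left_expr(tokens):
        if rest[:1] == ["+"]:
            yield from term(rest[1:])
    # left_expr --> term.
    yield from term(tokens)

def accepts(tokens):
    # Parsing succeeds only if some alternative consumes the whole input.
    return any(rest == [] for rest in expr(tokens))
```

In this sketch, `accepts(["x", "+", "x"])` succeeds, while `accepts(["x", "+"])` merely returns `False` with no indication of where the error lies. The whole token list must also exist before parsing starts. A deterministic LL(1) parser avoids the backtracking by choosing a production from one token of lookahead, which is what makes systematic error reporting and recovery feasible.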
