Lexical Parsing Expression Recognition Schemata

Markus Lumpe

doi:10.1109/aswec.2015.27

Abstract

Parsing expression grammars (PEGs) have emerged as a promising substitute for context-free grammars (CFGs) and regular expressions (REs) in programming language specification. The benefits of PEGs are twofold. First, parsing expression grammars replace unordered choice between alternatives by prioritized choice, which naturally solves the ubiquitous dangling else problem in grammar definitions. Second, PEGs employ syntax specifications that eliminate the need to separate the lexical and hierarchical components of a language specification. However, there is no free lunch in PEGs. PEGs capture only syntactic relationships, but many language constructs cannot be parsed without additional semantic information. Moreover, character-level specifications can become unwieldy, as every aspect of the language, including spacing, has to be accounted for. To overcome these issues, we extend the original PEG formalism to incorporate semantic predicates that yield a programmatic means for state-based token recognition control. Furthermore, rather than requiring a single complete specification, we capture lexical components as PEG closures that provide a self-contained token recognition mechanism to reduce the clutter associated with purely character-level PEGs. To test the effectiveness of our approach, we use it for the construction of a Delphi language front-end and practically confirm that Ford's theoretical linear-time result also holds for PEG closures.

Full Text