A functional toolkit for morphological and phonological processing, application to a Sanskrit tagger

Gérard Huet

doi:10.1017/s0956796804005416

Abstract

We present the Zen toolkit for morphological and phonological processing of natural languages. This toolkit is presented in literate programming style, in the Pidgin ML subset of the Objective Caml functional programming language. This toolkit is based on a systematic representation of finite state automata and transducers as decorated lexical trees. All operations on the state space data structures use the zipper technology, and a uniform sharing functor permits systematic maximum sharing as dags. A particular case of lexical maps is specially convenient for building invertible morphological operations such as inflected forms dictionaries, using a notion of differential word. As a particular application, we describe a general method for tagging a natural language text given as a phoneme stream by analysing possible euphonic liaisons between words belonging to a lexicon of inflected forms. The method uses the toolkit methodology by constructing a non-deterministic transducer, implementing rational rewrite rules, by mechanical decoration of a trie representation of the lexicon index. The algorithm is linear in the size of the lexicon. A coroutine interpreter is given, and its correctness and completeness are formally proved. An application to the segmentation of Sanskrit by sandhi analysis is demonstrated.

Full Text