Abstract

Extensible Dependency Grammar (XDG) is a new, modular grammar formalism for natural language. An XDG analysis is a multi-dimensional dependency graph, where each dimension represents a different aspect of natural language, e.g. syntactic function, predicate-argument structure, information structure etc. Thus, XDG brings together two recent trends in computational linguistics: the increased application of ideas from dependency grammar, and the idea of multi-layered linguistic description. In this paper, we tackle one of the stumbling blocks of XDG so far: its incomplete formalization. We present the first complete formalization of XDG, as a description language for multigraphs based on simply typed lambda calculus.

Introduction

Extensible Dependency Grammar (XDG) (Debusmann et al. 2004) brings together two recent trends from computational linguistics:

1. dependency grammar
2. multi-layered linguistic description

Firstly, the ideas of dependency grammar (lexicalization, the head-dependent asymmetry, valency etc.) have become more and more popular in computational linguistics. Most of the popular grammar formalisms, such as Combinatory Categorial Grammar (CCG) (Steedman 2000), Head-driven Phrase Structure Grammar (HPSG) (Pollard & Sag 1994), Lexical Functional Grammar (LFG) (Bresnan 2001) and Tree Adjoining Grammar (TAG) (Joshi 1987), have already adopted these ideas. Moreover, the most successful approaches to statistical parsing crucially depend on notions from dependency grammar (Collins 1999), and new treebanks based on dependency grammar are being developed for various languages, e.g. the Prague Dependency Treebank (PDT) for Czech and the TiGer Dependency Bank for German.

Secondly, many treebanks, such as the Penn Treebank, the TiGer Treebank and the PDT, are continuously being extended with additional layers of annotation beyond the syntactic layer, i.e. they become more and more multi-layered. For example, the PropBank (Kingsbury & Palmer 2002) (Penn Treebank), the SALSA project (Erk et al. 2003) (TiGer Treebank) and the tectogrammatical layer (PDT) add a layer of predicate-argument structure. Other added layers concern information structure (PDT) and discourse structure, as in the Penn Discourse Treebank (Webber et al. 2005). These additional layers of annotation are often dependency-like, i.e. they could be represented straightforwardly in a multi-layered framework for dependency grammar.

XDG is such a framework. It has already been successfully applied to model a relational syntax-semantics interface (Debusmann et al. 2004) and to model the relation between prosodic structure and information structure in English (Debusmann, Postolache, & Traat 2005). We hope soon to be able to employ XDG to make direct use of the information contained in the new multi-layered treebanks, e.g. for the automatic induction of multi-layered grammars for parsing and generation.

To achieve this goal, XDG still needs to overcome a number of weaknesses. The first is the lack of a polynomial parsing algorithm: so far, we only have a parser based on constraint programming (Debusmann, Duchier, & Niehren 2004), which is fairly efficient given that the parsing problem is NP-hard, but which does not scale up to large-scale grammars. The second major stumbling block of XDG so far is the lack of a complete formalization.
The latter is what we will change in this paper: we will present a formalization of XDG as a description language for multigraphs based on simply typed lambda calculus (Church 1940; Andrews 2002). To give a hint of the expressivity of XDG, we additionally present a proof that the parsing problem of (unrestricted) XDG is NP-hard. We begin the paper by introducing the notion of multigraphs.

Multigraphs

Multigraphs are motivated by dependency grammar, and in particular by its structures: dependency graphs.

Dependency Graphs

Dependency graphs such as the one in Figure 1 typically represent the syntactic structure of sentences in natural language. They have the following properties:

1. Each node (round circle) is associated with a word (today, Peter, wants etc.), which is connected to the corresponding node by a dotted vertical line called a projection edge,
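Although the paper's actual formalization is given in simply typed lambda calculus, the multigraph structure described here can be made concrete with a small data-structure sketch. The following Python snippet is purely illustrative and not part of the paper: the names (Multigraph, add_edge) and the edge labels are assumptions. It models a single shared sequence of word-associated nodes (the word-to-node association plays the role of the projection edges) together with one set of labeled dependency edges per dimension.

    # Illustrative sketch of a multigraph: shared word-associated nodes,
    # one labeled edge set per dimension (e.g. syntax, predicate-argument
    # structure). Names and labels are assumptions, not from the paper.
    from dataclasses import dataclass, field


    @dataclass
    class Multigraph:
        words: list[str]  # node i is associated with words[i]
        # dimension name -> set of labeled edges (head, dependent, label)
        dimensions: dict[str, set[tuple[int, int, str]]] = field(default_factory=dict)

        def add_edge(self, dim: str, head: int, dep: int, label: str) -> None:
            """Add a labeled dependency edge on the given dimension."""
            self.dimensions.setdefault(dim, set()).add((head, dep, label))


    # The words from the running example: nodes 0, 1, 2.
    mg = Multigraph(words=["today", "Peter", "wants"])
    # Syntactic dimension: 'wants' governs 'Peter' (hypothetical labels).
    mg.add_edge("syntax", head=2, dep=1, label="subj")
    # Predicate-argument dimension over the very same nodes.
    mg.add_edge("pred-arg", head=2, dep=1, label="arg1")
    print(mg.dimensions)

The design point this is meant to mirror is that all dimensions of an analysis share the same nodes and words; only the labeled edges differ from dimension to dimension.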
