Reviewed by: Philip Resnik

Treebanks: Building and using parsed corpora. Ed. by Anne Abeillé. Dordrecht: Kluwer, 2003. Pp. 440. ISBN 1402013353. $74.95.

Annotated corpora have been the fuel for a number of recent advances in the study of language, notably, although not exclusively, in computational linguistics. At the sentence level, corpus annotations range from shallow levels of linguistic representation, such as part-of-speech categories (Francis & Kučera 1982, Leech et al. 1994) or named entities (Strassel et al. 2003), through intermediate levels such as argument structure (Meyers et al. 2004, Palmer et al. 2005), to deeper levels of semantic representation such as semantic roles (Baker et al. 1998), word senses (Landes et al. 1998), events and temporal relations (Pustejovsky et al. 2003), or language-independent meaning representations (Farwell et al. 2004). When it comes to annotating sentences with linguistic representation, the sky is the limit (Meyers 2005). The ‘sweet spot’ in this range of annotations is occupied by treebanks, which is to say parsed (syntactically annotated) corpora. Unlike shallower annotations, syntactic parses capture hierarchical organization, a fundamental notion in virtually any modern theory of sentence structure. But unlike most deeper representations, syntactic parses can be created with high levels of inter-annotator reliability (though see Hovy et al. 2006 and references therein for recent progress in semantic treebanking). With respect to natural language processing applications, the impact of treebanks during the last decade has been remarkable, more than validating Marcus and colleagues’ (1993) premise that ‘significant, rapid progress can be made … by investigating those phenomena that occur most centrally in naturally occurring unconstrained materials and by attempting to automatically extract information about language from very large corpora’ (313).
Beyond applications, syntactically annotated corpora have also begun to play an increasingly productive role in psycholinguistics, theoretical syntax, and language pedagogy (e.g. Corley et al. 2001, Jurafsky 2002, Dillon 2005, Meurers 2005, Resnik et al. 2005). In Treebanks: Building and using parsed corpora, Anne Abeillé draws together a collection of fifteen short pieces focused primarily on the issues that come up in creating treebanks, demonstrated across an impressive variety of languages, along with six chapters on how treebanks are used. Although twenty-one chapters cannot be covered in detail in a short space, I present a brief walk through the chapters, followed by a discussion of the book as a whole. Abeillé’s introduction offers a very concise but clearly written primer on the main issues that come up in choosing representations, annotating corpora, and using treebanks in applications, folding in the obligatory pointers to the chapters that follow. In ‘The Penn Treebank: An overview’, Anne Taylor, Mitchell Marcus, and Beatrice Santorini provide a short, accessible description of the widely used Penn Treebank, extracting and updating the seminal article by Marcus and colleagues (1993) (which still remains the definitive source for in-depth discussion). In ‘Thoughts on two decades of drawing trees’, Geoffrey Sampson offers an engaging, personal discussion that combines elements of corpus description, linguistic analysis, and position paper. Unlike other chapters in the book, Sampson’s chapter offers a high-level look at corpus linguistics as an engineering and scientific discipline, and, contrary to some treebanking work, suggests that corpus annotation should make detail, accuracy, and explicitness a higher priority than the number of sentences annotated.
Among the next thirteen chapters, nine provide detailed discussions of treebanking projects for specific languages, following a pattern that generally includes: (1) the goals and historical context of the project, (2) the selection of data to annotate (usually news text), (3) the annotation process (usually automatic analysis followed by human correction using project-specific tools), (4) details of representation (admitting wide variety, but usually an elaboration on syntactic constituency or grammatical dependency representation, with additional features to address language-specific issues), (5) tools used for manual creation and correction of annotations (again admitting very wide variety), (6) a set of detailed, language-specific annotation choices that illustrate interesting and challenging aspects of the language under consideration, (7) a description of the project’s status, and (8) a brief evaluation and...