Review: Hundt, Nesselhauf and Biewer (eds, 2006) Corpus Linguistics and the Web. Amsterdam/New York: Rodopi

Marina Santini

doi:10.3366/e1749503209000318

Abstract

Corpus Linguistics and the Web is an edited collection that brings together articles based on papers presented at the symposium ‘Corpus Linguistics – Perspectives for the Future’, held in Heidelberg in 2004, and articles commissioned from leading scholars (p. 4). The book is a comprehensive, insightful and well-structured compendium of the advantages and disadvantages of using web data for linguistic description and corpus compilation. Its main message, taken as a whole, is that traditional corpora and web data can complement each other. The book is a good resource for corpus linguists who find traditional corpora too small, or not sufficiently representative, for their research. It can also be useful for computational linguists and information scientists who are interested in linguistic and textual features.
The book begins with a short introduction by the three editors, Hundt, Nesselhauf and Biewer, that summarises the main issues and perspectives. The volume is divided into four parts, each containing a variable number of articles. The first part, ‘Accessing the web as corpus’, describes the benefits and pitfalls of using data from the web. On the one hand, shortcomings such as the impossibility of replication and the absence of meta-data (Lüdeling et al.; Fletcher) must be kept in mind when assessing findings from web data. On the other hand, the richness and freshness of web material (Fletcher; Renouf et al.) seem to outweigh these shortcomings and encourage the development of web-as-corpus applications and similar initiatives (e.g., WaCky, WebKWIC and WebCorp). One major drawback of the web-as-corpus approach, however, is its reliance on commercial search engines (like Google), which have only a rough linguistic sensibility. These engines decide which pages are relevant to a search using opaque criteria, so the results they return require tedious refinement.
The second part, ‘Compiling corpora from the internet’, focusses on the construction of corpora from the web – an unrivalled textual reservoir in terms of size and new genres or registers. For instance, Hoffmann takes advantage of the plenitude of publicly available CNN transcripts in order to create a specialised corpus of spoken English. Similarly, Claridge builds a corpus of public message board postings and examines how interaction and stance markers are distributed in this genre of computer-mediated
