Survey of Extraction Methods

Violeta Seretan

doi:10.1007/978-94-007-0134-2_3

Abstract

In this chapter, we discuss the foundations of computational methods of collocation extraction from text corpora. First, we analyse the extent to which the collocation features stipulated by theoretical studies are taken into account in practice. Then, we introduce the basic concepts of statistical modelling of collocations as significant word associations, and describe the association measures typically used in the field. We then survey the role played by lemmatizers, POS taggers, chunkers and syntactic parsers in preprocessing source corpora with the aim of improving extraction performance. The rest of the chapter contains a thorough review of the state of the art in collocation extraction, with relevant details about the linguistic preprocessing performed in existing extraction work. Although efficient syntactic parsers are now increasingly available, this preprocessing is currently limited in most cases to shallow methods such as parsing based on pattern matching over POS tags. As many researchers have pointed out, however, these methods are inadequate for languages exhibiting a freer word order and richer morphology than English. With this in mind, we will argue that successful collocation extraction across languages requires a more elaborate structural analysis which can only be provided by deep parsing.

Full Text