Lexical collocations have particular statistical distributions. We have developed a set of statistical techniques for retrieving and identifying collocations from large textual corpora. The techniques we developed are able to identify collocations of arbitrary length as well as flexible collocations. These techniques have been implemented in a lexicographic tool, Xtract, which is able to automatically acquire collocations with high retrieval performance. Xtract works in three stages. The first stage is based on a statistical technique for identifying word pairs involved in a syntactic relation. The words can appear in the text in any order and can be separated by an arbitrary number of other words. The second stage is based on a technique to extract n-word collocations (or n-grams) in a much simpler way than related methods. These collocations can involve closed class words such as particles and prepositions. A third stage is then applied to the output of stage one and applies parsing techniques to sentences involving a given word pair in order to identify the proper syntactic relation between the two words. A secondary effect of the third stage is to filter out a number of candidate collocations as irrelevant and thus produce higher quality output. In this paper we present an overview of Xtract and we describe several uses for Xtract and the knowledge it retrieves such as language generation and machine translation.
Read full abstract