Abstract

Lexical collocations are frequently occurring word pairs in natural language whose presence is not always predictable from usage. Native speakers of a language produce these collocations almost without thought, yet non-native speakers must learn them explicitly: a native speaker of English drinks strong coffee, while a non-native speaker may say either *powerful coffee or *sturdy coffee (the asterisk marks an unnatural pairing). Collocations also vary across languages and topic domains. Unfortunately, correctly identifying lexical collocations has been shown to be very difficult, even for native speakers.

Computer systems that translate natural languages, or Machine Translation (MT) systems, need lexical collocation information in order to produce natural-sounding, colloquially proper text. Natural Language Generation (NLG) is the component of an MT system that automatically produces natural-sounding text in a particular target language from a language-independent meaning representation. This dissertation demonstrates how to automatically locate and extract lexical collocations from machine-readable text for use within an MT system's NLG component.

A combined lexical-semantic and statistical approach is adopted for the location and extraction of lexical collocations. A computational definition of lexical collocations is provided, according to which: (1) they occur as adjacent word pairs; (2) they occur more often than would be expected by chance; and (3) neither of their words may be substituted by a synonym or hyponym. Potential collocations matching certain adjacent part-of-speech tag patterns are extracted from text. An on-line thesaurus and a lexical database of word classes are then queried for synonyms and hyponyms, respectively, for each potential collocation. These queries produce potential challenger pairs, such as strong java and powerful coffee. A substitution procedure then determines whether any of these challenger pairs occurs more frequently than the potential collocation.

The VERIFY lexical collocation extraction system implements these ideas. Results to date have been positive: a system using lexical-semantic knowledge, i.e., synonymy and hyponymy, outperforms one using purely statistical knowledge. To compare system output to human judgments of training data, a training component was also incorporated into VERIFY. This component adapts to new data, and overall system performance, measured by Recall and Precision scores, was shown to improve with its use. To make the system more flexible for a given user's application, a weighting mechanism was used to produce a range of Recall and Precision scores; these weights can be adjusted to optimize system performance.

The use of lexical-semantic knowledge has advanced the state of the art in lexical collocation extraction beyond traditional statistical approaches. Incorporating a training component lets an extraction system adapt to changes in the data, and controlling overall system performance through a weighting mechanism gives the user of an extraction system added flexibility. Finally, in an experiment comparing VERIFY's performance to human performance on a particular set of data, VERIFY outperformed humans in both Recall and Precision.
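The substitution test at the heart of this approach can be made concrete. The following is a minimal sketch, not the VERIFY implementation itself: it assumes a tokenized corpus, uses NLTK's WordNet interface as a stand-in for the on-line thesaurus and lexical database named in the abstract, and the helper names (bigram_counts, alternatives, passes_substitution_test) are hypothetical.

```python
# Minimal sketch of the synonym/hyponym substitution test (not the VERIFY
# implementation). Requires: nltk.download('wordnet')
from collections import Counter
from nltk.corpus import wordnet as wn

def bigram_counts(tokens):
    """Count adjacent word pairs in a tokenized corpus."""
    return Counter(zip(tokens, tokens[1:]))

def alternatives(word):
    """Gather WordNet synonyms and hyponyms of `word` to build challengers."""
    alts = set()
    for synset in wn.synsets(word):
        alts.update(lemma.name() for lemma in synset.lemmas())    # synonyms
        for hypo in synset.hyponyms():
            alts.update(lemma.name() for lemma in hypo.lemmas())  # hyponyms
    alts.discard(word)
    return alts

def passes_substitution_test(pair, counts):
    """A potential collocation survives if no challenger pair, formed by
    substituting a synonym or hyponym for either word, occurs more often."""
    w1, w2 = pair
    base = counts[pair]
    challengers = [(alt, w2) for alt in alternatives(w1)]
    challengers += [(w1, alt) for alt in alternatives(w2)]
    return all(counts[ch] <= base for ch in challengers)
```

On the abstract's example, challengers for the pair (strong, coffee) would include (powerful, coffee) and (strong, java); the pair survives only if none of those occurs more frequently in the corpus. A full system would add the definition's other two criteria: restricting candidates to particular adjacent part-of-speech tag patterns, and requiring that the pair occur more often than chance would predict.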
