Abstract

This document describes a project exploring the use of Latent Semantic Analysis (LSA) and statistical clustering techniques to identify word senses automatically and to estimate word sense frequencies from application-relevant corpora. The hypothesis is that LSA can be used to compute context vectors for ambiguous words, and that these vectors can be clustered so that each cluster corresponds to a different sense of the word. The document is organized as follows. Section 1 gives a short introduction to LSA, introduces the context-group discrimination paradigm adopted in the project, and describes the corpus used in the experiments. Section 2 describes the investigation of the effect of LSA dimensionality on sense discrimination accuracy. Overall, sense discrimination accuracy was relatively low, which motivated four further investigations: the influence of different distance measures; the geometry of the sense clusters in the LSA-based space, examined through silhouette value analysis; sense discrimination accuracy as a function of the degree of supervision provided during model training; and a comparison of sense discrimination in homonyms versus polysemes. Section 3 describes the investigation of the optimal context size for word sense discrimination, from 3 words (1 word on each side of the target word) to 11 words (5 words on each side). Section 4 describes the use of Minimum Description Length (MDL) to determine the number of word senses. Section 5 provides a project summary. Appendix A provides a literature review, and Appendix B provides a source code listing (not included in this published report).
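The context sizes in Section 3 follow from a symmetric window: k words on each side of the target gives 2k + 1 words, so k = 1 to 5 yields windows of 3 to 11 words. The sketch below illustrates that arithmetic only. The function names, the example sentence, and the choice to truncate windows at text boundaries are assumptions made for illustration and are not taken from the report.

```python
def context_window(tokens, index, half_width):
    """Return up to 2*half_width + 1 tokens centred on tokens[index].

    Windows are truncated at the start or end of the text (an assumption;
    the report does not say how boundaries were handled).
    """
    start = max(0, index - half_width)
    end = min(len(tokens), index + half_width + 1)
    return tokens[start:end]


def contexts_for(tokens, target, half_width):
    """Collect one context window per occurrence of the target word."""
    return [
        context_window(tokens, i, half_width)
        for i, tok in enumerate(tokens)
        if tok == target
    ]


# Hypothetical example sentence containing the ambiguous word "bank".
tokens = "after lunch she sat on the bank of the river and watched the water".split()

# half_width k = 1..5 gives nominal window sizes 3, 5, 7, 9, 11.
windows = {2 * k + 1: contexts_for(tokens, "bank", k)[0] for k in range(1, 6)}

for size, window in windows.items():
    print(size, " ".join(window))
```

In a context-group discrimination setup, each such window would then be mapped to a context vector and the vectors for one ambiguous word clustered; that step is omitted here.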

