Abstract
In this paper, we present a novel unsupervised algorithm for word sense disambiguation (WSD) at the document level. Our algorithm is inspired by a widely used approach in the field of genetics for whole genome sequencing, known as the Shotgun sequencing technique. The proposed WSD algorithm is based on three main steps. First, a brute-force WSD algorithm is applied to short context windows (up to 10 words) selected from the document in order to generate a short list of likely sense configurations for each window. Second, these local sense configurations are assembled into longer composite configurations based on suffix and prefix matching. Third, the resulting configurations are ranked by their length, and the sense of each word is chosen based on a voting scheme that considers only the top k configurations in which the word appears. We compare our algorithm with other state-of-the-art unsupervised WSD algorithms and demonstrate better performance, sometimes by a very large margin. We also show that our algorithm can yield better performance than the Most Common Sense (MCS) baseline on one data set. Moreover, our algorithm has a very small number of parameters, is robust to parameter tuning, and, unlike other bio-inspired methods, it gives a deterministic solution (it does not involve random choices).
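The three steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy sense inventory `SENSES`, the topic-based `score` function (a stand-in for a WordNet-based relatedness measure), the function names, and the tie-break by score when two configurations have equal length are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations, product

# Hypothetical toy inventory; the paper draws senses from WordNet.
SENSES = {
    "bank": ["bank.finance", "bank.river"],
    "deposit": ["deposit.money", "deposit.sediment"],
    "money": ["money.cash"],
    "loan": ["loan.credit"],
}
TOPIC = {
    "bank.finance": "fin", "bank.river": "geo",
    "deposit.money": "fin", "deposit.sediment": "geo",
    "money.cash": "fin", "loan.credit": "fin",
}

def score(senses):
    """Toy relatedness (assumption): count sense pairs sharing a topic tag."""
    return sum(TOPIC[a] == TOPIC[b] for a, b in combinations(senses, 2))

def local_configs(doc, n, c):
    """Step 1: brute-force every window of n words, keep the c best."""
    configs = []
    for start in range(len(doc) - n + 1):
        candidates = product(*(SENSES[w] for w in doc[start:start + n]))
        best = sorted(candidates, key=score, reverse=True)[:c]
        configs.extend((start, senses) for senses in best)
    return configs

def assemble(configs):
    """Step 2: merge configurations whose suffix matches another's prefix."""
    known = set(configs)
    grew = True
    while grew:
        grew = False
        for (s1, a), (s2, b) in product(sorted(known), repeat=2):
            overlap = s1 + len(a) - s2  # number of shared word positions
            if s1 < s2 and 0 < overlap < len(b) and a[-overlap:] == b[:overlap]:
                merged = (s1, a + b[overlap:])
                if merged not in known:
                    known.add(merged)
                    grew = True
    return known

def vote(doc, configs, k):
    """Step 3: rank by length, then let the top k covering configs vote."""
    # Ties in length are broken by score (an assumption of this sketch).
    ranked = sorted(configs, key=lambda c: (-len(c[1]), -score(c[1]), c))
    result = []
    for pos in range(len(doc)):
        covering = [(s, senses) for s, senses in ranked
                    if s <= pos < s + len(senses)]
        votes = Counter(senses[pos - s] for s, senses in covering[:k])
        result.append(votes.most_common(1)[0][0])
    return result

def shotgun_wsd(doc, n=3, c=2, k=3):
    n = min(n, len(doc))  # make sure at least one window covers every word
    return vote(doc, assemble(local_configs(doc, n, c)), k)

print(shotgun_wsd(["bank", "deposit", "money", "loan"]))
# → ['bank.finance', 'deposit.money', 'money.cash', 'loan.credit']
```

Every step uses a fully ordered sort and no random choices, so the sketch returns the same answer on every run, in line with the deterministic behavior the abstract claims. The window length `n`, the number of configurations kept per window `c`, and the number of voters `k` are the only parameters.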
Highlights
Word Sense Disambiguation (WSD), the task of identifying which sense of a word is used in a given context, is a core NLP problem, having the potential to improve many applications such as machine translation (Carpuat and Wu, 2007), text summarization (Plaza et al, 2011), information retrieval (Chifu and Ionescu, 2012; Chifu et al, 2014) or sentiment analysis (Sumanth and Inkpen, 2015)
We compare them with the Most Common Sense (MCS) baseline which is based on human annotations
By using sense embeddings in a completely different way than Bhingardive et al (2015), we are able to report an F1 score of 59.82%, which is much closer to the MCS baseline (62.30%)
Summary
Word Sense Disambiguation (WSD), the task of identifying which sense of a word is used in a given context, is a core NLP problem, having the potential to improve many applications such as machine translation (Carpuat and Wu, 2007), text summarization (Plaza et al, 2011), information retrieval (Chifu and Ionescu, 2012; Chifu et al, 2014) or sentiment analysis (Sumanth and Inkpen, 2015). Most of the existing WSD algorithms (Agirre and Edmonds, 2006; Navigli, 2009) are commonly classified into supervised, unsupervised, and knowledge-based techniques, but hybrid approaches have also been proposed in the literature (Hristea et al, 2008). The main disadvantage of supervised methods, which have led to the best disambiguation results, is that they require a large amount of annotated data, which is difficult to obtain. We introduce a novel WSD algorithm, termed ShotgunWSD, that stems from the Shotgun genome sequencing technique (Anderson, 1981; Istrail et al, 2004). Our WSD algorithm is unsupervised, but it also requires knowledge in the form of WordNet (Miller, 1995; Fellbaum, 1998) synsets and relations, so it can be regarded as a hybrid approach.