In this paper we discuss a persistent problem arising from polysemy: the difficulty of finding consistent criteria for making fine-grained sense distinctions, either manually or automatically. We investigate the sources of human annotator disagreement in the tagging of the English Verb Lexical Sample Task for the SENSEVAL-2 exercise in automatic Word Sense Disambiguation. We also examine errors made by a high-performing maximum entropy Word Sense Disambiguation system we developed. Both sets of errors are at least partially reconciled by a more coarse-grained view of the senses, and we present the sense groupings we use for quantitative coarse-grained evaluation as well as the process by which they were created. We compare the system's performance with that of our human annotators in light of both fine-grained and coarse-grained sense distinctions, and we show that well-defined sense groups can improve word sense disambiguation by both humans and machines.
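The coarse-grained evaluation mentioned above can be illustrated with a minimal sketch: two labels count as a match when they fall into the same sense group, so annotator disagreements or system errors that stay within a group are reconciled. The sense IDs, group names, and the `accuracy` helper below are hypothetical, not the paper's actual data or code.

```python
# Hypothetical mapping from fine-grained senses to coarse sense groups.
SENSE_GROUPS = {
    "call.01": "GROUP_A",  # e.g. "name/label" senses
    "call.02": "GROUP_A",
    "call.03": "GROUP_B",  # e.g. "telephone" senses
    "call.04": "GROUP_B",
}

def accuracy(gold, predicted, groups=None):
    """Fine-grained accuracy when groups is None; otherwise coarse-grained,
    scoring a prediction correct if it shares the gold label's group."""
    def key(sense):
        return groups[sense] if groups else sense
    matches = sum(key(g) == key(p) for g, p in zip(gold, predicted))
    return matches / len(gold)

gold = ["call.01", "call.03", "call.02", "call.04"]
pred = ["call.02", "call.03", "call.01", "call.03"]

print(f"fine-grained:   {accuracy(gold, pred):.2f}")                # 0.25
print(f"coarse-grained: {accuracy(gold, pred, SENSE_GROUPS):.2f}")  # 1.00
```

Under this scoring, confusions between closely related fine-grained senses no longer count as errors, which is how grouping can raise measured agreement for both human annotators and systems.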