Abstract

Document distance is a fundamental research topic in the information retrieval community, and its accuracy largely determines the performance of many text retrieval applications. Beyond the Bag-of-Words (BoW) model, the Word Mover's Distance (WMD) defines the distance between documents semantically, as the minimum cost (measured by the similarities of word embeddings) required to transport the words of one document to another, and it has been shown to be superior in k-nearest neighbor classification. In this article, we thoroughly study the characteristics of WMD and its relaxed versions, e.g., Relaxed WMD (RWMD) and Iterative Constrained Transfers (ICT), in various scenarios. Specifically, we concentrate on the problem of negative word similarity: the WMD family leverages all word similarities, but most of them are meaningless and therefore harm the measurement of document distances. To remedy this problem, we propose the Informative Similarity Filter (ISF), which retains a very small percentage of the top word similarities and sets all the others to the same lower similarity. Building on ISF, we propose a greedy optimization (GOM) for WMD, which provides an accurate approximation to WMD. Our theoretical analysis indicates that ISF-GOM is more applicable to relatively longer documents. Extensive experiments validate: 1) the problem of RWMD; 2) the effectiveness of ISF-GOM; and 3) the consistency of our analysis of ISF-GOM. Our code and datasets are available at https://github.com/BoCheng-96/ISF-GOM.
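
As a rough illustration of two ideas named in the abstract, the Python sketch below computes the standard RWMD lower bound from a word-to-word cost matrix and applies a similarity filter in the spirit of ISF. The function names `rwmd` and `filter_similarities`, the parameters `keep_fraction` and `floor`, and the choice of cost as one minus similarity are assumptions made for illustration; this is not the authors' released implementation, and it does not include the GOM step.

```python
import numpy as np


def rwmd(weights_a, weights_b, cost):
    """Relaxed WMD: drop one marginal constraint at a time, so every word
    sends all of its mass to its cheapest counterpart; take the larger bound."""
    a_to_b = float(np.dot(weights_a, cost.min(axis=1)))
    b_to_a = float(np.dot(weights_b, cost.min(axis=0)))
    return max(a_to_b, b_to_a)


def filter_similarities(sim, keep_fraction=0.05, floor=None):
    """Hypothetical ISF-style filter: keep roughly the top `keep_fraction`
    of similarities and set every other entry to one shared lower value.
    Ties at the threshold are all kept."""
    flat = np.sort(sim.ravel())[::-1]            # similarities, descending
    k = max(1, int(round(keep_fraction * flat.size)))
    threshold = flat[k - 1]
    if floor is None:
        floor = float(sim.min())                 # assumed default, not from the paper
    return np.where(sim >= threshold, sim, floor)


# Toy example: 2 words in document A, 3 words in document B.
sim = np.array([[0.9, 0.1, -0.2],
                [0.0, 0.8, 0.3]])
weights_a = np.array([0.5, 0.5])                 # normalized word frequencies
weights_b = np.array([0.25, 0.25, 0.5])

filtered = filter_similarities(sim, keep_fraction=0.5)
cost = 1.0 - filtered                            # assumed similarity-to-cost mapping
distance = rwmd(weights_a, weights_b, cost)
```

Here the filter is applied to the whole similarity matrix before it is turned into costs, so the uninformative pairs all share one cost and cannot influence which counterpart is cheapest. How the paper chooses the retained percentage and the lower value is not stated in the abstract.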


