Abstract
Document distance is a fundamental and important research topic in the information retrieval community, and its accuracy largely determines the performance of many text retrieval applications. Going beyond the Bag-of-Words (BoW) model, the Word Mover's Distance (WMD) defines the distance between two documents semantically, as the minimum cost (measured by the similarities of word embeddings) required to transport the words of one document to those of the other, and it has been shown to be superior in k-nearest neighbor classification. In this article, we thoroughly study the characteristics of WMD and its relaxed versions, e.g., Relaxed WMD (RWMD) and Iterative Constrained Transfers (ICT), in various scenarios. Specifically, we concentrate on the problem of negative word similarity: the WMD family leverages all word similarities, but most of them are meaningless and therefore harm the measurement of document distances. To remedy this problem, we propose the Informative Similarity Filter (ISF), which retains only a very small percentage of the top word similarities and fixes all the others to the same lower similarity. Building on ISF, we propose a greedy optimization (GOM) for WMD that serves as an accurate approximation to WMD. Our theoretical analysis indicates that ISF-GOM is better suited to relatively long documents. Extensive experiments have been conducted to validate: 1) the problem of RWMD; 2) the effectiveness of ISF-GOM; and 3) the consistency of our analysis of ISF-GOM. Our code and datasets are available at https://github.com/BoCheng-96/ISF-GOM.
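The filtering idea described above can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the repository linked above for that). It assumes cosine similarity between word embeddings, a transport cost of one minus similarity, and the standard two-sided RWMD lower bound. The function names, the `keep_fraction` and `floor` parameters, and the choice of the matrix minimum as the shared lower similarity are all hypothetical, chosen only for illustration.

```python
import numpy as np


def cosine_similarity_matrix(X, Y):
    """Pairwise cosine similarity between the rows of X and the rows of Y."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T


def filter_similarities(S, keep_fraction=0.1, floor=None):
    """Keep roughly the top `keep_fraction` of entries of S unchanged and
    set every other entry to one shared lower value.

    Assumption: `floor` defaults to the smallest similarity in S.
    Ties at the threshold are all kept, so slightly more entries may survive.
    """
    k = max(1, int(round(keep_fraction * S.size)))
    threshold = np.partition(S.ravel(), -k)[-k]  # k-th largest similarity
    if floor is None:
        floor = S.min()
    return np.where(S >= threshold, S, floor)


def relaxed_wmd(weights_a, weights_b, cost):
    """Two-sided RWMD: each word ships all of its mass to its cheapest
    counterpart, and the larger of the two one-sided bounds is returned."""
    a_to_b = float(weights_a @ cost.min(axis=1))
    b_to_a = float(weights_b @ cost.min(axis=0))
    return max(a_to_b, b_to_a)


# Toy example: two documents with two words each, 2-D embeddings.
doc_a = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_b = np.array([[1.0, 0.0], [1.0, 1.0]])
w_a = np.array([0.5, 0.5])  # normalized word frequencies of document A
w_b = np.array([0.5, 0.5])  # normalized word frequencies of document B

S = cosine_similarity_matrix(doc_a, doc_b)
S_filtered = filter_similarities(S, keep_fraction=0.25)

print("RWMD on all similarities:     ", relaxed_wmd(w_a, w_b, 1.0 - S))
print("RWMD on filtered similarities:", relaxed_wmd(w_a, w_b, 1.0 - S_filtered))
```

In this toy case the filter keeps only the exact word match and flattens the weaker similarities to a common value, so the second distance no longer benefits from the partial matches. How the retained percentage and the shared lower similarity are actually chosen in ISF is specified in the article, not here.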
More From: IEEE Transactions on Neural Networks and Learning Systems