Eliminating Negative Word Similarities for Measuring Document Distances: A Thoroughly Empirical Study on Word Mover's Distance.

Bo Cheng, Ximing Li, Yi Chang

doi:10.1109/tnnls.2022.3222336

Abstract

Document distance is a fundamental research topic in the information retrieval community, and its accuracy dominates the performance of many text retrieval applications. Beyond the Bag-of-Words (BoW) model, the Word Mover's Distance (WMD) semantically defines the distance between documents as the minimum cost (measured by word similarities of embeddings) required to transport the words from one document to another, and it has been proven to be superior in k-nearest neighbor classification. In this article, we thoroughly study the characteristics of WMD and its relaxed versions, e.g., Relaxed WMD (RWMD) and Iterative Constrained Transfers (ICT), in various scenarios. Specifically, we concentrate on the problem of negative word similarity: the WMD family leverages all word similarities, but most of them are meaningless and therefore harm the measurement of document distances. To remedy this problem, we propose the Informative Similarity Filter (ISF), which retains a very small percentage of the top word similarities and fixes the others to the same lower similarity. Built on it, we propose a greedy optimization method (GOM) for WMD, an accurate approximation to WMD. We theoretically show that ISF-GOM is more applicable to relatively longer documents. Extensive experiments have been conducted to validate: 1) the problem of RWMD; 2) the effectiveness of ISF-GOM; and 3) the consistency of our analysis of ISF-GOM. Our code and datasets are available at https://github.com/BoCheng-96/ISF-GOM.
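
As an illustration of the filtering idea described above, the following minimal Python sketch keeps only the largest word similarities and sets all others to one shared lower value, then evaluates the standard RWMD lower bound on the resulting cost matrix. It is not the authors' implementation (see their repository for that), and GOM is not shown. The function names, the `keep_fraction` and `floor` parameters, their default values, and the choice of cost = 1 - similarity are all assumptions made for this example.

```python
import numpy as np


def cosine_similarity_matrix(X, Y):
    """Cosine similarities between the rows of X (n x d) and Y (m x d)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T


def informative_similarity_filter(S, keep_fraction=0.1, floor=0.0):
    """Keep the top `keep_fraction` of entries of S; set the rest to `floor`.

    `keep_fraction` and `floor` are hypothetical parameters for this sketch.
    Entries tied with the threshold are all kept, so slightly more than the
    requested fraction may survive.
    """
    k = max(1, int(round(keep_fraction * S.size)))
    # The k-th largest value of S serves as the retention threshold.
    threshold = np.partition(S.ravel(), S.size - k)[S.size - k]
    return np.where(S >= threshold, S, floor)


def relaxed_wmd(cost, a, b):
    """RWMD: the larger of the two one-sided relaxations of WMD.

    `a` and `b` are the normalized word weights of the two documents. In each
    relaxation, every word sends all of its mass to its cheapest counterpart.
    """
    forward = float(a @ cost.min(axis=1))
    backward = float(b @ cost.min(axis=0))
    return max(forward, backward)


# Toy similarity matrix between two 2-word documents (made-up numbers).
S_demo = np.array([[0.9, 0.1],
                   [-0.3, 0.8]])
S_filtered = informative_similarity_filter(S_demo, keep_fraction=0.5, floor=0.0)
cost_demo = 1.0 - S_filtered          # assumed conversion from similarity to cost
weights = np.array([0.5, 0.5])        # uniform word weights
distance_demo = relaxed_wmd(cost_demo, weights, weights)
```

In the toy matrix, the negative entry and the weak positive entry are both mapped to the same floor value, so only the two strong similarities influence the distance.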

Full Text