A Document Similarity Computation Method Based on Word Embedding and Citation Analysis

K Lamiya, Anuraj Mohan

doi:10.1007/978-981-10-8633-5_17

Abstract

Document similarity is one of the most significant problems in knowledge discovery and information retrieval. Most work on document similarity focuses only on the textual content of the documents, and similarity measures built on content alone are often inaccurate. An alternative is to incorporate citation information into the similarity measure. The key idea behind this alternative is that the content of a document can be enriched by considering the content of the documents it cites. In this work, citation network analysis is used to expand the content of a citing document with the information given in its cited documents. The next issue is the representation of documents. A commonly used representation is the bag-of-words model, but it captures neither the meaning of the text nor the ordering of the words. Hence, the proposed work uses a word embedding representation, in which each word is represented as a dense, low-dimensional vector. The word2vec model is used to generate the word embeddings, which capture contextual similarity between words. The similarity between documents is measured using the word mover's distance, which is based on the word embedding representation of words. The proposed work thus takes advantage of both textual similarity and contextual similarity. Experiments showed that the proposed method provides better results than other state-of-the-art methods.
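
The pipeline described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are hand-made 2-D toy vectors standing in for word2vec output, the function names (expand_with_citations, nbow, relaxed_wmd) are hypothetical, and the distance is the relaxed word mover's distance, a cheap lower bound, rather than the exact word mover's distance, which requires solving an optimal transport problem.

```python
import math
from collections import Counter

# Toy 2-D embeddings invented for illustration only.
# A real system would load vectors trained with word2vec.
EMB = {
    "graph": (1.0, 0.0),
    "network": (0.9, 0.1),
    "citation": (0.8, 0.3),
    "vector": (0.0, 1.0),
    "embedding": (0.1, 0.9),
}


def expand_with_citations(doc_id, texts, cites):
    """Return the citing document's tokens followed by the tokens of each cited document."""
    tokens = list(texts[doc_id])
    for cited in cites.get(doc_id, []):
        tokens.extend(texts[cited])
    return tokens


def nbow(tokens):
    """Normalized bag-of-words weights over the tokens that have an embedding."""
    counts = Counter(t for t in tokens if t in EMB)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("document has no words with a known embedding")
    return {word: count / total for word, count in counts.items()}


def relaxed_wmd(tokens_a, tokens_b):
    """Relaxed word mover's distance (assumption: used here instead of exact WMD).

    Each word sends all of its weight to the nearest word of the other
    document; the larger of the two directions is a lower bound on exact WMD.
    """
    a, b = nbow(tokens_a), nbow(tokens_b)

    def one_way(src, dst):
        return sum(
            weight * min(math.dist(EMB[s], EMB[d]) for d in dst)
            for s, weight in src.items()
        )

    return max(one_way(a, b), one_way(b, a))


# Hypothetical corpus: document A cites document C.
texts = {"A": ["graph"], "B": ["vector"], "C": ["embedding"]}
cites = {"A": ["C"]}

plain = relaxed_wmd(texts["A"], texts["B"])
expanded = relaxed_wmd(expand_with_citations("A", texts, cites), texts["B"])
print(plain, expanded)
```

In this toy corpus, expanding A with the document it cites moves it closer to B, because the cited text contributes a word that lies near B's vocabulary in the embedding space. Simple token concatenation is only one possible way to expand a citing document; the abstract does not specify how the expansion is weighted.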

Full Text