Abstract

Document similarity is one of the most significant problems in knowledge discovery and information retrieval. Most work on document similarity focuses only on the textual content of the documents, and similarity measures built on text alone are often inaccurate. An alternative is to incorporate citation information into the similarity measure. The key idea behind this alternative is that the content of a document can be enriched by considering the content of the documents it cites. In this work, citation network analysis is used to expand the content of a citing document with the information given in its cited documents. The next issue is document representation. A commonly used representation is the bag-of-words model, but it captures neither the semantics of the text nor the ordering of the words. Hence, the proposed work uses a word embedding representation, in which each word is a dense, low-dimensional vector. The word2vec model is used to generate the word embeddings, which capture contextual similarity between words. The similarity between documents is then measured using word mover’s distance, which operates on the embedding representation of the words. The proposed work thus takes advantage of both textual similarity and contextual similarity. Experiments show that the proposed method gives better results than other state-of-the-art methods.
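The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes several assumptions. The two-dimensional embeddings are invented toy values standing in for word2vec vectors. The corpus, the citation map, and all function names are hypothetical. Citation expansion is assumed to mean appending the tokens of directly cited documents. The distance is the relaxed word mover's distance, a cheaper lower bound on the exact word mover's distance that avoids solving the full transport problem.

```python
import math
from collections import Counter

# Hypothetical 2-D embeddings for illustration only; a real system would
# load vectors trained with word2vec instead.
EMBEDDINGS = {
    "graph": (1.0, 0.0),
    "network": (0.9, 0.1),
    "citation": (0.8, 0.3),
    "text": (0.0, 1.0),
    "word": (0.1, 0.9),
}

# Hypothetical toy corpus: document id -> tokens, and citing id -> cited ids.
TEXTS = {
    "A": ["graph", "text"],
    "B": ["network"],
    "C": ["citation", "network"],
}
CITES = {"A": ["C"]}


def expand_with_citations(doc_id, texts, cites):
    """Return the document's tokens plus the tokens of each directly cited document."""
    tokens = list(texts[doc_id])
    for cited_id in cites.get(doc_id, []):
        tokens.extend(texts[cited_id])
    return tokens


def nbow(tokens):
    """Normalized bag-of-words weights over tokens that have an embedding."""
    counts = Counter(t for t in tokens if t in EMBEDDINGS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("document has no tokens with a known embedding")
    return {word: count / total for word, count in counts.items()}


def one_sided_cost(src, dst):
    """Move each source word's whole weight to its nearest word in dst."""
    return sum(
        weight * min(math.dist(EMBEDDINGS[w], EMBEDDINGS[v]) for v in dst)
        for w, weight in src.items()
    )


def relaxed_wmd(tokens_a, tokens_b):
    """Relaxed word mover's distance: the larger of the two one-sided costs.

    This is a lower bound on the exact word mover's distance, not the exact value.
    """
    a, b = nbow(tokens_a), nbow(tokens_b)
    return max(one_sided_cost(a, b), one_sided_cost(b, a))


expanded_a = expand_with_citations("A", TEXTS, CITES)
plain_distance = relaxed_wmd(TEXTS["A"], TEXTS["B"])
expanded_distance = relaxed_wmd(expanded_a, TEXTS["B"])
```

In this toy setup, document A cites document C, which shares vocabulary with document B, so the expanded version of A ends up closer to B than the original A does. An exact word mover's distance would replace `relaxed_wmd` with an optimal transport solver over the same weights and pairwise distances.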
