Abstract
In this paper, we compare methods for cross-lingual similar document retrieval, focusing on the Russian-English language pair. We compare well-known methods such as Cross-Lingual Explicit Semantic Analysis (CL-ESA) with methods based on cross-lingual embeddings. We use approximate nearest neighbor (ANN) search to retrieve documents based entirely on distances between learned document embeddings. We also employ a more traditional approach that uses an inverted index, with an extra step of mapping the top keywords from one language to the other with the help of cross-lingual word embeddings. We evaluate all approaches on Russian-English aligned Wikipedia articles. Our experiments show that the inverted index approach achieves better recall and mean average precision (MAP) than the other methods.
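To make the inverted index approach more concrete, the following is a minimal sketch of the keyword-mapping step followed by an index lookup. It is an illustration under stated assumptions, not the authors' implementation: the function names (`map_keywords`, `build_inverted_index`, `retrieve`), the two-dimensional toy vectors, the whitespace tokenizer, and the term-overlap scoring are all hypothetical. The abstract does not say how keywords are selected or how candidates are ranked.

```python
import numpy as np
from collections import defaultdict

def map_keywords(keywords, src_vecs, tgt_words, tgt_matrix, k=1):
    """Map each source keyword to its k nearest target-language words
    by cosine similarity in a shared cross-lingual embedding space."""
    tgt_norm = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    mapped = []
    for word in keywords:
        if word not in src_vecs:
            continue  # keywords without an embedding are skipped
        v = src_vecs[word] / np.linalg.norm(src_vecs[word])
        sims = tgt_norm @ v
        for i in np.argsort(-sims)[:k]:
            mapped.append(tgt_words[int(i)])
    return mapped

def build_inverted_index(docs):
    """Build a term -> set of document ids mapping (whitespace tokens)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def retrieve(query_terms, index):
    """Rank documents by the number of matching query terms
    (an assumed scoring; ties are broken by document id)."""
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))

# Toy data, invented purely for illustration.
src_vecs = {"кошка": np.array([1.0, 0.0]), "собака": np.array([0.0, 1.0])}
tgt_words = ["cat", "dog"]
tgt_matrix = np.array([[0.9, 0.1], [0.1, 0.9]])
docs = {"en1": "the cat sat", "en2": "a dog barked", "en3": "cat and dog"}

index = build_inverted_index(docs)
english_terms = map_keywords(["кошка", "собака"], src_vecs, tgt_words, tgt_matrix)
ranking = retrieve(english_terms, index)
```

In this toy setting the Russian keywords are translated into English terms before the lookup, so the index itself stays monolingual. A production system would likely replace the overlap count with a weighted ranking function.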
Highlights
Document retrieval from a large collection of texts is an important information retrieval problem
The problem becomes even harder in the cross-lingual setting, where the query and the collection are in different languages
Some tasks require using a text as a query to retrieve documents that are somehow similar to it. One such task is plagiarism detection, which is divided into two stages: source retrieval and text alignment
Summary
Document retrieval from a large collection of texts is an important information retrieval problem. Some tasks require using a text (possibly a long one) as a query to retrieve documents that are somehow similar to it. One such task is plagiarism detection, which is divided into two stages: source retrieval and text alignment. At the source retrieval stage, given a suspicious document, we need to find all sources of probable text reuse in a large collection of texts. For this task, a source is a whole text, with no indication of which parts of it were plagiarized. Given a query document in one language, the goal is to find the most similar documents in a collection written in another language.
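The retrieval task described above, together with the recall and MAP metrics mentioned in the abstract, can be sketched as follows. This is a minimal illustration that assumes document embeddings are already available. It uses exact cosine search as a stand-in for ANN search, and the names `top_k`, `recall_at_k`, and `average_precision`, as well as the toy vectors, are hypothetical.

```python
import numpy as np

def top_k(query_vec, doc_matrix, k):
    """Return indices of the k documents closest to the query by cosine
    similarity. Exact search is used here in place of an ANN index."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = d @ q
    return [int(i) for i in np.argsort(-sims)[:k]]

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents found in the top k results."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def average_precision(ranked, relevant):
    """Average precision for one query; MAP is the mean over all queries."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

# Toy embeddings, invented for illustration: three target-language
# documents and one source-language query in the same vector space.
doc_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
query_vec = np.array([1.0, 0.2])
relevant = {2}  # assume document 2 is the aligned article

ranked = top_k(query_vec, doc_matrix, k=3)
```

With aligned Wikipedia articles, each query has a known counterpart in the other language, so the relevance sets needed by both metrics come directly from the alignment.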
More From: Proceedings of the Institute for System Programming of the RAS