Abstract

Cross-language textual similarity detection aims to estimate the similarity of two textual units written in different languages. This paper applies the distributed representation of words to cross-language textual similarity detection, using word embedding and IDF. It introduces a novel cross-language plagiarism detection approach built on the distributed representation of words in sentences. To improve the approach's textual similarity detection, a novel method called CL-CTS-CBOW is used. A syntax feature is then added to the approach through a novel method called CL-WES, and the approach is further improved with IDF weighting. The study uses four Arabic-English corpora (books, Wikipedia, EAPCOUNT, and MultiUN), which together contain more than 10,017,106 sentences and support both parallel and comparable collections. The proposed method combines different methods to confirm their complementarity. In the experiments, the proposed system achieves 88% English-Arabic similarity detection at the word level and 82.75% at the sentence level across various corpora.

Highlights

  • Earlier studies have used approaches such as crosslingual explicit semantic analysis (CL-ESA), syntactic alignment using character N-grams (CL-CNG), dictionaries and thesauruses, statistical machine translation, online machine translators [1] [6], and more recently, semantic networks and word embedding [7].

  • We explore the performance of distributed word representations (word embedding) and use them to propose novel crosslingual procedures for similarity detection

  • Focusing on the state-of-the-art methods, we found that the best performance is from the CL-ASA method at the word and sentence levels, but the overall performance of the method is lower than the CL-WES performance, which is the best single method evaluated


Summary

Introduction

Earlier studies have used approaches such as crosslingual explicit semantic analysis (CL-ESA), syntactic alignment using character N-grams (CL-CNG), dictionaries and thesauruses, statistical machine translation, online machine translators [1] [6], and more recently, semantic networks and word embedding [7]. These approaches are specific to bilingual plagiarism detection tasks and are usually insufficient for limited-resource languages. Word embedding is an important representation technique for sentence units in natural language processing (NLP) applications [15]. The GloVe word embedding model uses global vectors for word representation [21].
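As a rough illustration of how word embedding and IDF weighting can be combined for cross-language similarity, the sketch below builds an IDF-weighted average of word vectors for each sentence and compares sentences with cosine similarity. It is not the paper's CL-CTS-CBOW or CL-WES method. The function names, the smoothed IDF formula, and the tiny two-dimensional shared English-Arabic embedding table are all assumptions made for illustration; a real system would use embeddings trained on bilingual corpora.

```python
import math
from collections import Counter


def idf_weights(sentences):
    """Smoothed IDF per token: log((1 + N) / (1 + df)) + 1 (an assumed variant)."""
    n = len(sentences)
    df = Counter(tok for sent in sentences for tok in set(sent))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def sentence_vector(tokens, embeddings, idf):
    """IDF-weighted average of the embeddings of the tokens found in the table."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok in embeddings:
            w = idf.get(tok, 1.0)  # unseen tokens fall back to a neutral weight
            for i, x in enumerate(embeddings[tok]):
                total[i] += w * x
            weight_sum += w
    if weight_sum == 0.0:
        return total  # zero vector when no token is known
    return [x / weight_sum for x in total]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings placed by hand in one shared bilingual space.
SHARED_EMBEDDINGS = {
    "the": [0.5, 0.5],
    "book": [0.9, 0.1],
    "cat": [0.1, 0.9],
    "كتاب": [0.88, 0.12],  # Arabic "book"
    "قطة": [0.12, 0.88],   # Arabic "cat"
}

english_corpus = [["the", "book"], ["the", "cat"]]
arabic_corpus = [["كتاب"], ["قطة"]]

idf_en = idf_weights(english_corpus)
idf_ar = idf_weights(arabic_corpus)

english_vec = sentence_vector(["the", "book"], SHARED_EMBEDDINGS, idf_en)
sim_book = cosine(english_vec, sentence_vector(["كتاب"], SHARED_EMBEDDINGS, idf_ar))
sim_cat = cosine(english_vec, sentence_vector(["قطة"], SHARED_EMBEDDINGS, idf_ar))

print(f"similarity to Arabic 'book': {sim_book:.3f}")
print(f"similarity to Arabic 'cat':  {sim_cat:.3f}")
```

In this toy setup the frequent word "the" receives a lower IDF weight than "book", so the English sentence vector is pulled toward the content word and scores higher against the Arabic word for "book" than against the one for "cat".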

