Devising an entropy-based approach for identifying patterns in multilingual texts

Gulnur Yerkebulan, Vladimir Kulikov, Valentina Kulikova, Zaru Kulsharipova

doi:10.15587/1729-4061.2021.228695

Abstract

Although plagiarism detection remains a relevant problem, modern detection methods are still resource-intensive. This paper reports a more efficient alternative to existing solutions. The devised system for identifying patterns in multilingual texts compares two texts and determines, using several approaches, whether the second text is a translation of the first. The approach is based on Rényi entropy. An original text from a work by an English writer and five texts in Russian were selected for this research. The real and "fake" translations included translations by Google Translate and Yandex Translate, an authorized book translation, a text from another work by an English writer, and a fake text. The fake text is a text compiled with the same keyword frequencies as the authentic text. After a key series of high-frequency words was formed for the original text, the corresponding key series were identified for the other texts. The entropies of the texts were then calculated with the texts divided into "sentences" and into "paragraphs". A Minkowski metric was used to calculate the proximity of the texts. It underlies the calculation of the Hamming distance, the Cartesian distance, the distance between the centers of mass, the distance between the geometric centers, and the distance between the centers of parametric means. The proximity of texts was found to be best determined by the relative distances between the centers of parametric means (greater than 3 for "fake" texts and less than 1 for translations). Calculating the proximity of texts with the Rényi entropy-based algorithm reported in this work saves resources and time compared to methods based on neural networks. All the raw data and an example of the entropy calculation in PHP are publicly available.
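The two building blocks named in the abstract, Rényi entropy and the Minkowski metric, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' PHP implementation: the function names, the order alpha = 2, the toy segments, and the way keyword frequencies are turned into a distribution are all hypothetical choices made for the example. The abstract does not state which Minkowski order corresponds to each named distance; the sketch simply shows orders 1 (city-block) and 2 (Euclidean).

```python
import math
from collections import Counter


def renyi_entropy(probs, alpha=2.0):
    """Renyi entropy (natural log) of a discrete distribution.

    H_alpha = log(sum(p_i ** alpha)) / (1 - alpha); alpha -> 1 gives Shannon entropy.
    """
    probs = [p for p in probs if p > 0]
    if not probs:
        return 0.0  # assumption: a segment with no keywords contributes zero entropy
    if abs(alpha - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


def keyword_distribution(words, keywords):
    """Relative frequencies of the chosen keywords within one segment."""
    counts = Counter(w for w in words if w in keywords)
    total = sum(counts.values())
    return [counts[k] / total for k in keywords] if total else []


def minkowski(x, y, p=2.0):
    """Minkowski distance of order p between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Toy segments (hypothetical data, not taken from the paper).
keywords = ["sea", "ship", "storm"]
segments_a = [["the", "sea", "and", "the", "ship"], ["storm", "at", "sea", "sea"]]
segments_b = [["a", "ship", "on", "the", "sea"], ["sea", "storm", "storm", "sea"]]

# One entropy value per segment gives each text an "entropy profile".
entropies_a = [renyi_entropy(keyword_distribution(s, keywords)) for s in segments_a]
entropies_b = [renyi_entropy(keyword_distribution(s, keywords)) for s in segments_b]

print(minkowski(entropies_a, entropies_b, p=1))  # city-block distance
print(minkowski(entropies_a, entropies_b, p=2))  # Euclidean distance
```

In a multilingual comparison each text would use its own key series (the Russian counterparts of the English keywords), so only the entropy profiles, not the words themselves, are compared across languages.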

Highlights

Plagiarism detection is still a pressing issue, especially with the advent of websites that automatically generate texts and of translation websites that translate from one language to another while altering the original text.
The benefits of neural networks include problem-solving under unknown patterns, resistance to noisy input data, potentially ultra-high performance, and fault-tolerant operation when the network is implemented in hardware [1].
The authorized translation (RuAuth), unlike the original text (En), uses high-frequency words differently over sequences of 1‒10 words, as shown in Fig. 1, so the Hamming distance was calculated for sequences of 11‒20 words (Table 2).

Summary

Introduction

Plagiarism detection is still a pressing issue, especially with the advent of websites that automatically generate texts and of translation websites that translate from one language to another while altering the original text. Various methods are used to detect plagiarism, among which neural network-based techniques are evolving rapidly. Neural networks are used wherever prediction, classification, or control tasks need to be solved. The benefits of neural networks include problem-solving under unknown patterns, resistance to noisy input data, potentially ultra-high performance, and fault-tolerant operation when the network is implemented in hardware [1]. Beyond detecting borrowing, such methods could make it possible to trace the original sources of news and articles regardless of language.

Literature review and problem statement

The aim and objectives of the study

Materials and methods to study patterns in polylingual texts

Conclusions

Journal: Eastern-European Journal of Enterprise Technologies
Publication Date: Apr 30, 2021
License type: CC BY 4.0
