Abstract

Methods are described for identifying near-duplicates in electronic scientific papers that contain content of several types: text, mathematical formulas, numerical data, etc. For text data, a locality-sensitive hashing method is formalized that computes the Hamming distance between elements of the indices of electronic scientific papers; if the Hamming distance falls below a fixed numerical threshold, the paper is considered to contain a near-duplicate. For numerical data, subsequences are formed for each scientific work, and the proximity between papers is determined as the Euclidean distance between vectors composed of the numbers in these subsequences. To compare mathematical formulas, a method for comparing formula templates is used, and the names of variables are compared. To identify near-duplicates in graphic information, two directions are distinguished: finding key points in an image and applying locality-sensitive hashing to individual pixels. Since scientific papers often include objects such as schemes and diagrams, their captions are examined separately using the methods for comparing text. A combined method for identifying near-duplicates in electronic scientific papers is proposed, uniting the identification methods for the various data types. To implement the combined method, an information-analytical system was devised that processes scientific materials according to content type. This makes it possible to reliably identify near-duplicates and to expose, as widely as possible, potential abuse and plagiarism in electronic scientific papers: scientific articles, dissertations, monographs, conference materials, etc.
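The text-data and numerical-data steps described above can be sketched in a few lines. This is a minimal illustration only: it uses SimHash, one common locality-sensitive hashing scheme, over whitespace-separated word tokens, which is an assumption rather than the authors' exact indexing procedure, and the threshold value of 3 differing bits is likewise a placeholder for the fixed numerical threshold mentioned in the abstract.

```python
import hashlib
import math


def simhash(tokens, bits=64):
    """SimHash: a locality-sensitive fingerprint; similar token
    sequences yield fingerprints with a small Hamming distance."""
    votes = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # Each bit of the fingerprint is the sign of the accumulated vote.
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, threshold=3):
    """Texts whose fingerprints differ in at most `threshold` bits
    are flagged as near-duplicates (threshold value is illustrative)."""
    fa = simhash(text_a.lower().split())
    fb = simhash(text_b.lower().split())
    return hamming_distance(fa, fb) <= threshold


def euclidean_distance(u, v):
    """Proximity of two papers' numerical-subsequence vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
```

Identical texts produce identical fingerprints (distance 0), while unrelated texts almost surely differ in far more than a handful of bits, so a small fixed threshold separates near-duplicates from distinct papers.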

Highlights

  • The task of analyzing the content of electronic scientific papers for the identification of near-duplicates is relevant for professional scientific publications, specialized academic councils for the presentation of dissertations, and the scientific community in general

  • For the qualitative identification of near-duplicates, data of all types must be analyzed for similarity using the methods best suited to each type

  • The purpose of this study is to develop a combined method for identifying near-duplicates in electronic scientific papers, taking into consideration data of various types



Introduction

The task of analyzing the content of electronic scientific papers for the identification of near-duplicates is relevant for professional scientific publications, specialized academic councils for the presentation of dissertations, and the scientific community in general. Improving the methods for identifying near-duplicates in scientific papers is an important tool for preventing abuse and plagiarism in higher education and for ensuring academic integrity. The problem of identifying near-duplicates is not trivial, since electronic scientific works can contain data of different types: texts, mathematical formulas, tables, schemes and diagrams, pictures, numerical data, etc. For the qualitative identification of near-duplicates, data of all types must be analyzed for similarity using the methods best suited to each type. That is why there is a need to devise a combined method for identifying near-duplicates in scientific papers that takes into consideration data of various types.

