Detection of near-duplicates in tables based on the locality-sensitive hashing method and the nearest neighbor method

Petro Lizunov, Svitlana Biloshchytska, Andrii Biloshchytskyi, Larysa Chala, Alexander Kuchansky

doi:10.15587/1729-4061.2016.86243

Abstract

A hybrid method for the detection of near-duplicates in tables is proposed. The method identifies similarities between the text data and the numeric data of tables separately and then generalizes the results. For the text data, sequences of words are formed in canonized form, and bit sequences are constructed from them using locality-sensitive hashing. Similarity between data is then determined by the Hamming distance at an assigned threshold value. Similarities between numeric data in tables are identified using the nearest neighbor method with assigned distance metrics. The method makes it possible to identify near-duplicates in the data of an input table relative to a set of tables selected from scientific publications, dissertations, and theses. It should be noted that the method is designed for finding near-duplicates in tables that contain only text and numeric data. If the examined tables contain pictures or formulas, these objects are examined separately using specific methods. The proposed method might be implemented in systems intended for intelligent analysis of information represented by text and tables in order to identify similarities and detect near-duplicates, in particular in anti-plagiarism systems.
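The text branch described above (canonized word sequences, a bit sequence per cell, comparison by Hamming distance against a threshold) can be illustrated with a minimal sketch. The abstract does not say which locality-sensitive hashing scheme the authors use, so this example assumes a SimHash-style fingerprint over word shingles. The function names, the 64-bit width, the shingle length `k`, and the threshold value are all hypothetical choices for illustration, not the authors' implementation.

```python
import hashlib
import re

BITS = 64  # fingerprint width (an assumption, not taken from the paper)


def canonize(text):
    """Lowercase the text and keep word tokens only (a stand-in for canonization)."""
    return re.findall(r"\w+", text.lower())


def shingles(words, k=2):
    """Return overlapping word sequences of length k (or one short shingle)."""
    if len(words) < k:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def simhash(text, k=2):
    """Build a 64-bit SimHash-style fingerprint from the word shingles of text."""
    counts = [0] * BITS
    for sh in shingles(canonize(text), k):
        # Take the first 8 bytes of an MD5 digest as a 64-bit hash of the shingle.
        h = int.from_bytes(hashlib.md5(sh.encode("utf-8")).digest()[:8], "big")
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a, text_b, threshold=3):
    """Flag two cell texts as near-duplicates if their fingerprints are close."""
    return hamming(simhash(text_a), simhash(text_b)) <= threshold
```

Texts that canonize to the same word sequence, such as `"Mean error, %"` and `"mean error %"`, get identical fingerprints and a Hamming distance of 0. How far apart two different texts land depends on the hash, so the threshold would need tuning on real table data.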

Highlights

A table is understood as an arrangement of various types of data in rows and columns, or in the form of a more complex structure
The problem of finding near-duplicates in numerical data is related to identifying similarities in time series by comparison with a sample, using the nearest neighbor method with an assigned metric [13, 14]
As a result of the research, we described and formalized a hybrid method for the detection of near-duplicates based on the method of locality-sensitive hashing and the nearest neighbor method
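The numeric branch mentioned in the highlights, nearest neighbor search with an assigned metric, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: it treats each numeric column as a plain list of floats, uses the Euclidean distance as the assigned metric, and does a brute-force scan. The names `euclidean` and `nearest_neighbor` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two numeric sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor(query, candidates, metric=euclidean):
    """Return (index, distance) of the closest candidate, or None if none match.

    Candidates whose length differs from the query are skipped, since the
    metric assumed here is only defined for sequences of equal length.
    """
    best = None
    for idx, cand in enumerate(candidates):
        if len(cand) != len(query):
            continue
        d = metric(query, cand)
        if best is None or d < best[1]:
            best = (idx, d)
    return best


# Usage: a query column compared against stored columns.
stored = [[10.0, 20.0, 30.0], [1.0, 2.1, 3.0], [1.0, 2.0]]
match = nearest_neighbor([1.0, 2.0, 3.0], stored)
```

Here `match` points at the second stored column, which differs from the query in a single value. In a detection setting, the returned distance would then be compared with a threshold to decide whether the column counts as a near-duplicate with deliberately altered values.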

Summary

Introduction

A table is understood as an arrangement of various types of data in rows and columns, or in the form of a more complex structure. The usual term “column” is often interpreted as a field, attribute, or property that has a specified title or name. This name may consist of a word or a sequence of words, or be presented by numeric values, one or more formulas, or a date. If the original table contains the results of a numerical experiment, these numeric data may be deliberately changed when the table is borrowed. All this considerably complicates comparing tables to identify near-duplicates. The method proposed in the article might be used to develop a module of a software package for detecting near-duplicates in theses and diploma papers, as well as in scientific publications.

Literature review and problem statement

The purpose and tasks of the study

The types of data that are represented in table cells

Model of indexing the table data

A hybrid method for detecting near-duplicates in tables

Represent subsequences

Discussion of results of research into detection of near-duplicates in tables

Conclusions

Journal: Eastern-European Journal of Enterprise Technologies
Publication Date: Dec 27, 2016
Citations: 16
License type: cc-by
