Abstract

Data is an important source of knowledge discovery, but similar duplicate data not only increase the redundancy of a database but also hinder subsequent data mining, so cleaning them improves the efficiency of that work. Considering the complexity of the Chinese language and the performance bottleneck of single-machine systems on large-scale data, this paper proposes a Chinese data cleaning method that combines the BERT model with the k-means clustering algorithm and gives a parallel implementation scheme for the algorithm. In the text-to-vector process, position vectors are introduced to capture the contextual features of words, and the vectors are adjusted dynamically according to semantics, so that polysemous words obtain different vector representations in different contexts. This vectorization process is designed for parallel execution on Hadoop. The k-means clustering algorithm is then used to cluster similar duplicate data to achieve the cleaning. Experimental results on a variety of data sets show that the proposed parallel cleaning algorithm not only has good speedup and scalability but also improves the precision and recall of similar duplicate data cleaning, which is of great significance for subsequent data mining.
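The pipeline described above can be pictured with a minimal single-machine sketch: Chinese records are encoded into fixed-length vectors with a pretrained BERT model and then grouped with k-means, so records falling into the same cluster become candidate similar duplicates. The model name (bert-base-chinese), the use of the [CLS] hidden state as the sentence vector, the sample sentences, and the choice of k are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: BERT sentence vectors + k-means duplicate-candidate clustering.
# Assumptions: bert-base-chinese, [CLS] pooling, k=2, toy records.
import torch
from transformers import BertTokenizer, BertModel
from sklearn.cluster import KMeans

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def embed(texts):
    """Encode texts into fixed-length vectors using the [CLS] hidden state."""
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True,
                          max_length=128, return_tensors="pt")
        hidden = model(**batch).last_hidden_state   # (batch, seq_len, 768)
        return hidden[:, 0, :].numpy()              # [CLS] vector per text

records = ["北京是中国的首都", "中国的首都是北京", "今天天气很好"]
vectors = embed(records)

# Records assigned to the same cluster are treated as candidate duplicates.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for text, label in zip(records, labels):
    print(label, text)
```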

Highlights

  • Network resources are a huge and constantly updated ocean of information and an important channel for people to obtain information and knowledge

  • Existing data cleaning algorithms cannot meet practical needs in cleaning efficiency and accuracy. The cleaning of Chinese similar duplicate data falls into two categories: cleaning algorithms based on literal similarity and cleaning algorithms based on semantic similarity

  • Literal similarity cannot distinguish data with the same semantics but different surface forms, so it is difficult to apply to Chinese data. Existing algorithms based on semantic similarity cannot effectively clean out all similar duplicate records because important information is lost during vectorization (see the sketch after this list)
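The limitation of literal similarity noted in the last highlight can be seen in a small, self-contained example: a character-level edit-distance score rates two sentences with the same meaning but different wording as dissimilar. The sentences and the similarity normalization below are hypothetical illustrations, not taken from the paper's data.

```python
# Illustration: literal (surface-form) similarity misses semantic duplicates.
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

s1, s2 = "他乘飞机去上海", "他坐航班前往上海"   # same meaning, different wording
dist = edit_distance(s1, s2)
literal_sim = 1 - dist / max(len(s1), len(s2))
print(f"literal similarity = {literal_sim:.2f}")   # low despite equal meaning
# A semantic measure (e.g. cosine similarity of BERT vectors, as sketched
# under the Abstract) would score these two sentences as highly similar.
```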


Summary

Introduction

Network resources are a huge and constantly updated ocean of information and an important channel for people to obtain information and knowledge. Existing algorithms based on semantic similarity cannot effectively clean out all similar duplicate records because important information is lost during vectorization. This paper obtains text vectors through a parallel design of the BERT language model, calculates the distances between texts, and cleans out similar duplicate data through k-means clustering. (i) Aiming at the phenomena of synonymy and polysemy in Chinese, the BERT model is used to reduce the loss of original semantic information in the text-to-vector process. (ii) The idea of clustering is used to realize parallel cleaning of similar duplicate data.
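One way to picture the parallel cleaning step is as a Hadoop Streaming job: mappers assign each record's BERT vector to its nearest cluster centroid, and reducers receive all records of a cluster together so duplicates can be compared and removed. The file names, input format, and mapper/reducer split shown here are assumptions for illustration; the paper's actual MapReduce design may differ.

```python
#!/usr/bin/env python3
# Hypothetical Hadoop Streaming sketch of one k-means assignment step.
# Mapper input line:  <record_id>\t<comma-separated BERT vector>
# Mapper output line: <cluster_id>\t<record_id>
import sys
import numpy as np

# Centroids are assumed to be shipped to every worker (e.g. via the -files option).
centroids = np.load("centroids.npy")

def mapper():
    for line in sys.stdin:
        record_id, vec_str = line.rstrip("\n").split("\t", 1)
        vec = np.array(vec_str.split(","), dtype=float)
        # Assign the record to its nearest centroid by Euclidean distance.
        cluster = int(np.argmin(np.linalg.norm(centroids - vec, axis=1)))
        print(f"{cluster}\t{record_id}")

def reducer():
    # Hadoop sorts by key, so all records of one cluster arrive consecutively;
    # a downstream step can keep one representative record per cluster.
    for line in sys.stdin:
        print(line.rstrip("\n"))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```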

