Abstract

Data is an important source of knowledge discovery, but similar duplicate data not only increase the redundancy of a database but also hinder subsequent data mining, so cleaning them improves the efficiency of that work. Considering the complexity of the Chinese language and the performance bottleneck of single-machine systems on large-scale data, this paper proposes a Chinese data cleaning method that combines the BERT model with the k-means clustering algorithm and gives a parallel implementation scheme for the algorithm. In the text-to-vector process, position vectors are introduced to capture the contextual features of words, and the vectors are adjusted dynamically according to semantics, so that polysemous words obtain different vector representations in different contexts. This vectorization process is designed for parallel execution on Hadoop. The k-means clustering algorithm is then used to cluster similar duplicate data to achieve the cleaning. Experimental results on a variety of data sets show that the proposed parallel cleaning algorithm not only has good speedup and scalability but also improves the precision and recall of similar duplicate data cleaning, which is of great significance for subsequent data mining.
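The pipeline described above can be pictured with a minimal single-machine sketch: Chinese records are encoded into fixed-length vectors with a pretrained BERT model and then grouped with k-means, so records falling into the same cluster become candidate similar duplicates. The model name (bert-base-chinese), the use of the [CLS] hidden state as the sentence vector, the sample sentences, and the choice of k are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: BERT sentence vectors + k-means duplicate-candidate clustering.
# Assumptions: bert-base-chinese, [CLS] pooling, k=2, toy records.
import torch
from transformers import BertTokenizer, BertModel
from sklearn.cluster import KMeans

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def embed(texts):
    """Encode texts into fixed-length vectors using the [CLS] hidden state."""
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True,
                          max_length=128, return_tensors="pt")
        hidden = model(**batch).last_hidden_state   # (batch, seq_len, 768)
        return hidden[:, 0, :].numpy()              # [CLS] vector per text

records = ["北京是中国的首都", "中国的首都是北京", "今天天气很好"]
vectors = embed(records)

# Records assigned to the same cluster are treated as candidate duplicates.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for text, label in zip(records, labels):
    print(label, text)
```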

Highlights

  • Network resources are a huge and constantly updated ocean of information and an important channel for people to obtain information and knowledge

  • Existing data cleaning algorithms cannot meet practical needs in cleaning efficiency and accuracy. The cleaning of Chinese similar duplicate data falls into two categories: cleaning algorithms based on literal similarity and cleaning algorithms based on semantic similarity

  • Literal similarity cannot distinguish data with the same semantics but different surface forms, so it is difficult to apply to Chinese data. Existing algorithms based on semantic similarity cannot effectively clean out all similar duplicate records because important information is lost during vectorization (see the sketch after this list)
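The limitation of literal similarity noted in the last highlight can be seen in a small, self-contained example: a character-level edit-distance score rates two sentences with the same meaning but different wording as dissimilar. The sentences and the similarity normalization below are hypothetical illustrations, not taken from the paper's data.

```python
# Illustration: literal (surface-form) similarity misses semantic duplicates.
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

s1, s2 = "他乘飞机去上海", "他坐航班前往上海"   # same meaning, different wording
dist = edit_distance(s1, s2)
literal_sim = 1 - dist / max(len(s1), len(s2))
print(f"literal similarity = {literal_sim:.2f}")   # low despite equal meaning
# A semantic measure (e.g. cosine similarity of BERT vectors, as sketched
# under the Abstract) would score these two sentences as highly similar.
```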


Summary

Introduction

Network resources are a huge and constantly updated ocean of information and an important channel for people to obtain information and knowledge. Existing algorithms based on semantic similarity cannot effectively clean out all similar duplicate records because important information is lost during vectorization. This paper obtains text vectors through a parallel design of the BERT language model, calculates the distances between texts, and cleans out similar duplicate data through k-means clustering. (i) Aiming at the phenomena of synonymy and polysemy in Chinese, the BERT model is used to reduce the loss of original semantic information in the text-to-vector process. (ii) The idea of clustering is used to realize parallel cleaning of similar duplicate data.
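One way to picture the parallel cleaning step is as a Hadoop Streaming job: mappers assign each record's BERT vector to its nearest cluster centroid, and reducers receive all records of a cluster together so duplicates can be compared and removed. The file names, input format, and mapper/reducer split shown here are assumptions for illustration; the paper's actual MapReduce design may differ.

```python
#!/usr/bin/env python3
# Hypothetical Hadoop Streaming sketch of one k-means assignment step.
# Mapper input line:  <record_id>\t<comma-separated BERT vector>
# Mapper output line: <cluster_id>\t<record_id>
import sys
import numpy as np

# Centroids are assumed to be shipped to every worker (e.g. via the -files option).
centroids = np.load("centroids.npy")

def mapper():
    for line in sys.stdin:
        record_id, vec_str = line.rstrip("\n").split("\t", 1)
        vec = np.array(vec_str.split(","), dtype=float)
        # Assign the record to its nearest centroid by Euclidean distance.
        cluster = int(np.argmin(np.linalg.norm(centroids - vec, axis=1)))
        print(f"{cluster}\t{record_id}")

def reducer():
    # Hadoop sorts by key, so all records of one cluster arrive consecutively;
    # a downstream step can keep one representative record per cluster.
    for line in sys.stdin:
        print(line.rstrip("\n"))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```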

