Abstract

Dirty data is inevitable in a big data environment and seriously degrades data quality. Data cleaning is one of the most important techniques for improving data quality, and research on data cleaning frameworks supports big data decision making. This paper proposes a general framework for data cleaning in big data. Its core data cleaning module comprises three submodules: incomplete record cleaning, inconsistent data repair, and approximate duplicate record cleaning. The data cleaning processes are discussed in detail.

Big data is characterized by volume, variety, value, velocity and complexity, and the original information contains incomplete, incorrect and duplicate dirty data, which makes big data uncontrollable and unusable[1-2]. The aim is to extract valuable information from massive data as a reference for decision makers. Because of errors in data merging or in the migration of data sources, some redundant, incomplete, indeterminate and inconsistent data is unavoidable. Such data is called dirty data, and it seriously affects the efficiency of data utilization and the quality of decision making. Data cleaning is therefore particularly important for making data more accurate and consistent: it filters out or corrects unwanted data and outputs the required data. Some research on data cleaning for big data already exists[3-6]. Big data technology developed from traditional technology and inherits traditional concepts and analysis methods[7-8], such as data cleaning and data warehousing. Traditional data cleaning provides high-quality data and improves the efficiency and correctness of data analysis. In a big data environment, data cleaning is the foundation and first step of big data analysis, and it determines the data quality of the results.
This paper discusses data cleaning technology for big data and proposes a general data cleaning framework.
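
The three submodules named above (incomplete record cleaning, inconsistent data repair, approximate duplicate record cleaning) can be pictured as stages of a pipeline. The following is only a minimal sketch under stated assumptions, not the framework proposed in the paper: the function names, the drop-if-missing policy, the lookup-table repair rules, and the string-similarity threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def clean_incomplete(records, required):
    """Stage 1 (assumed policy): drop records that lack any required field."""
    return [
        r for r in records
        if all(r.get(field) not in (None, "") for field in required)
    ]


def repair_inconsistent(records, rules):
    """Stage 2 (assumed policy): map variant values to one canonical value.

    `rules` is a hypothetical lookup: {field: {variant: canonical}}.
    """
    repaired = []
    for r in records:
        fixed = dict(r)  # copy so the input records are not mutated
        for field, mapping in rules.items():
            value = fixed.get(field)
            if value in mapping:
                fixed[field] = mapping[value]
        repaired.append(fixed)
    return repaired


def remove_approx_duplicates(records, key, threshold=0.85):
    """Stage 3 (assumed policy): keep the first of any records whose `key`
    strings are at least `threshold` similar, ignoring case."""
    kept = []
    for r in records:
        is_duplicate = any(
            SequenceMatcher(None, r[key].lower(), k[key].lower()).ratio()
            >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(r)
    return kept


def clean(records, required, rules, key, threshold=0.85):
    """Run the three stages in the order the abstract lists them."""
    records = clean_incomplete(records, required)
    records = repair_inconsistent(records, rules)
    return remove_approx_duplicates(records, key, threshold)
```

The pairwise comparison in the third stage is quadratic in the number of records, so at big data scale it would need blocking or a distributed implementation; the sketch only shows how the stages fit together.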
