Abstract

This research analyzes patterns of data errors in order to complete the data required for household big data development at the sub-district level in Thailand. Feature selection and a Multi-Layer Perceptron Neural Network were applied. Data imbalance was addressed with the SMOTE method, and the CFS feature selection method was compared with the Information Gain (IG) feature selection method. Data errors in the resulting datasets were then classified by the Multi-Layer Perceptron Neural Network, and each model's effectiveness was measured by 10-fold cross-validation. The results showed that the suitable data size after adjusting for data imbalance was 400%. The model that combined SMOTE data-size adjustment, CFS feature selection, and Multi-Layer Perceptron classification gave the highest effectiveness in classifying data errors, with an accuracy of 98.29%. Moreover, the application could effectively classify data errors and display the household big data at the highest level. In the application evaluation by experts and users, mean scores were 4.69 or higher with standard deviations of 0.47 or lower, corresponding to an effectiveness level of 93.78% or higher, while interquartile ranges did not exceed 1 and quartile deviations did not exceed 0.5.
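The SMOTE step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `smote`, the toy rows, and the neighbour count `k` are hypothetical. It also assumes that the 400% figure means four synthetic rows per original minority row, which the abstract does not state explicitly.

```python
import math
import random


def smote(minority, percentage=400, k=5, seed=0):
    """Create synthetic minority rows by interpolating towards neighbours.

    Assumption: percentage=400 means 4 synthetic rows per original row.
    """
    rng = random.Random(seed)
    per_sample = percentage // 100
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority-class neighbours of x, excluding x itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position along the segment from x to nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Hypothetical minority-class rows with two numeric features
minority_rows = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_rows = smote(minority_rows, percentage=400, k=3)
print(len(new_rows))
```

Each synthetic row lies on a line segment between an existing minority row and one of its nearest minority neighbours, so the minority class grows without exact duplication of records.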

Highlights

  • The development of big data in the fields of health, economics, environment, activities, developments, and household demographics is crucial for community development

  • Community demographics are considered a big data prototype linked with the national big data system, facilitating the data processing cycle and reflecting the genuine problems embedded in the data

  • Dealing with data errors, including missing data, incorrect input, typographical errors, inconsistent data, and violated attribute dependencies, is a challenge for big data


Summary

Introduction

The development of big data in the fields of health, economics, environment, activities, developments, and household demographics is crucial for community development. Comprehensive and accurate data can reveal a community's genuine problems and demands, which governmental agencies and responsible figures such as village leaders, subdistrict administrators, local people themselves, researchers, and the business sector can use to solve those problems. One of the most common problems in collecting community data is that local people are hesitant to provide information. Although both the public and private sectors have tried to collect data from local communities, local people rarely understand the overall picture because the analyzed data has not been made accessible to them. As a result, they are reluctant to provide further information. Governmental agencies, researchers, and the business sector can make use of this information to support and develop the communities in the future.

Synthetic minority over-sampling technique
Feature selection
Multi-layer perceptron neural network
Literature review
Methodology
Data preprocessing
Handling imbalanced data by SMOTE
Feature selection by CFS and IG
Model creation by multi-layer perceptron neural network
Effectiveness evaluation of the model
Development and deployment of the application
Research results
Effectiveness evaluation results of the application
Conclusion
Findings
