Dataset retrieval system based on automation of data preparation with dataset description model

Jonghyeok Mun,Jongsun Choi,Kitae Bae,Jaeyoung Choi,Sanghwan Lee

doi:10.1002/cpe.5288

Abstract

SummaryData preparation is the most effortful task in the process of statistical learning. Many studies related to data mining are performed without data preparation by assuming that qualified datasets are already prepared. It may hide useful patterns of data, which can result in poor performance and incorrect learning. Automation of data preparation can solve these problems. For automation of data preparation, a few issues should be considered, such as flexible expression of requirements according to the purpose of the learning model, accessibility to data sources, and performance degradation due to automation. In this paper, we propose a dataset description model that can express the requirements for data processing and dataset retrieval system based on automated data preparation. The proposed system makes it possible to provide good quality datasets for statistical learning applications using data preparation methods such as data acquisition, refinement, and organization. In the experiment, we demonstrate that the proposed system doesn't have performance loss as compared to the existing manual systems. Moreover, the quality of the datasets are also improved by using the proposed system.

Full Text