Improving the Usability of Tabular Data Through Data Annotation, Repair and Augmentation

Rabeb Abida, Anthony Cleve

doi:10.1007/978-3-031-17030-0_6

Abstract

In recent years, a rapidly increasing amount of information has been made publicly available in tabular form on the Web. Much of this data is unusable because of its poor quality (e.g., misspelled or missing values, missing or incomplete metadata, and missing meaningful columns). Solutions have been proposed in the literature to address these data quality issues, but all-in-one approaches that fully solve them are still lacking. Users must therefore combine several methods to resolve these issues. In this paper, we present an automatic, all-in-one approach called SINATRA that helps to bridge this gap by providing the following features: data annotation (to address misspelled and incomplete metadata), data repair (to address missing data values), and data augmentation (to dynamically add meaningful columns and their corresponding cell values to the dataset). An evaluation of the SINATRA approach on datasets from a state-of-the-art benchmark shows promising results in terms of F1-measure and precision.

Full Text