Abstract

Data transformation is a laborious and time-consuming task for analysts. Programming by example (PBE) is a technique that can simplify this difficult task for data analysts by automatically generating programs for data transformation. Most previously proposed PBE methods are based on search algorithms, but recent improvements in machine learning (ML) have led to its application in PBE research. For example, RobustFill was proposed as an ML-based PBE method for string transformation, using long short-term memory (LSTM) as the sequential encoder–decoder model. However, an ML-based PBE method has not been developed for tabular transformations, which are used frequently in data analysis. Thus, in the present study, we propose an ML-based PBE method for tabular transformations. First, we consider the features of tabular transformations, which are more complex and data intensive than string transformations, and propose a new ML-based PBE method using the state-of-the-art Transformer sequential encoder–decoder model. To our knowledge, this is the first ML-based PBE method for tabular transformations. We also propose two decoding methods, multistep beam search and program validation (PV)-beam search, which are optimized for program generation and thus generate correct programs with higher accuracy. Our evaluation results demonstrated that the Transformer-based PBE model performed much better than LSTM-based PBE when applied to tabular transformations. Furthermore, the Transformer-based model with the proposed decoding method performed better than the conventional PBE model using the search-based method.

Highlights

  • Previous works [3], [4] propose machine learning (ML)-based programming by example (PBE) by leveraging long short-term memory (LSTM), an encoder–decoder model used mainly for natural language processing and time-series processing

  • We propose a new ML approach to achieve PBE for tabular transformation, which is required in data analysis and data integration scenarios

  • We propose two decoding methods: multistep beam search, which is suited to finding a consistent program by running beam search multiple times, and PV (program validation)-beam search, which makes beam search efficient by searching only the hypothesis space that is valid as a program
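The core idea behind PV-beam search can be sketched with a toy decoder: at each step, candidate extensions that cannot form a valid program prefix are pruned before the beam is re-ranked. Note this is a minimal illustration of the validation idea only; the grammar, vocabulary, and hard-coded scores below are stand-ins for a real model's token probabilities, not the paper's actual decoder.

```python
# Toy sketch of PV (program validation)-beam search. Hypotheses whose
# token sequence is not a valid program prefix are discarded before the
# beam is pruned, so the beam never wastes slots on unparseable outputs.
# VOCAB, is_valid_prefix, and score are illustrative assumptions.
import heapq

VOCAB = ["(", ")", "op", "<eos>"]

def is_valid_prefix(tokens):
    """Valid iff parentheses never go negative and <eos> only ends a
    balanced sequence."""
    depth = 0
    for i, tok in enumerate(tokens):
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0:
                return False
        elif tok == "<eos>":
            return depth == 0 and i == len(tokens) - 1
    return True

def score(prefix, tok):
    # Stand-in for log-probabilities from a trained model.
    table = {"op": -0.5, "(": -0.9, ")": -0.3, "<eos>": -0.1}
    return table[tok]

def pv_beam_search(beam_width=2, max_len=6):
    beam = [(0.0, [])]  # (cumulative log-prob, token sequence)
    for _ in range(max_len):
        candidates = []
        for logp, toks in beam:
            if toks and toks[-1] == "<eos>":
                candidates.append((logp, toks))  # keep finished hypotheses
                continue
            for tok in VOCAB:
                new = toks + [tok]
                if is_valid_prefix(new):  # program-validation pruning
                    candidates.append((logp + score(toks, tok), new))
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beam

for logp, toks in pv_beam_search():
    print(round(logp, 2), toks)
```

Every hypothesis surviving in the beam is a valid program prefix by construction, which is what makes the search efficient: beam slots are never spent on outputs that could not parse.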


Summary

INTRODUCTION

The integration of data from a variety of data sources into a unified format is a time-consuming and labor-intensive task for engineers and domain specialists. Foofah [1] is a PBE study of such tabular transformation. It is built around techniques for solving search problems, i.e., finding a program that converts a given input table into the given output table. Previous works [3], [4] propose ML-based PBE by leveraging long short-term memory (LSTM), an encoder–decoder model used mainly for natural language processing and time-series processing. Their results show that ML-based PBE can achieve accuracy competitive with conventional non-ML-based PBE, as well as a noise robustness that non-ML-based PBE does not have. We propose a new ML approach that achieves PBE for tabular transformation, which is required in data analysis and data integration scenarios.
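The search-based formulation used by systems like Foofah can be illustrated with a minimal enumerative search: try operator sequences of increasing length until one maps the input table to the output table. The three operators and the tuple-of-tuples table encoding below are illustrative assumptions for the sketch, not Foofah's actual DSL.

```python
# Minimal sketch of search-based PBE for tabular transformation:
# breadth-first enumeration of operator sequences, checked against the
# user's input/output example. Operators here are toy stand-ins.
from itertools import product

# Tables are tuples of row-tuples so results are directly comparable.
def transpose(t):
    return tuple(zip(*t))

def drop_first_row(t):
    return t[1:] if len(t) > 1 else t

def dup_first_row(t):
    return (t[0],) + t if t else t

OPS = {"transpose": transpose,
       "drop_first_row": drop_first_row,
       "dup_first_row": dup_first_row}

def search(inp, out, max_len=3):
    """Return the shortest operator sequence mapping inp to out, or None."""
    for length in range(1, max_len + 1):
        for seq in product(OPS, repeat=length):
            t = inp
            for name in seq:
                t = OPS[name](t)
            if t == out:
                return seq
    return None

inp = (("a", "b"), ("1", "2"))
out = (("a", "1"), ("b", "2"))
print(search(inp, out))  # finds ('transpose',)
```

Even this toy version shows why search-based PBE struggles to scale: the search space grows exponentially in the program length and the number of operators, which is one motivation for the ML-based approach proposed here.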

PROBLEM FORMULATION
PROGRAM LINEARIZATION
CONCLUSION AND FUTURE WORK