Abstract
Type inference refers to the task of inferring the data type of a given column of data. Current approaches often fail when data contains missing data and anomalies, which are found commonly in real-world data sets. In this paper, we propose ptype, a probabilistic robust type inference method that allows us to detect such entries, and infer data types. We further show that the proposed method outperforms existing methods.
Highlights
Data analytics can be defined as the process of generating useful insights from raw data sets
Constructing Probabilistic Finite-State Machines (PFSMs) for complex types might require more human engineering than the other types. We reduce this need by building such PFSMs automatically from corresponding regular expressions
To measure the performance on type/non-type inference, we report Area Under Curve (AUC) of Receiver Operating Characteristic (ROC) curves, as well as the percentages of TPs, FPs and FNs
Summary
Data analytics can be defined as the process of generating useful insights from raw data sets. Central to the entire process is the concept of data wrangling, which refers to the task of understanding, interpreting, preparing a raw data set and turning it into a usable format. This task can lead to a frustrating and time-consuming process for large data sets, and even possibly for some small sized ones (Dasu and Johnson 2003). Raw data often do not contain any well-documented prior information such as meta-data
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.