Sherlock

Madelon Hulsebos, Emanuel Zgraggen, Çagatay Demiralp, Kevin Hu, Michiel Bakker, Arvind Satyanarayan, Tim Kraska, César Hidalgo

doi:10.1145/3292500.3330993

Abstract

Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expression matching to detect semantic types. However, these matching-based approaches often are not robust to dirty data and only detect a limited number of types. We introduce Sherlock, a multi-input deep neural network for detecting semantic types. We train Sherlock on $686,765$ data columns retrieved from the VizNet corpus by matching $78$ semantic types from DBpedia to column headers. We characterize each matched column with $1,588$ features describing the statistical properties, character distributions, word embeddings, and paragraph vectors of column values. Sherlock achieves a support-weighted F$_1$ score of $0.89$, exceeding that of machine learning baselines, dictionary and regular expression benchmarks, and the consensus of crowdsourced annotations.
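The support-weighted F$_1$ score reported above can be illustrated with a small sketch. This is not the authors' evaluation code: the function name `support_weighted_f1` and the toy labels are assumptions made purely for illustration. The sketch computes a per-class F$_1$ and averages the classes weighted by how many true samples (the "support") each class has.

```python
from collections import Counter


def support_weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its support.

    Hypothetical helper for illustration; not taken from the Sherlock codebase.
    """
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives: predicted `label` and the true label is `label`.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Weight this class's F1 by its share of the true labels.
        score += (n_true / total) * f1
    return score


# Toy semantic-type labels for six columns (made-up data, not from VizNet).
y_true = ["city", "city", "city", "name", "name", "year"]
y_pred = ["city", "city", "name", "name", "name", "year"]
print(support_weighted_f1(y_true, y_pred))
```

Weighting by support means frequent semantic types contribute more to the final score than rare ones, which matters when the class distribution is uneven.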

Publication Date: Jul 25, 2019
Citations: 119
License type: CC BY-NC-SA

Similar Papers

Sato
Dan Zhang ... Yoshihiko Suhara
Proceedings of the VLDB Endowment | VOL. 13
01 Aug 2020

Validating Multi-column Schema Matchings by Type
Bing Tian Dai ... Nick Koudas
01 Apr 2008

A Column Styled Composable Schema Matcher for Semantic Data-Types
Xiaofeng Liao ... Zhiming Zhao
Data Science Journal | VOL. 18
24 Jun 2019

Instance Based Schema Matching Framework Utilizing Google Similarity and Regular Expression
Osama A Mehdi ... Lilly Suriani Affendey
01 Jan 2014
