Hybrid Metadata Classification in Large-scale Structured Datasets

Sophie Pavia,Anna Pyayt,Michael Gubanov,Kazi Islam,Nick Piraino

doi:10.26421/jdi3.4-4

Open Access

https://doi.org/10.26421/jdi3.4-4

Abstract

Locating and classifying metadata is an important problem for large-scale structured datasets. For example, Web tables \cite{wt_corpus} comprise hundreds of millions of tables, but the rows (or columns) that hold attribute names are often unlabeled or labeled incorrectly. Such errors \cite{wtitles} significantly complicate data management tasks such as query processing, data integration, and indexing. Different sources and authors position metadata rows and columns differently inside a table, which makes reliable identification challenging. In this work we describe our scalable, hybrid two-layer ensemble of deep and machine learning models, which combines a Long Short-Term Memory (LSTM) network with a Naive Bayes classifier to accurately identify metadata-containing rows or columns in a table. We performed an extensive evaluation on several datasets, including an ultra-large-scale dataset of more than 15 million tables from more than 26 thousand sources, to demonstrate scalability and robustness to the variety that stems from a large number of sources. The two-layer ensemble outperformed recent approaches, and we report 95.73% accuracy at scale for the ensemble model using a regular LSTM.
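
To make the row-classification task concrete, the following is a minimal sketch of only the Naive Bayes side of such an ensemble, applied to deciding whether a table row holds metadata (attribute names) or data. It is not the authors' implementation: the abstract does not specify the features, the LSTM layer, or how the two layers are combined, so the LSTM is omitted here. The cell-shape tokens, the `cell_shape` helper, the `NaiveBayesRowClassifier` class, and the toy training rows are all hypothetical and chosen for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Assumption: a coarse "shape" token per cell is a useful feature.
# This feature set is hypothetical and not taken from the paper.
_NUMERIC = re.compile(r"[-+]?\d[\d,]*(\.\d+)?%?")


def cell_shape(cell):
    """Map a cell to one of: empty, numeric, mixed, text."""
    text = str(cell).strip()
    if not text:
        return "empty"
    if _NUMERIC.fullmatch(text):
        return "numeric"
    if any(ch.isdigit() for ch in text):
        return "mixed"
    return "text"


class NaiveBayesRowClassifier:
    """Multinomial Naive Bayes over cell-shape tokens, Laplace-smoothed."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()             # rows seen per label
        self.token_counts = defaultdict(Counter)  # label -> token -> count
        self.vocab = set()

    def fit(self, rows, labels):
        for row, label in zip(rows, labels):
            self.class_counts[label] += 1
            for cell in row:
                token = cell_shape(cell)
                self.token_counts[label][token] += 1
                self.vocab.add(token)
        return self

    def log_scores(self, row):
        """Return the unnormalized log posterior for each label."""
        total_rows = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total_rows)  # log prior
            denom = (sum(self.token_counts[label].values())
                     + self.alpha * len(self.vocab))
            for cell in row:
                seen = self.token_counts[label][cell_shape(cell)]
                score += math.log((seen + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


# Toy training data (hypothetical): header rows versus data rows.
train_rows = [
    ["Name", "Age", "City"],
    ["Alice", "34", "Tampa"],
    ["Bob", "27", "Miami"],
    ["Product", "Price", "Stock"],
    ["Widget", "9.99", "120"],
    ["Gadget", "19.50", "48"],
]
train_labels = ["metadata", "data", "data", "metadata", "data", "data"]

clf = NaiveBayesRowClassifier().fit(train_rows, train_labels)
header_guess = clf.predict(["Country", "Population", "Capital"])
data_guess = clf.predict(["Carol", "41", "Boston"])
```

In this sketch an all-text row scores higher as metadata, while a row that mixes text and numbers scores higher as data. A sequence model such as an LSTM could, in principle, add evidence from the cell contents themselves, but that combination is left out because the abstract does not describe it.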

Full Text