Abstract

Metadata location and classification is an important problem for large-scale structured datasets. For example, Web tables \cite{wt_corpus} have hundreds of millions of tables, but often have missing or incorrect labels for rows (or columns) with attribute names. Such errors \cite{wtitles} significantly complicate all data management tasks such as {\em query processing, data integration, indexing}, etc. Different sources or authors position metadata rows/columns differently inside a table, which makes its reliable identification challenging.In this work we describe our scalable, hybrid two-layer Deep- and Machine-learning based ensemble, combining Long Short Term Memory (LSTM) and Naive Bayes Classifier to accurately identify Metadata-containing rows or columns in a table. We have performed an extensive evaluation on several datasets, including an ultra large-scale dataset containing more than 15 million tables coming from more than 26 thousands of sources to justify scalability and resistance to variety, stemming from a large number of sources. We observed superiority of this two-layer ensemble, compared to the recent previous approaches and report an impressive 95.73\text{\%} accuracy at scale with our ensemble model using regular LSTM.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call