A clustering approach to extract data from HTML tables

Patricia Jiménez, Juan C Roldán, Rafael Corchuelo

doi:10.1016/j.ipm.2021.102683

Open Access

Journal: Information Processing and Management	Publication Date: Aug 13, 2021
Citations: 7	License type: cc-by-nc-nd

Affiliation: Universidad de Sevilla

Abstract

HTML tables have become pervasive on the Web. Extracting their data automatically is difficult because the many different layouts, encodings, and formats in use make it hard to find the relationships between their cells. In this article, we introduce Melva, an unsupervised, domain-agnostic proposal to extract data from HTML tables without requiring any external knowledge base. It relies on a clustering approach that helps tell label cells apart from value cells and establish their relationships. We compared Melva to four competitors on more than 3000 HTML tables from Wikipedia and the Dresden Web Table Corpus. The conclusion is that our proposal is 21.70% better than the best unsupervised competitor and equals the best supervised competitor regarding effectiveness, but it is 99.14% better regarding efficiency.
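
To make the general idea concrete, the following is a minimal sketch of clustering table cells into two groups and then pairing values with labels. It is not Melva's algorithm: the three cell features, the plain 2-means routine, the rule that the cluster containing the top-left cell holds the labels, and the toy table are all assumptions made for illustration only.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect (row, col, tag, text) for every <th>/<td> cell in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._row, self._col = -1, -1
        self._tag, self._text = None, []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row, self._col = self._row + 1, -1
        elif tag in ("th", "td"):
            self._col += 1
            self._tag, self._text = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            text = "".join(self._text).strip()
            self.cells.append((self._row, self._col, tag, text))
            self._tag = None


def features(cell):
    """Hypothetical features: header-tag flag, first-row flag, share of digits."""
    row, _col, tag, text = cell
    digits = sum(ch.isdigit() for ch in text)
    return (float(tag == "th"), float(row == 0), digits / max(len(text), 1))


def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points, iterations=10):
    """Plain 2-means, seeded with the first point and the point farthest from it."""
    centroids = [points[0], max(points, key=lambda p: dist2(p, points[0]))]
    assignment = []
    for _ in range(iterations):
        assignment = [min((0, 1), key=lambda k: dist2(p, centroids[k])) for p in points]
        for k in (0, 1):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment


# Toy table with made-up content.
html = """
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Pen</td><td>2</td></tr>
  <tr><td>Book</td><td>15</td></tr>
</table>
"""

collector = CellCollector()
collector.feed(html)
cells = collector.cells
assignment = two_means([features(c) for c in cells])

# Assumption: the cluster that contains the top-left cell holds the labels.
label_cluster = assignment[0]
labels_by_col = {c[1]: c[3] for c, a in zip(cells, assignment) if a == label_cluster}

# Relate each value cell to the label found in the same column.
records = {}
for (row, col, _tag, text), a in zip(cells, assignment):
    if a != label_cluster:
        records.setdefault(row, {})[labels_by_col[col]] = text
rows = [records[r] for r in sorted(records)]
print(rows)
```

On this toy input the two header cells fall into one cluster and the four data cells into the other, so each data row becomes a label-to-value mapping. Real tables with vertical, nested, or unmarked headers would need richer features and a less naive pairing rule than this sketch uses.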

Full Text