Web-scale extraction of structured data

Michael J Cafarella, Alon Halevy, Jayant Madhavan

doi:10.1145/1519103.1519112

Open Access

https://doi.org/10.1145/1519103.1519112

Abstract

A long-standing goal of Web research has been to construct a unified Web knowledge base. Information extraction techniques have shown good results on Web inputs, but even most domain-independent ones are not appropriate for Web-scale operation. In this paper we describe three recent extraction systems that can be operated on the entire Web (two of which come from Google Research). The TextRunner system focuses on raw natural language text, the WebTables system focuses on HTML-embedded tables, and the deep-web surfacing system focuses on "hidden" databases. The domain, expressiveness, and accuracy of extracted data can depend strongly on its source extractor; we describe differences in the characteristics of data produced by the three extractors. Finally, we discuss a series of unique data applications (some of which have already been prototyped) that are enabled by aggregating extracted Web information.

Journal: ACM SIGMOD Record
Publication Date: Mar 20, 2009
Citations: 92

Similar Papers

Limitations of information extraction methods and techniques for heterogeneous unstructured big data
Kiran Adnan ... Rehan Akbar
International Journal of Engineering Business Management | VOL. 11
01 Jan 2019

Information Extraction with Humans in the Loop
Anna Lisa Gentile
13 May 2019

Uncertainty Reduction for Knowledge Discovery and Information Extraction on the World Wide Web
Heng Ji ... Hongbo Deng
Proceedings of the IEEE | VOL. 100
01 Sep 2012

The effects of high quality translations of named entities in cross-language information exploration
Dan Wu ... Ralph Grishman
01 Oct 2008
