Ranking Deep Web Text Collections for Scalable Information Extraction

Pablo Barrio, Chris Develder, Luis Gravano

doi:10.1145/2806416.2806581

Abstract

Information extraction (IE) systems discover structured information from natural language text, to enable much richer querying and data mining than possible directly over the unstructured text. Unfortunately, IE is generally a computationally expensive process, and hence improving its efficiency, so that it scales over large volumes of text, is of critical importance. State-of-the-art approaches for scaling the IE process focus on one text collection at a time. These approaches prioritize the extraction effort by learning keyword queries to identify the "useful" documents for the IE task at hand, namely, those that lead to the extraction of structured "tuples." These approaches, however, do not attempt to predict which text collections are useful for the IE task---and hence merit further processing---and which ones will not contribute any useful output---and hence should be ignored altogether, for efficiency. In this paper, we focus on an especially valuable family of text sources, the so-called deep web collections, whose (remote) contents are only accessible via querying. Specifically, we introduce and study techniques for ranking deep web collections for an IE task, to prioritize the extraction effort by focusing on collections with substantial numbers of useful documents for the task. We study both (adaptations of) state-of-the-art resource selection strategies for distributed information retrieval, and IE-specific approaches. Our extensive experimental evaluation over realistic deep web collections, and for several different IE tasks, shows the merits and limitations of the alternative families of approaches, and provides a roadmap for addressing this critically important building block for efficient, scalable information extraction.

Publication Date: Oct 17, 2015
Citations: 39
License type: other-oa

Similar Papers

InfoXtract
Rohini K Srihari ... Wei Li
01 Jan 2003

Sampling strategies for information extraction over the deep web
Pablo Barrio ... Luis Gravano
Information Processing & Management | VOL. 53
06 Dec 2016

InfoXtract: A customizable intermediate level information extraction engine
Rohini K Srihari ... Thomas Cornell
Natural Language Engineering | VOL. 14
09 Jun 2006

Ontology-Based Information Extraction from Free-Form Text
Ronald Braun
06 Oct 2000
