Join Optimization of Information Extraction Output: Quality Matters!

Alpa Jain, Panagiotis G Ipeirotis, Anhai Doan, Luis Gravano

doi:10.1109/icde.2009.138

Abstract

Information extraction (IE) systems are trained to extract specific relations from text databases. Real-world applications often require that the output of multiple IE systems be joined to produce the data of interest. To optimize the execution of a join of multiple extracted relations, it is not sufficient to consider only execution time. In fact, the quality of the join output is of critical importance: unlike in the relational world, different join execution plans can produce join results of widely different quality whenever IE systems are involved. In this paper, we develop a principled approach to understand, estimate, and incorporate output quality into the join optimization process over extracted relations. We argue that the output quality is affected by (a) the configuration of the IE systems used to process documents, (b) the document retrieval strategies used to retrieve documents, and (c) the actual join algorithm used. Our analysis considers several alternatives for these factors and predicts both the output quality and the execution time of the alternative execution plans. We establish the accuracy of our analytical models, and study the effectiveness of a quality-aware join optimizer, with a large-scale experimental evaluation over real-world text collections and state-of-the-art IE systems.
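To make the idea of a quality-aware plan choice concrete, here is a minimal sketch under stated assumptions; it is not the optimizer from the paper. It enumerates plans over the three factors named in the abstract (IE configuration, document retrieval strategy, join algorithm) and picks the fastest plan whose estimated output meets user-given quality bounds. The names `Plan`, `Estimate`, `enumerate_plans`, `choose_plan`, and `toy_estimate`, the option labels, and every number are hypothetical placeholders; the paper's analytical models would take the place of `toy_estimate`.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Plan:
    """One execution plan: a choice for each of the three factors."""
    ie_config: str
    retrieval: str
    join_algorithm: str


@dataclass(frozen=True)
class Estimate:
    """Predicted cost and output quality of a plan (all values hypothetical)."""
    seconds: float
    good_tuples: float  # expected correct join tuples
    bad_tuples: float   # expected incorrect join tuples


def enumerate_plans(ie_configs: Iterable[str],
                    retrievals: Iterable[str],
                    join_algorithms: Iterable[str]) -> List[Plan]:
    """Cross product of the alternatives for each factor."""
    return [Plan(c, r, j)
            for c, r, j in product(ie_configs, retrievals, join_algorithms)]


def choose_plan(plans: Iterable[Plan],
                estimate: Callable[[Plan], Estimate],
                min_good: float,
                max_bad: float) -> Optional[Plan]:
    """Return the fastest plan meeting both quality bounds, or None."""
    feasible = []
    for plan in plans:
        e = estimate(plan)
        if e.good_tuples >= min_good and e.bad_tuples <= max_bad:
            feasible.append((e.seconds, plan))
    if not feasible:
        return None
    # Compare on estimated time only; Plan objects are never compared.
    return min(feasible, key=lambda pair: pair[0])[1]


def toy_estimate(plan: Plan) -> Estimate:
    """Placeholder estimator with made-up numbers, for illustration only."""
    good, bad = {"strict": (100.0, 5.0),
                 "permissive": (160.0, 40.0)}[plan.ie_config]
    coverage = {"scan": 1.0, "query-based": 0.5}[plan.retrieval]
    retrieval_cost = {"scan": 60.0, "query-based": 20.0}[plan.retrieval]
    join_cost = {"join-a": 10.0, "join-b": 30.0}[plan.join_algorithm]
    return Estimate(seconds=retrieval_cost + join_cost,
                    good_tuples=good * coverage,
                    bad_tuples=bad * coverage)


plans = enumerate_plans(["strict", "permissive"],
                        ["scan", "query-based"],
                        ["join-a", "join-b"])
best = choose_plan(plans, toy_estimate, min_good=75, max_bad=25)
print(best)
```

Tightening `max_bad` or raising `min_good` changes which plan is chosen, which mirrors the abstract's point that execution time alone does not determine the right plan.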

