Document Spanners: From Expressive Power to Decision Problems

Dominik D Freydenberger, Mario Holldack

doi:10.1007/s00224-017-9770-0

Open Access

https://doi.org/10.1007/s00224-017-9770-0

Abstract

We examine document spanners, a formal framework for information extraction that was introduced by Fagin, Kimelfeld, Reiss, and Vansummeren (PODS 2013, JACM 2015). A document spanner is a function that maps an input string to a relation over spans (intervals of positions of the string). We focus on document spanners that are defined by regex formulas, which are basically regular expressions that map matched subexpressions to corresponding spans, and on core spanners, which extend the former by standard algebraic operators and string equality selection. First, we compare the expressive power of core spanners to three models – namely, patterns, word equations, and a rich and natural subclass of extended regular expressions (regular expressions with a repetition operator). These results are then used to analyze the complexity of query evaluation and various aspects of static analysis of core spanners. Finally, we examine the relative succinctness of different kinds of representations of core spanners and relate this to the simplification of core spanners that are extended with difference operators.

Highlights

Information Extraction (IE) is the task of automatically extracting structured information from texts.
The primitive building blocks of core spanners are regex formulas, which are regular expressions with variables. Each of these variables corresponds to a subexpression, and whenever a regex formula α matches a string w, each variable is mapped to the span in w that matches that subexpression.
We show that core spanners can recognize pattern languages. From this we conclude that evaluation of Boolean core spanners is NP-hard and that spanner containment is undecidable.
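
To make the notion of a pattern language concrete, here is a minimal Python sketch of a brute-force membership test. It is an illustration only, not the construction used in the paper. The encoding is an assumption made for this example: a pattern is a string in which uppercase letters are variables and all other characters are terminals, and the hypothetical function in_pattern_language checks whether some substitution of the variables turns the pattern into w. The exhaustive search over substitutions reflects why membership is costly in general.

```python
def in_pattern_language(pattern, w, erasing=False):
    """Brute-force test: can the variables of `pattern` be replaced by
    terminal words so that the result equals `w`?

    Assumed encoding (for illustration): uppercase letters are variables,
    every other character is a terminal. With erasing=False each variable
    must receive a nonempty word; with erasing=True the empty word is allowed.
    """
    def go(p, pos, env):
        if p == len(pattern):
            return pos == len(w)
        sym = pattern[p]
        if not sym.isupper():
            # Terminal: must match the next character of w.
            return w.startswith(sym, pos) and go(p + 1, pos + 1, env)
        if sym in env:
            # Variable already bound: its value must reoccur here.
            value = env[sym]
            return w.startswith(value, pos) and go(p + 1, pos + len(value), env)
        # Unbound variable: try every admissible prefix of the remaining input.
        shortest = 0 if erasing else 1
        for k in range(shortest, len(w) - pos + 1):
            env[sym] = w[pos:pos + k]
            if go(p + 1, pos + k, env):
                return True
            del env[sym]
        return False

    return go(0, 0, {})
```

For example, under this encoding the pattern XX describes the words that consist of two copies of the same nonempty word, so abab is a member and aba is not.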

Summary

Introduction

Information Extraction (IE) is the task of automatically extracting structured information from texts. This paper examines document spanners (called spanners), a formalization of the IE query language AQL, which is used in IBM’s SystemT. The primitive building blocks of core spanners are regex formulas, which are regular expressions with variables. Each of these variables corresponds to a subexpression, and whenever a regex formula α matches a string w, each variable is mapped to the span in w that matches that subexpression. Each match of α on w determines a tuple of spans, and since a regex formula can match a string in multiple ways, this process creates a relation over spans of w. Core spanners are defined by extending regex formulas with the relational operations projection, union, natural join, and string equality selection.
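
The following Python sketch illustrates how a regex formula yields a relation over spans and how string equality selection filters that relation. It is a simplified illustration under stated assumptions, not the formal semantics of the paper. It only handles formulas of the shape Σ* x{γ} Σ* with a single variable x, it writes a span as a pair (i, j) of 1-based positions with 1 ≤ i ≤ j ≤ |w|+1 whose content is the substring from position i to position j-1, and all function names are hypothetical.

```python
import re
from itertools import product


def spans(w):
    """All spans (i, j) of w with 1 <= i <= j <= len(w) + 1."""
    n = len(w)
    return [(i, j) for i in range(1, n + 2) for j in range(i, n + 2)]


def content(w, span):
    """Substring of w selected by a span (1-based, end position exclusive)."""
    i, j = span
    return w[i - 1:j - 1]


def single_variable_spanner(w, inner):
    """Relation for the formula  Sigma* x{inner} Sigma*  with one variable x.

    Every span whose content fully matches `inner` gives one tuple.
    """
    rx = re.compile(inner)
    return {(s,) for s in spans(w) if rx.fullmatch(content(w, s))}


def join_disjoint(r1, r2):
    """Natural join of two relations that share no variables (a product)."""
    return {t1 + t2 for t1, t2 in product(r1, r2)}


def select_equal(w, relation, a, b):
    """String equality selection on columns a and b: keep the tuples whose
    two spans have equal content (the spans themselves may differ)."""
    return {t for t in relation if content(w, t[a]) == content(w, t[b])}


w = "abab"
x_rel = single_variable_spanner(w, "ab")      # spans of w containing "ab"
pairs = join_disjoint(single_variable_spanner(w, "[ab]+"),
                      single_variable_spanner(w, "[ab]+"))
same = select_equal(w, pairs, 0, 1)           # pairs of spans, equal content
```

On w = abab, x_rel contains the two spans (1, 3) and (3, 5). The selection keeps a pair such as ((1, 3), (3, 5)), since both spans contain ab, which shows that string equality compares span contents rather than positions.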
