Abstract

In order to extract entities of a fine-grained category from semi-structured data in web pages, existing information extraction systems rely on seed examples or redundancy across multiple web pages. In this paper, we consider a new zero-shot learning task of extracting entities specified by a natural language query (in place of seeds) given only a single web page. Our approach defines a log-linear model over latent extraction predicates, which select lists of entities from the web page. The main challenge is to define features on widely varying candidate entity lists. We tackle this by

Highlights

  • We consider the task of extracting entities of a given category from web pages

  • We propose a novel task, zero-shot entity extraction, where the specification of the desired entities is provided as a natural language query

  • Our work shares a foundation with the wrapper induction literature (Kushmerick, 1997) in that it leverages regularities of web page structures

Summary

Introduction

We consider the task of extracting entities of a given category (e.g., hiking trails) from web pages. Previous approaches either (i) assume that the same entities appear on multiple web pages, or (ii) require information such as seed examples (Etzioni et al., 2005; Wang and Cohen, 2009; Dalvi et al., 2012). These approaches work well for common categories but encounter data sparsity problems for more specific categories, such as the products of a small company or the dishes at a local restaurant. In this context, we may have only a single web page that contains the information we need and no seed examples.

