Abstract

Our goal is to construct a domain-targeted, high precision knowledge base (KB), containing general (subject,predicate,object) statements about the world, in support of a downstream question-answering (QA) application. Despite recent advances in information extraction (IE) techniques, no suitable resource for our task already exists; existing resources are either too noisy, too named-entity centric, or too incomplete, and typically have not been constructed with a clear scope or purpose. To address these, we have created a domain-targeted, high precision knowledge extraction pipeline, leveraging Open IE, crowdsourcing, and a novel canonical schema learning algorithm (called CASI), that produces high precision knowledge targeted to a particular domain - in our case, elementary science. To measure the KB’s coverage of the target domain’s knowledge (its “comprehensiveness” with respect to science) we measure recall with respect to an independent corpus of domain text, and show that our pipeline produces output with over 80% precision and 23% recall with respect to that target, a substantially higher coverage of tuple-expressible science knowledge than other comparable resources. We have made the KB publicly available.

Highlights

  • While there have been substantial advances in knowledge extraction techniques, the availability of high precision, general knowledge about the world remains limited: existing resources offer limited coverage of general knowledge (e.g., FreeBase and NELL primarily contain knowledge about Named Entities; WordNet uses only a few (< 10) semantic relations) or suffer from low precision. Our goal in this work is to create a domain-targeted knowledge extraction pipeline that can overcome these limitations and output a high precision knowledge base (KB) of triples relevant to our end task

  • In the automatic KB construction literature, while a KB’s size is often reported, this does not reveal whether the KB is near-complete or merely a drop in the ocean of the knowledge required (Razniewski et al., 2016; Stanovsky and Dagan, 2016)

  • We define comprehensiveness as: recall at high (> 80%) precision of domain-relevant facts. This measure is similar to recall at the point P=80% on the PR curve, except recall is measured with respect to a different distribution of facts rather than a held-out sample of the data used to build the KB (see the code sketch after this list)
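The comprehensiveness measure above can be read as ordinary recall, computed only over the KB's high-precision portion and against an independently drawn set of domain facts. The sketch below is a minimal illustration of that reading, not code from the paper; the names comprehensiveness, kb_tuples, target_facts, and precision_threshold are all assumed for illustration.

    def comprehensiveness(kb_tuples, target_facts, precision_threshold=0.8):
        """Recall of domain-relevant facts at high precision (illustrative sketch).

        kb_tuples: dict mapping (subject, predicate, object) -> predicted precision.
        target_facts: set of (subject, predicate, object) facts drawn from an
            independent domain corpus, not a held-out split of the KB's own data.
        """
        # Keep only the portion of the KB predicted to clear the high-precision bar.
        high_precision_kb = {t for t, p in kb_tuples.items() if p >= precision_threshold}
        if not target_facts:
            return 0.0
        # Comprehensiveness = fraction of target facts covered by that portion.
        return len(target_facts & high_precision_kb) / len(target_facts)

Measuring target_facts against an independent corpus, rather than a held-out split of the KB's own extraction data, is what distinguishes this from a standard point on the precision-recall curve.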


Summary

Introduction

While there have been substantial advances in knowledge extraction techniques, the availability of high precision, general knowledge about the world remains limited. Existing resources suffer from limited coverage of general knowledge (e.g., FreeBase and NELL primarily contain knowledge about Named Entities; WordNet uses only a few (< 10) semantic relations) and from low precision (e.g., many ConceptNet assertions express idiosyncratic rather than general knowledge). Our goal in this work is to create a domain-targeted knowledge extraction pipeline that can overcome these limitations and output a high precision KB of triples relevant to our end task. We present a high precision extraction pipeline able to extract (subject,predicate,object) tuples relevant to a domain with precision in excess of 80%. The input to the pipeline is a corpus, a sense-disambiguated domain vocabulary, and a small set of entity types. The pipeline uses a combination of text filtering, Open IE, Turker annotation on samples, and precision prediction to generate its output.
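To make the pipeline's stages concrete, the following is a minimal, hypothetical Python sketch under simplifying assumptions: filter_domain_sentences, run_pipeline, and the open_ie and precision_model callables are illustrative names rather than the paper's implementation, and sense disambiguation, entity typing, and the Turker annotation step (which supplies labels for the precision predictor) are reduced to comments.

    from typing import Callable, List, Set, Tuple

    Triple = Tuple[str, str, str]  # a (subject, predicate, object) statement

    def filter_domain_sentences(sentences: List[str], vocab: Set[str]) -> List[str]:
        # Text filtering: keep only sentences that mention a domain vocabulary term
        # (vocabulary terms are assumed to be lowercase here).
        return [s for s in sentences if any(term in s.lower() for term in vocab)]

    def run_pipeline(sentences: List[str],
                     vocab: Set[str],
                     open_ie: Callable[[str], List[Triple]],
                     precision_model: Callable[[Triple], float],
                     threshold: float = 0.8) -> List[Triple]:
        # 1. Text filtering against the domain vocabulary.
        domain_sentences = filter_domain_sentences(sentences, vocab)
        # 2. Open IE: extract candidate (subject, predicate, object) tuples.
        candidates = [t for s in domain_sentences for t in open_ie(s)]
        # 3. Precision prediction: keep only tuples predicted to clear the
        #    high-precision bar. (Turker annotation of sampled tuples would
        #    supply the labels used to train precision_model.)
        return [t for t in candidates if precision_model(t) >= threshold]

The point to notice is that the final filter is a predicted-precision threshold, which is what lets the output be tuned toward the > 80% precision target described above.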

