Abstract

We propose methods for extracting triples from Wikipedia’s HTML tables using a reference knowledge graph. Our methods take a distant-supervision approach: for pairs of entities on the same row of a table, we look up existing triples in the knowledge graph, and then postulate the same relation for the entity pairs in the corresponding columns of the other rows, thereby extracting novel candidate triples. Binary classifiers are then applied to these candidates to detect correct triples and thus increase the precision of the output. We extend this approach with a preliminary step in which we first group and merge similar tables, and then apply extraction to the larger merged tables. More specifically, we propose an observed schema for individual tables, which is used to group and merge tables. We compare the precision and the number of triples extracted with and without table merging, and show that merging allows us to extract a larger number of triples at a similar precision. Ultimately, from the tables of English Wikipedia, we extract 5.9 million novel and unique triples for Wikidata at an estimated precision of 0.718.
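The candidate-generation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `extract_candidates`, the representation of a table as rows of already-linked entity identifiers, the representation of the knowledge graph as a set of triples, and the identifiers in the example are all assumptions made for illustration. The classification and table-merging steps are omitted.

```python
from collections import Counter
from itertools import permutations


def extract_candidates(table, kg):
    """Propose novel triples from one table by distant supervision.

    table: list of rows of equal length; each cell is an entity id or None
           (assumption: cells were linked to entities beforehand).
    kg:    set of (subject, predicate, object) triples (assumed format).
    Returns (candidates, support), where support[(i, j)][p] counts the rows
    in which the knowledge graph already relates column i to column j via p.
    """
    # Index known predicates by (subject, object) pair for fast lookup.
    relations = {}
    for s, p, o in kg:
        relations.setdefault((s, o), set()).add(p)

    n_cols = len(table[0]) if table else 0

    # Step 1: find predicates that already hold between entities on a row.
    support = {}
    for row in table:
        for i, j in permutations(range(n_cols), 2):
            s, o = row[i], row[j]
            if s is None or o is None:
                continue
            for p in relations.get((s, o), ()):
                support.setdefault((i, j), Counter())[p] += 1

    # Step 2: postulate each such predicate for the same column pair on
    # every row, keeping only triples the knowledge graph does not contain.
    candidates = set()
    for (i, j), predicates in support.items():
        for p in predicates:
            for row in table:
                s, o = row[i], row[j]
                if s is None or o is None:
                    continue
                if (s, p, o) not in kg:
                    candidates.add((s, p, o))
    return candidates, support


# Hypothetical identifiers, chosen only to show the data flow.
table = [["Q1", "Q2"], ["Q3", "Q4"], ["Q5", None]]
kg = {("Q1", "P17", "Q2")}
candidates, support = extract_candidates(table, kg)
print(candidates)
```

In this sketch the per-column-pair support counts are returned alongside the candidates because they are a plausible input for the later classification step; which features the classifiers actually use is not stated here.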

