Abstract
Open Information Extraction (Open IE) systems have traditionally been evaluated via manual annotation. Recently, an automated evaluator with a benchmark dataset (OIE2016) was released – it scores Open IE systems automatically by matching system predictions with predictions in the benchmark dataset. Unfortunately, our analysis reveals that its data is rather noisy, and the tuple matching in the evaluator has issues, making the results of automated comparisons less trustworthy. We contribute CaRB, an improved dataset and framework for testing Open IE systems. To the best of our knowledge, CaRB is the first crowdsourced Open IE dataset, and it also makes substantive changes in the matching code and metrics. NLP experts judge CaRB’s dataset to be more accurate than OIE2016’s. Moreover, we find that on one pair of Open IE systems, the CaRB framework produces results that contradict OIE2016’s. Human assessment verifies that CaRB’s ranking of the two systems is the accurate one. We release the CaRB framework along with its crowdsourced dataset.
Highlights
Open Information Extraction (Open IE) refers to the task of forming relational tuples from sentences without a fixed relation vocabulary (Banko et al., 2007).
We test the Open IE systems evaluated in Stanovsky and Dagan (2016) using the CaRB dataset and scorer.
It can be seen that the curve for PropS lies above ClausIE’s throughout in OIE2016, yet PropS performs the worst of all systems in CaRB.
Summary
Open Information Extraction (Open IE) refers to the task of forming relational tuples from sentences without a fixed relation vocabulary (Banko et al., 2007). With the advent of so many systems, it is imperative to have a standardized mechanism for automatic evaluation so that they can be compared. These systems have been evaluated over small, manually curated gold datasets (e.g., Fader et al., 2011; Mausam et al., 2012). Some standard benchmark datasets and evaluators have been proposed: OIE2016 (Stanovsky and Dagan, 2016), RelVis (Schneider et al., 2017), and WiRe57 (Lechelle et al., 2018). These datasets are either too small or too noisy to meaningfully compare Open IE systems.
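To make the evaluation idea concrete, the sketch below shows a simplified token-level tuple matcher in the spirit of CaRB’s scorer: each prediction is matched to its best-overlapping gold tuple, and token overlap yields per-tuple precision and recall. This is an illustrative simplification, not the official CaRB implementation (which, among other things, matches in both directions and weights the tuple slots differently); the function names and the greedy matching strategy are assumptions for the example.

```python
from collections import Counter

def tokens(tup):
    """Flatten an (arg1, relation, arg2) tuple into lowercase tokens."""
    return [t.lower() for part in tup for t in part.split()]

def overlap(pred, gold):
    """Count tokens shared between two tuples (multiset intersection)."""
    p, g = Counter(tokens(pred)), Counter(tokens(gold))
    return sum((p & g).values())

def score(pred_tuples, gold_tuples):
    """Match each prediction greedily to its best gold tuple, then
    average token-level precision and recall over the predictions.
    (A simplification: CaRB's actual scorer is more involved.)"""
    if not pred_tuples or not gold_tuples:
        return 0.0, 0.0
    precisions, recalls = [], []
    for pred in pred_tuples:
        best = max(gold_tuples, key=lambda g: overlap(pred, g))
        shared = overlap(pred, best)
        precisions.append(shared / len(tokens(pred)))
        recalls.append(shared / len(tokens(best)))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

# Example: a prediction that drops words from the gold relation keeps
# full precision but loses recall.
gold = [("Obama", "was born in", "Hawaii")]
pred = [("Obama", "born", "Hawaii")]
p, r = score(pred, gold)  # p = 1.0, r = 0.6
```

Varying the confidence threshold over a system’s extractions and recomputing such precision/recall pairs is what produces the precision-recall curves used to compare systems.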