Abstract
Benchmarks can be a useful step toward the goals of the field (when the benchmark is on the critical path), as demonstrated by the GLUE benchmark, and deep nets such as BERT and ERNIE. The case for other benchmarks such as MUSE and WN18RR is less well established. Hopefully, these benchmarks are on a critical path toward progress on bilingual lexicon induction (BLI) and knowledge graph completion (KGC). Many KGC algorithms have been proposed such as Trans[DEHRM], but it remains to be seen how this work improves WordNet coverage. Given how much work is based on these benchmarks, the literature should have more to say than it does about the connection between benchmarks and goals. Is optimizing P@10 on WN18RR likely to produce more complete knowledge graphs? Is MUSE likely to improve Machine Translation?
Highlights
Many papers in top conferences these days propose methods and test them on standard benchmarks such as General Language Understanding Evaluation (GLUE) (Wang et al., 2018), Multilingual Unsupervised and Supervised Embeddings (MUSE) (Conneau et al., 2017), and WordNet 18 with reduced relations (WN18RR) (Dettmers et al., 2018).
Benchmarks have a way of taking on a life of their own.
Before discussing some of the history behind MUSE and WN18RR, benchmarks for bilingual lexicon induction (BLI) and knowledge graph completion (KGC), it is useful to say a few words about goals.
Summary
Many (perhaps most) papers in top conferences these days propose methods and test them on standard benchmarks such as General Language Understanding Evaluation (GLUE) (Wang et al., 2018), Multilingual Unsupervised and Supervised Embeddings (MUSE) (Conneau et al., 2017), and WordNet 18 with reduced relations (WN18RR) (Dettmers et al., 2018). Some of these methods (Bidirectional Encoder Representations from Transformers (BERT), Devlin et al., 2019; enhanced representation through knowledge integration (ERNIE), Sun et al., 2020) do well on benchmarks and do well on tasks that we care about. Despite large numbers of papers with promising performance on benchmarks, there is remarkably little evidence of generalization beyond benchmarks. This history is quickly forgotten as attention moves to SOTA numbers, and away from sensible (credible and worthwhile) motivations.