Abstract

Benchmarks can be a useful step toward the goals of the field (when the benchmark is on the critical path), as demonstrated by the GLUE benchmark and deep nets such as BERT and ERNIE. The case for other benchmarks such as MUSE and WN18RR is less well established. Hopefully, these benchmarks are on a critical path toward progress on bilingual lexicon induction (BLI) and knowledge graph completion (KGC). Many KGC algorithms have been proposed, such as Trans[DEHRM], but it remains to be seen how this work improves WordNet coverage. Given how much work is based on these benchmarks, the literature should have more to say than it does about the connection between benchmarks and goals. Is optimizing P@10 on WN18RR likely to produce more complete knowledge graphs? Is MUSE likely to improve Machine Translation?
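To make the P@10 question concrete: in KGC link prediction, P@10 (often called hits@10) is the fraction of held-out triples for which the model ranks the true entity among its top ten candidates. Below is a minimal sketch under a TransE-style scoring function; the function and variable names are illustrative, not from the paper or from any particular library.

```python
import numpy as np

def hits_at_10(test_triples, entity_emb, rel_emb):
    """P@10 (hits@10) for tail prediction: the fraction of held-out
    (head, relation, tail) triples whose true tail is among the model's
    ten most plausible candidate entities."""
    hits = 0
    for h_id, r_id, t_id in test_triples:
        # TransE-style scoring: a triple is plausible when head + relation ≈ tail,
        # so rank all candidate tails by distance ||h + r - t|| (smaller is better).
        dist = np.linalg.norm(entity_emb[h_id] + rel_emb[r_id] - entity_emb, axis=1)
        top10 = np.argsort(dist)[:10]
        hits += int(t_id in top10)
    return hits / len(test_triples)

# Toy usage with random embeddings (illustrative only):
rng = np.random.default_rng(0)
entity_emb = rng.normal(size=(40943, 50))   # WN18RR has ~40k entities
rel_emb = rng.normal(size=(11, 50))         # and 11 relation types
print(hits_at_10([(0, 3, 17), (5, 1, 2)], entity_emb, rel_emb))
```

Note that published WN18RR numbers typically use the stricter "filtered" setting, which removes other known-true triples from the candidate list before ranking; the sketch above is the simpler "raw" setting.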

Highlights

  • Many papers in top conferences these days propose methods and test them on standard benchmarks such as General Language Understanding Evaluation (GLUE) (Wang et al. 2018), Multilingual Unsupervised and Supervised Embeddings (MUSE) (Conneau et al. 2017), and WordNet 18 reduced relations (WN18RR) (Dettmers et al. 2018)

  • Benchmarks have a way of taking on a life of their own

  • Before discussing some of the history behind MUSE and WN18RR, benchmarks for bilingual lexicon induction (BLI) and knowledge graph completion (KGC), it is useful to say a few words about goals

Introduction

Many (perhaps most) papers in top conferences these days propose methods and test them on standard benchmarks such as General Language Understanding Evaluation (GLUE) (Wang et al. 2018), Multilingual Unsupervised and Supervised Embeddings (MUSE) (Conneau et al. 2017), and WordNet 18 reduced relations (WN18RR) (Dettmers et al. 2018). Some of these methods (Bidirectional Encoder Representations from Transformers (BERT), Devlin et al. 2019; enhanced representation through knowledge integration (ERNIE), Sun et al. 2020) do well on benchmarks and do well on tasks that we care about. Despite large numbers of papers with promising performance on benchmarks, there is remarkably little evidence of generalization beyond benchmarks. This history is quickly forgotten as attention moves to SOTA numbers, and away from sensible (credible and worthwhile) motivations.
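For readers unfamiliar with how MUSE scores BLI: systems are evaluated by precision@k against a held-out bilingual test dictionary. The sketch below assumes source and target embeddings already mapped into a shared space and uses plain cosine nearest-neighbor retrieval (MUSE itself usually reports CSLS retrieval, a hubness-corrected variant); all names here are illustrative.

```python
import numpy as np

def bli_precision_at_k(src_emb, tgt_emb, test_dict, k=1):
    """Precision@k for BLI: for each source word, retrieve the k nearest
    target words in the shared embedding space and check whether a gold
    translation from the test dictionary is among them."""
    # L2-normalize so a dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    correct = 0
    for src_id, gold_ids in test_dict.items():
        sims = tgt @ src[src_id]           # similarity to every target word
        top_k = np.argsort(-sims)[:k]      # k nearest target neighbors
        correct += int(bool(set(top_k) & set(gold_ids)))
    return correct / len(test_dict)
```

Whether gains on this metric transfer to downstream tasks such as Machine Translation is exactly the kind of benchmark-versus-goal question the paper raises.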

The goal
Background
WordNet is incomplete because it is too English-centric
Comparisons of WordNet and MUSE
WordNet as a database
History and motivation for BLI and MUSE benchmark
Conclusions