Abstract
Benchmarks can be a useful step toward the goals of the field (when the benchmark is on the critical path), as demonstrated by the GLUE benchmark, and deep nets such as BERT and ERNIE. The case for other benchmarks such as MUSE and WN18RR is less well established. Hopefully, these benchmarks are on a critical path toward progress on bilingual lexicon induction (BLI) and knowledge graph completion (KGC). Many KGC algorithms have been proposed such as Trans[DEHRM], but it remains to be seen how this work improves WordNet coverage. Given how much work is based on these benchmarks, the literature should have more to say than it does about the connection between benchmarks and goals. Is optimizing P@10 on WN18RR likely to produce more complete knowledge graphs? Is MUSE likely to improve Machine Translation?
Highlights
Many papers in top conferences these days propose methods and test them on standard benchmarks such as General Language Understanding Evaluation (GLUE) (Wang et al., 2018), Multilingual Unsupervised and Supervised Embeddings (MUSE) (Conneau et al., 2017), and WordNet 18 with reduced relations (WN18RR) (Dettmers et al., 2018).
Benchmarks have a way of taking on a life of their own.
Before discussing some of the history behind MUSE and WN18RR, benchmarks for bilingual lexicon induction (BLI) and knowledge graph completion (KGC), it is useful to say a few words about goals.
Summary
Many (perhaps most) papers in top conferences these days propose methods and test them on standard benchmarks such as General Language Understanding Evaluation (GLUE) (Wang et al., 2018), Multilingual Unsupervised and Supervised Embeddings (MUSE) (Conneau et al., 2017), and WordNet 18 with reduced relations (WN18RR) (Dettmers et al., 2018). Some of these methods (Bidirectional Encoder Representations from Transformers (BERT), Devlin et al., 2019; enhanced representation through knowledge integration (ERNIE), Sun et al., 2020) do well on benchmarks and do well on tasks that we care about. Despite large numbers of papers with promising performance on benchmarks, there is remarkably little evidence of generalization beyond benchmarks. This history is quickly forgotten as attention moves to SOTA numbers, and away from sensible (credible and worthwhile) motivations.