Abstract

Pre-trained language representation models (PLMs) struggle to capture factual knowledge from text. In contrast, knowledge embedding (KE) methods can effectively represent the relational facts in knowledge graphs (KGs) with informative entity embeddings, but conventional KE models cannot take full advantage of the abundant textual information. In this paper, we propose a unified model for Knowledge Embedding and Pre-trained LanguagE Representation (KEPLER), which can not only better integrate factual knowledge into PLMs but also produce effective text-enhanced KE with the strong PLMs. In KEPLER, we encode textual entity descriptions with a PLM as their embeddings, and then jointly optimize the KE and language modeling objectives. Experimental results show that KEPLER achieves state-of-the-art performance on various NLP tasks, and also works remarkably well as an inductive KE model on KG link prediction. Furthermore, for pre-training and evaluating KEPLER, we construct Wikidata5M, a large-scale KG dataset with aligned entity descriptions, and benchmark state-of-the-art KE methods on it. It shall serve as a new KE benchmark and facilitate research on large KGs, inductive KE, and KGs with text. The source code can be obtained from https://github.com/THU-KEG/KEPLER.
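A minimal PyTorch-style sketch of the joint objective described above (hypothetical class and field names, not the authors' released code), assuming a RoBERTa MLM backbone, the first-token representation of an entity description as its entity embedding, and a TransE-style score with one sampled negative tail per triple:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM

class KeplerSketch(nn.Module):
    """Sketch of a joint KE + MLM objective with description-based entity embeddings."""

    def __init__(self, model_name="roberta-base", num_relations=1000, gamma=4.0):
        super().__init__()
        self.lm = AutoModelForMaskedLM.from_pretrained(model_name)
        hidden = self.lm.config.hidden_size
        # num_relations = number of relation types in the KG (set per dataset).
        self.rel_emb = nn.Embedding(num_relations, hidden)  # relation lookup table
        self.gamma = gamma  # margin of the TransE-style score

    def encode_entity(self, input_ids, attention_mask):
        # Entity embedding = hidden state of the first (<s>) token of its description.
        out = self.lm(input_ids=input_ids, attention_mask=attention_mask,
                      output_hidden_states=True)
        return out.hidden_states[-1][:, 0]

    def forward(self, mlm_batch, ke_batch):
        # MLM loss on ordinary text (mlm_batch holds input_ids, attention_mask, labels).
        mlm_loss = self.lm(**mlm_batch).loss

        # KE loss on (head, relation, tail) triples with description-encoded entities.
        h = self.encode_entity(ke_batch["head_ids"], ke_batch["head_mask"])
        t = self.encode_entity(ke_batch["tail_ids"], ke_batch["tail_mask"])
        t_neg = self.encode_entity(ke_batch["neg_tail_ids"], ke_batch["neg_tail_mask"])
        r = self.rel_emb(ke_batch["rel_ids"])

        pos = self.gamma - torch.norm(h + r - t, p=2, dim=-1)      # true triples: high
        neg = self.gamma - torch.norm(h + r - t_neg, p=2, dim=-1)  # corrupted: low
        ke_loss = -(F.logsigmoid(pos) + F.logsigmoid(-neg)).mean()

        return mlm_loss + ke_loss  # joint objective: KE loss + MLM loss
```

Because the entity embedding is recomputed from text, the same encoder serves both as a knowledge-enhanced PLM and as a text-based KE model.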

Highlights

  • Pre-trained language representation models (PLMs) struggle to capture factual knowledge from text

  • We propose KEPLER, a unified model for Knowledge Embedding and Pre-trained LanguagE Representation

  • We introduce Wikidata5M, a new large-scale knowledge graph (KG) dataset, which shall promote research on large-scale KGs, inductive knowledge embedding (KE), and the interactions between KGs and natural language processing (NLP)


Summary

Introduction

Pre-trained language representation models (PLMs) struggle to capture factual knowledge from text. For pre-training and evaluating KEPLER, we construct Wikidata5M, a large-scale KG dataset with aligned entity descriptions, and benchmark state-of-the-art KE methods on it. It shall serve as a new KE benchmark and facilitate research on large KGs, inductive KE, and KGs with text. For pre-training and evaluating KEPLER, we need a KG with (1) large amounts of knowledge facts, (2) aligned entity descriptions, and (3) a reasonable inductive-setting data split, which cannot be satisfied by existing KE benchmarks. Our contribution is three-fold: (1) We propose KEPLER, a knowledge-enhanced PLM obtained by jointly optimizing the KE and MLM objectives, which brings great improvements on a wide range of NLP tasks. (2) By encoding text descriptions as entity embeddings, KEPLER shows its effectiveness as a KE model, especially in the inductive setting. (3) We introduce Wikidata5M, a new large-scale KG dataset, which shall promote research on large-scale KGs, inductive KE, and the interactions between KGs and NLP.
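Building on the sketch above, the following hedged illustration shows why such a model is naturally inductive: entities never seen during training can still be scored for link prediction because their embeddings are computed from their descriptions. The helper name (rank_tails) and its arguments are illustrative assumptions, not an API from the paper's codebase.

```python
import torch

@torch.no_grad()
def rank_tails(model, tokenizer, head_desc, rel_id, candidate_descs, device="cpu"):
    """Rank candidate tail entities for (head, relation, ?) by TransE-style plausibility."""
    def embed(texts):
        batch = tokenizer(texts, padding=True, truncation=True,
                          max_length=128, return_tensors="pt").to(device)
        return model.encode_entity(batch["input_ids"], batch["attention_mask"])

    h = embed([head_desc])                                  # (1, hidden)
    r = model.rel_emb(torch.tensor([rel_id], device=device))
    t = embed(candidate_descs)                              # (num_candidates, hidden)
    scores = -torch.norm(h + r - t, p=2, dim=-1)            # higher = more plausible
    return scores.argsort(descending=True).tolist()
```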

KEPLER
Encoder
Knowledge Embedding
Masked Language Modeling
Training Objectives
Variants and Implementations
Wikidata5M
Data Collection
Data Split
Benchmark
Experimental Setting
NLP Tasks
KE Tasks
Ablation Study
Knowledge Probing Experiment
Running Time Comparison
Correlation with Entity Frequency
Understanding Text or Storing Knowledge
Related Work
Findings
Conclusion and Future Work

