IndoKEPLER, IndoWiki, and IndoLAMA: A Knowledge-enhanced Language Model, Dataset, and Benchmark for the Indonesian Language

Inigo Ramli,Adila Alfa Krisnadhi,Radityo Eko Prasojo

doi:10.1109/iwbis56557.2022.9924844

Abstract

Pretrained language models posses an ability to learn the structural representation of a natural language by processing unstructured textual data. However, the current language model design lacks the ability to learn factual knowledge from knowledge graphs. Several attempts have been made to address this issue, such as the development of KEPLER. KEPLER combines the BERT language model and TransE knowledge embedding method to achieve a language model that can incorporate knowledge graphs as training data. Unfortunately, such knowledge enhanced language model is not yet available for the Indonesian language. In this experiment, we propose IndoKEPLER: a language model trained usingWikipedia Bahasa Indonesia andWikidata. We also create a new knowledge probing benchmark named IndoLAMA to test the ability of a language model to recall factual knowledge. The benchmark is based on LAMA, which is designed to test the suitability of our language model to be used as a knowledge base. IndoLAMA tests a language model by giving cloze style question and compare the prediction of the model to the factually correct answer. This experiment shows that IndoKEPLER increases the ability of a normal DistilBERT model to recall factual knowledge by 0.8%. Moreover, the most significant increase happens when dealing with many-to-one relationships, where IndoKEPLER outperforms it’s original text encoder model by 3%.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

R Discovery Prime

R Discovery Prime

IndoKEPLER, IndoWiki, and IndoLAMA: A Knowledge-enhanced Language Model, Dataset, and Benchmark for the Indonesian Language

Abstract

Talk to us

Similar Papers

Lead the way for us

Similar Papers

Barack’s Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling
Robert Logan ... Nelson F Liu
-
Robert Logan, et. al.Robert Logan ... Nelson F Liu
01 Jan 2019
01 Jan 2019

Modelo Acústico y de Lenguaje del Idioma Español para el dialecto Cucuteño, Orientado al Reconocimiento Automático del Habla
Juan David Celis Nuñez ... Rodrigo Andres Llanos Castro
Ingeniería | VOL. 22
Juan David Celis Nuñez, et. al.Juan David Celis Nuñez ... Rodrigo Andres Llanos Castro
12 Sep 2017
Ingeniería | VOL. 22

Building Acoustic and Language Model for Continuous Speech Recognition in Bahasa Indonesia
Andreas Widjaja ... Vincent Elbert Budiman
Jurnal Teknik Informatika dan Sistem Informasi | VOL. 6
Andreas Widjaja, et. al.Andreas Widjaja ... Vincent Elbert Budiman
10 Aug 2020
Jurnal Teknik Informatika dan Sistem Informasi | VOL. 6

Recurrent neural network language model adaptation with curriculum learning
Yangyang Shi ... Catholijn M Jonker
Computer Speech & Language | VOL. 33
Yangyang Shi, et. al.Yangyang Shi ... Catholijn M Jonker
01 Dec 2014
Computer Speech & Language | VOL. 33

Editage

Paperpal

R Discovery

Mind the Graph

R Discovery Prime

R Discovery Prime

IndoKEPLER, IndoWiki, and IndoLAMA: A Knowledge-enhanced Language Model, Dataset, and Benchmark for the Indonesian Language

Abstract

Talk to us

Similar Papers