IndoGovBERT: A Domain-Specific Language Model for Processing Indonesian Government SDG Documents

Agus Riyadi, Mate Kovacs, Uwe Serdült, Victor Kryssanov

doi:10.3390/bdcc8110153

Abstract

Achieving the Sustainable Development Goals (SDGs) requires collaboration among various stakeholders, particularly governments and non-state actors (NSAs). This collaboration both produces and depends on a continually growing volume of documents that government officials need to analyze and process systematically. Artificial Intelligence and Natural Language Processing (NLP) could thus offer valuable support for progress towards SDG targets, for example by automating government budget tagging, classifying NSA requests and initiatives, and helping to identify matches between these two categories of activities. Many non-English-speaking countries, including Indonesia, however, have limited NLP resources, such as domain-specific pre-trained language models (PTLMs). This scarcity makes it difficult to automate document processing and improve the efficacy of SDG-related government efforts. The presented study introduces IndoGovBERT, a Bidirectional Encoder Representations from Transformers (BERT)-based PTLM built with domain-specific corpora drawn from the Indonesian government’s public and internal documents. The model is intended to automate various laborious SDG document processing tasks performed by the Indonesian government. Different approaches to PTLM development known from the literature are examined in the context of typical government settings. The methodology that is most effective in terms of the resulting model performance and most efficient in terms of the computational resources required is identified and used to develop the IndoGovBERT model. The developed model is then evaluated in several text classification and similarity assessment experiments, where it is compared with four Indonesian general-purpose language models, a non-transformer approach (the Multilabel Topic Model, MLTM), and a Multilingual BERT model. The results of all experiments highlight the superior capability of the IndoGovBERT model for Indonesian government SDG document processing. This finding suggests that the proposed PTLM development methodology could be adopted to build high-performance specialized PTLMs for governments around the globe that face SDG document processing and other NLP challenges similar to those addressed in the presented study.
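The similarity assessment task mentioned in the abstract, matching NSA requests to government budget items, can be illustrated with a minimal sketch. This is not the authors' implementation: the `embed` function below is a stand-in that uses bag-of-words counts where a PTLM such as IndoGovBERT would supply dense embeddings, and the function names and example texts (given in English for readability) are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    # Stand-in embedding (assumption): bag-of-words token counts.
    # A PTLM would instead return a dense vector for the text.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def best_match(request, budget_items):
    # Return (score, item) for the budget item most similar to the request.
    query = embed(request)
    scored = [(cosine(query, embed(item)), item) for item in budget_items]
    return max(scored, key=lambda pair: pair[0])


# Hypothetical budget items and NSA request, for illustration only.
budget_items = [
    "clean water supply for rural villages",
    "teacher training for primary schools",
]
request = "community proposal for rural clean water access"

score, item = best_match(request, budget_items)
print(f"{item} (similarity {score:.2f})")
```

Swapping `embed` for a function that returns model embeddings would leave the matching logic unchanged, which is why the sketch keeps the two concerns separate.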


Journal: Big Data and Cognitive Computing
Publication Date: Nov 9, 2024
License type: CC BY 4.0

Similar Papers

Bidirectional encoders to state-of-the-art: a review of BERT and its transformative impact on natural language processing
Rajesh Gupta
Информатика. Экономика. Управление - Informatics. Economics. Management | VOL. 3
02 Mar 2024

MenuNER: Domain-Adapted BERT Based NER Approach for a Domain with Limited Dataset and Its Application to Food Menu Domain
Muzamil Hussain Syed ... Sun-Tae Chung
Applied Sciences | VOL. 11
28 Jun 2021

Tibetan Sentence Boundaries Automatic Disambiguation Based on Bidirectional Encoder Representations from Transformers on Byte Pair Encoding Word Cutting Method
Fenfang Li ... Han Deng
Applied Sciences | VOL. 14
02 Apr 2024

T-BERT: A Taiwanese Language Model – Pre-training a BERT Model on Taiwan's Local Languages
01 Jan 2020
