Abstract

Language model pre-training architectures have been demonstrated to be useful for learning language representations. Bidirectional Encoder Representations from Transformers (BERT), a recent deep bidirectional self-attention representation learned from unlabelled text, has achieved remarkable results in many natural language processing (NLP) tasks with fine-tuning. In this paper, we demonstrate the effectiveness of BERT for a morphologically rich language, Turkish. Morphologically complex languages traditionally require extensive language pre-processing steps in order to make the data suitable for machine learning (ML) algorithms. In particular, tokenization, lemmatization or stemming and feature engineering are needed to obtain an efficient data representation and to overcome data sparsity and high-dimensionality problems. In this context, we selected five Turkish NLP research problems from the literature: sentiment analysis, cyberbullying identification, text classification, emotion recognition and spam detection. We then compared the empirical performance of BERT with baseline ML algorithms. Finally, we obtained improved results compared to the baseline ML algorithms on the selected NLP problems while eliminating heavy pre-processing tasks.
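The contrast described above can be illustrated with a minimal, hypothetical sketch: a classical pipeline builds a bag-of-words model over (typically stemmed or lemmatized) tokens, whereas BERT fine-tuning relies on subword tokenization instead of hand-crafted pre-processing. The toy sentences, labels and the Turkish checkpoint name `dbmdz/bert-base-turkish-cased` are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: baseline ML pipeline vs. BERT fine-tuning setup for a
# Turkish binary classification task (e.g. spam detection). Data is toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["kargo çok hızlı geldi", "hemen tıkla bedava kazan"]  # toy examples
train_labels = [0, 1]                                                # 0 = ham, 1 = spam

# Baseline: TF-IDF features; in practice preceded by Turkish stemming/lemmatization.
baseline = make_pipeline(TfidfVectorizer(lowercase=True),
                         LogisticRegression(max_iter=1000))
baseline.fit(train_texts, train_labels)

# BERT: WordPiece subword tokenization replaces the hand-crafted pre-processing.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "dbmdz/bert-base-turkish-cased"   # assumed Turkish BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

batch = tokenizer(train_texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch)   # class logits; fine-tune end-to-end with a standard training loop
```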

Highlights

  • There are many sources and types of information, such as social media posts, micro-blogs, news and customer reviews, that accumulate data progressively [1,2]

  • The automated analysis of large amounts of text data is predominantly handled with machine learning (ML) techniques applied to the natural language processing (NLP) domain

  • Morphologically rich languages are analysed with various pre-processing techniques that affect the performance of the resultant ML model


Summary

Introduction

There are many sources and types of information, such as social media posts, micro-blogs, news and customer reviews, that accumulate data progressively [1,2]. Word2Vec, GloVe and fastText embedding models depend on co-occurrence statistics of the corpus used for pre-training. Although these models can capture semantic and syntactic information, they are context independent and generate only one vector embedding for each word. BERT may be a solution for capturing the contextual meaning of morphologically complicated words [28,29,30] without the aforementioned language processing tasks; this is one of the main motivations of this study. The results of this study will likely help researchers working on the morphologically rich Turkish language in terms of the applicability of the BERT neural language model to other real-world NLP tasks.
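The difference between a single static vector per word and a contextual representation can be shown with a small sketch, assuming the Hugging Face transformers library and the community Turkish checkpoint `dbmdz/bert-base-turkish-cased` (an illustrative choice, not necessarily the model used in the paper): the same surface form "yüz" ("face" vs. "one hundred") receives a different BERT vector in each sentence, whereas Word2Vec, GloVe or fastText would assign it one fixed embedding.

```python
# Hedged sketch: contextual BERT embeddings for the ambiguous Turkish word "yüz".
import torch
from transformers import AutoTokenizer, AutoModel

checkpoint = "dbmdz/bert-base-turkish-cased"   # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model.eval()

# "yüz" means "face" in the first sentence and "one hundred" in the second.
sentences = ["Onun yüzü çok güzel.", "Cebimde yüz lira var."]

with torch.no_grad():
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt")
        hidden = model(**enc).last_hidden_state.squeeze(0)     # (seq_len, hidden_size)
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
        # locate the first subword piece starting with "yüz" (fallback: token 1)
        idx = next((i for i, t in enumerate(tokens) if t.lower().startswith("yüz")), 1)
        print(sent, "->", hidden[idx][:5])    # the printed slices differ across contexts
```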

Turkish language modelling challenges based on its morphological complexity
Frequent Turkish language processing pipeline and ML algorithms
Problem definition and BERT architecture
Definition of the problem
BERT architecture
BERT unsupervised pre-training tasks
Fine-tuning BERT in down-stream NLP tasks
NLP problems from Turkish language literature
Experimental study
BERT fine-tuning parameter selection
BERT architecture experiments
Findings
Conclusion