Abstract

Language model pre-training architectures have been demonstrated to be useful for learning language representations. Bidirectional Encoder Representations from Transformers (BERT), a recent deep bidirectional self-attention representation learned from unlabelled text, has achieved remarkable results in many natural language processing (NLP) tasks with fine-tuning. In this paper, we demonstrate the effectiveness of BERT for a morphologically rich language, Turkish. Morphologically complex languages traditionally require extensive language pre-processing steps in order to make the data suitable for machine learning (ML) algorithms. In particular, tokenization, lemmatization or stemming and feature engineering are needed to obtain an efficient data representation and to overcome data sparsity and high-dimensionality problems. In this context, we selected five Turkish NLP research problems from the literature: sentiment analysis, cyberbullying identification, text classification, emotion recognition and spam detection. We then compared the empirical performance of BERT with baseline ML algorithms. Finally, we obtained improved results compared to the baseline ML algorithms on the selected NLP problems while eliminating heavy pre-processing tasks.
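The contrast described above can be illustrated with a minimal, hypothetical sketch: a classical pipeline builds a bag-of-words model over (typically stemmed or lemmatized) tokens, whereas BERT fine-tuning relies on subword tokenization instead of hand-crafted pre-processing. The toy sentences, labels and the Turkish checkpoint name `dbmdz/bert-base-turkish-cased` are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch: baseline ML pipeline vs. BERT fine-tuning setup for a
# Turkish binary classification task (e.g. spam detection). Data is toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["kargo çok hızlı geldi", "hemen tıkla bedava kazan"]  # toy examples
train_labels = [0, 1]                                                # 0 = ham, 1 = spam

# Baseline: TF-IDF features; in practice preceded by Turkish stemming/lemmatization.
baseline = make_pipeline(TfidfVectorizer(lowercase=True),
                         LogisticRegression(max_iter=1000))
baseline.fit(train_texts, train_labels)

# BERT: WordPiece subword tokenization replaces the hand-crafted pre-processing.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "dbmdz/bert-base-turkish-cased"   # assumed Turkish BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

batch = tokenizer(train_texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch)   # class logits; fine-tune end-to-end with a standard training loop
```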

Highlights

  • There are many sources and types of information, such as social media posts, micro-blogs, news and customer reviews, that accumulate data progressively [1,2]

  • The automated analysis of large amounts of text data is predominantly handled with machine learning (ML) techniques applied to the natural language processing (NLP) domain

  • Morphologically rich languages are analysed with various pre-processing techniques that affect the performance of the resultant ML model


Summary

Introduction

There are many sources and types of information, such as social media posts, micro-blogs, news and customer reviews, that accumulate data progressively [1,2]. Word2Vec, GloVe and fastText embedding models depend on co-occurrence statistics of the corpus used for pre-training. Although these models can capture semantic and syntactic information, they are context independent and generate only one vector embedding for each word. BERT may be a solution for capturing the contextual meaning of morphologically complicated words [28,29,30] without the aforementioned language processing tasks; this is one of the main motivations of this study. The results of this study will likely help researchers working on the morphologically rich Turkish language in terms of the applicability of the BERT neural language model to other real-world NLP tasks.
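The difference between a single static vector per word and a contextual representation can be shown with a small sketch, assuming the Hugging Face transformers library and the community Turkish checkpoint `dbmdz/bert-base-turkish-cased` (an illustrative choice, not necessarily the model used in the paper): the same surface form "yüz" ("face" vs. "one hundred") receives a different BERT vector in each sentence, whereas Word2Vec, GloVe or fastText would assign it one fixed embedding.

```python
# Hedged sketch: contextual BERT embeddings for the ambiguous Turkish word "yüz".
import torch
from transformers import AutoTokenizer, AutoModel

checkpoint = "dbmdz/bert-base-turkish-cased"   # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model.eval()

# "yüz" means "face" in the first sentence and "one hundred" in the second.
sentences = ["Onun yüzü çok güzel.", "Cebimde yüz lira var."]

with torch.no_grad():
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt")
        hidden = model(**enc).last_hidden_state.squeeze(0)     # (seq_len, hidden_size)
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
        # locate the first subword piece starting with "yüz" (fallback: token 1)
        idx = next((i for i, t in enumerate(tokens) if t.lower().startswith("yüz")), 1)
        print(sent, "->", hidden[idx][:5])    # the printed slices differ across contexts
```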

Turkish language modelling challenges based on its morphological complexity
Frequent Turkish language processing pipeline and ML algorithms
Problem definition and BERT architecture
Definition of the problem
BERT architecture
BERT unsupervised pre-training tasks
Fine-tuning BERT in down-stream NLP tasks
NLP problems from Turkish language literature
Experimental study
BERT fine-tuning parameter selection
BERT architecture experiments
Findings
Conclusion