Abstract
Language model pre-training architectures have proven useful for learning language representations. Bidirectional Encoder Representations from Transformers (BERT), a recent deep bidirectional self-attention representation learned from unlabelled text, has achieved remarkable results in many natural language processing (NLP) tasks after fine-tuning. In this paper, we demonstrate the efficiency of BERT for a morphologically rich language, Turkish. Morphologically complex languages traditionally require intensive pre-processing to make the data suitable for machine learning (ML) algorithms. In particular, tokenization, lemmatization or stemming, and feature engineering are needed to obtain an efficient data model and to overcome data-sparsity and high-dimensionality problems. In this context, we selected five Turkish NLP research problems from the literature: sentiment analysis, cyberbullying identification, text classification, emotion recognition and spam detection. We then compared the empirical performance of BERT with baseline ML algorithms. Finally, we found that BERT improved on the baseline ML algorithms across the selected NLP problems while eliminating heavy pre-processing tasks.
Highlights
There are many sources and types of information, such as social media posts, micro-blogs, news and customer reviews, that accumulate data progressively [1,2]
The automated analysis of large amounts of text data is predominantly handled with machine learning (ML) techniques applied in the natural language processing (NLP) domain
Morphologically rich languages are analysed using various pre-processing techniques that affect the performance of the resulting ML model
Summary
There are many sources and types of information, such as social media posts, micro-blogs, news and customer reviews, that accumulate data progressively [1,2]. Word2Vec, GloVe and fastText embedding models depend on the co-occurrence statistics of the corpus used for pre-training. Although these models capture semantic and syntactic information, they are context independent and generate only one vector embedding for each word. BERT may be a solution for comprehending the contextual meaning of morphologically complicated words [28,29,30] without the aforementioned language processing tasks; this is one of the first motivations of this study. The results of this study will likely help researchers working on Turkish, a morphologically rich language, assess the applicability of the BERT neural language model to other real-world NLP tasks.
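The context-independence limitation noted above can be illustrated with a minimal Python sketch (not from the paper; the vocabulary, vectors and the ambiguous Turkish word chosen here are hypothetical examples): a static embedding model such as Word2Vec, GloVe or fastText is, at lookup time, just a table mapping each surface form to a single fixed vector, so every occurrence of a word receives the same embedding regardless of its sentence context.

```python
# Toy static embedding table, as produced by Word2Vec/GloVe/fastText-style
# pre-training. Each word maps to exactly one vector. The Turkish word
# "yüz" can mean "face" or "hundred", but a static model cannot separate
# those senses. (Vectors here are made up for illustration.)
static_embeddings = {
    "yüz": [0.12, -0.40, 0.77],
    "güzel": [0.55, 0.10, -0.31],
    "lira": [-0.08, 0.64, 0.22],
}

def embed(token: str) -> list:
    """Return the single, context-independent vector for a token."""
    return static_embeddings[token]

# The same vector is returned no matter the surrounding sentence:
v_face = embed("yüz")     # as in "güzel bir yüz" (a beautiful face)
v_hundred = embed("yüz")  # as in "yüz lira" (a hundred liras)
assert v_face == v_hundred  # one vector per word, both senses conflated
```

A contextual model such as BERT, by contrast, computes a separate vector for each occurrence from the full input sequence via self-attention, which is why it can distinguish such senses without hand-crafted pre-processing.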