Abstract

The field of natural language processing (NLP) has witnessed a boom in language representation models with the introduction of pretrained language models that are trained on massive textual data and then fine-tuned on downstream NLP tasks. In this paper, we aim to study the evolution of language representation models by analyzing their effect on an under-researched NLP task, emotion analysis, for a low-resource language, Arabic. Most studies in the field of affect analysis have focused on sentiment analysis, i.e., classifying text by valence (positive, negative, neutral), while few go further to analyze finer-grained emotional states (happiness, sadness, anger, etc.). Emotion analysis is a text classification problem that is tackled with machine learning techniques, and different language representation models have been used as the features these models learn from. In this paper, we perform an empirical study on the evolution of language models, from the traditional term frequency–inverse document frequency (TF–IDF), to the word embedding model word2vec, and finally to the recent state-of-the-art pretrained language model, bidirectional encoder representations from transformers (BERT). We observe and analyze how performance increases as we change the language model, and we also investigate different BERT models for Arabic. We find that the best performance is achieved with the ArabicBERT large model, a BERT model trained on a large corpus of Arabic text, with a significant increase in F1-score of 7–21%.
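A minimal sketch, not the authors' code, of how a pretrained Arabic BERT model could be fine-tuned for emotion classification with the Hugging Face transformers library. The checkpoint name, label set, and training text below are illustrative assumptions rather than details taken from the paper.

    # Sketch: fine-tuning a pretrained Arabic BERT for emotion classification.
    # The checkpoint name and label set are assumptions for illustration only.
    import torch
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    MODEL_NAME = "asafaya/bert-large-arabic"        # assumed ArabicBERT large checkpoint
    EMOTIONS = ["anger", "fear", "joy", "sadness"]  # illustrative emotion labels

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=len(EMOTIONS))

    class EmotionDataset(torch.utils.data.Dataset):
        # Wraps tokenized texts and integer emotion labels for the Trainer.
        def __init__(self, texts, labels):
            self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
            self.labels = labels
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[i])
            return item

    train_texts = ["أشعر بسعادة كبيرة اليوم"]       # placeholder training text
    train_labels = [2]                              # index into EMOTIONS ("joy")
    train_ds = EmotionDataset(train_texts, train_labels)

    args = TrainingArguments(output_dir="arabic-emotion-bert",
                             num_train_epochs=3,
                             per_device_train_batch_size=16)
    Trainer(model=model, args=args, train_dataset=train_ds).train()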

Highlights

  • Introduction: The basic building block of language is words; in natural language processing (NLP), we need to convert words into a numerical format to compose a suitable representation that can help machines understand language

  • Language is complex, and processing it computationally is not straightforward

  • First Study: Since the main objective of this paper is to study the impact of the evolution of language models, the experiments conducted in the first study are as follows: 1. Emotion classification using a support vector machine (SVM) with term frequency–inverse document frequency (TF–IDF) features (see the sketch after this list)

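As a concrete illustration of the first-study setup listed above, here is a minimal sketch, assuming scikit-learn, of emotion classification with TF–IDF features and a linear SVM. The texts and labels are placeholders, not the paper's dataset.

    # Sketch: TF-IDF features fed to a linear SVM for emotion classification.
    # Texts and labels are placeholders, not the paper's data.
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    train_texts = ["أنا سعيد جدا اليوم", "هذا الخبر محزن للغاية"]  # placeholder Arabic texts
    train_labels = ["happiness", "sadness"]

    clf = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),  # word uni/bigram TF-IDF
        ("svm", LinearSVC()),                                      # linear SVM classifier
    ])
    clf.fit(train_texts, train_labels)

    print(clf.predict(["خبر سعيد"]))
    # On a held-out test set, macro-averaged F1 could be computed with
    # sklearn.metrics.f1_score(y_true, y_pred, average="macro").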


Introduction

The basic building block of language is words; in natural language processing (NLP), we need to convert words into a numerical format to compose a suitable representation that can help machines understand language. Language consists of different knowledge blocks: phonemes (speech and sound), morphology (words: morphemes and lexemes), syntax (phrases and sentences), and semantics (meaning and context). Words are composed of different morphemes and lexemes, are used to compose phrases and sentences, and have different meanings according to the context they appear in. All these knowledge blocks have to be considered when we convert words into a numerical format, so that the resulting representation can help ML models understand language and perform better on the various NLP tasks and applications. The different approaches for constructing these vectors are called language representation models, or language models (LMs).
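The following is a minimal sketch, assuming the gensim library, of one way such a numerical representation can be built: word2vec maps each word to a dense vector, and a sentence can then be represented, for example, by averaging the vectors of its words. The toy corpus is a placeholder, not the paper's data.

    # Sketch: mapping words to dense vectors with word2vec (gensim) and averaging
    # them into a fixed-size sentence representation. Toy corpus, placeholder data.
    import numpy as np
    from gensim.models import Word2Vec

    sentences = [["أنا", "سعيد", "اليوم"], ["هذا", "الخبر", "محزن"]]  # tokenized toy corpus

    w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

    def sentence_vector(tokens, model):
        # Average the vectors of in-vocabulary tokens into one fixed-size vector.
        vecs = [model.wv[t] for t in tokens if t in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

    print(sentence_vector(["الخبر", "سعيد"], w2v).shape)  # -> (100,)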
