Abstract

The field of natural language processing (NLP) has witnessed a boom in language representation models with the introduction of pretrained language models that are trained on massive textual data and then fine-tuned on downstream NLP tasks. In this paper, we aim to study the evolution of language representation models by analyzing their effect on an under-researched NLP task, emotion analysis, for a low-resource language, Arabic. Most studies in the field of affect analysis have focused on sentiment analysis, i.e., classifying text by valence (positive, negative, neutral), while few go further to analyze finer-grained emotional states (happiness, sadness, anger, etc.). Emotion analysis is a text classification problem that is tackled with machine learning techniques, and different language representation models have been used as the features these models learn from. In this paper, we perform an empirical study on the evolution of language models, from the traditional term frequency–inverse document frequency (TF–IDF), to the word embedding model word2vec, and finally to the recent state-of-the-art pretrained language model, bidirectional encoder representations from transformers (BERT). We observe and analyze how performance increases as we change the language model, and we also investigate different BERT models for Arabic. We find that the best performance is achieved with the ArabicBERT large model, a BERT model trained on a large corpus of Arabic text, with a significant increase in F1-score of 7–21%.
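A minimal sketch, not the authors' code, of how a pretrained Arabic BERT model could be fine-tuned for emotion classification with the Hugging Face transformers library. The checkpoint name, label set, and training text below are illustrative assumptions rather than details taken from the paper.

    # Sketch: fine-tuning a pretrained Arabic BERT for emotion classification.
    # The checkpoint name and label set are assumptions for illustration only.
    import torch
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    MODEL_NAME = "asafaya/bert-large-arabic"        # assumed ArabicBERT large checkpoint
    EMOTIONS = ["anger", "fear", "joy", "sadness"]  # illustrative emotion labels

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=len(EMOTIONS))

    class EmotionDataset(torch.utils.data.Dataset):
        # Wraps tokenized texts and integer emotion labels for the Trainer.
        def __init__(self, texts, labels):
            self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
            self.labels = labels
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[i])
            return item

    train_texts = ["أشعر بسعادة كبيرة اليوم"]       # placeholder training text
    train_labels = [2]                              # index into EMOTIONS ("joy")
    train_ds = EmotionDataset(train_texts, train_labels)

    args = TrainingArguments(output_dir="arabic-emotion-bert",
                             num_train_epochs=3,
                             per_device_train_batch_size=16)
    Trainer(model=model, args=args, train_dataset=train_ds).train()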

Highlights

  • Introduction: The basic building block of language is words; in natural language processing (NLP), we need to convert words into a numerical format to compose a suitable representation that can help machines understand language

  • Language is complex, and processing it computationally is not straightforward

  • First Study: Since the main objective of this paper is to study the impact of the evolution of language models, the experiments conducted in the first study are as follows: 1. Emotion classification using a support vector machine (SVM) with term frequency–inverse document frequency (TF–IDF) features (see the sketch after this list)

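As a concrete illustration of the first-study setup listed above, here is a minimal sketch, assuming scikit-learn, of emotion classification with TF–IDF features and a linear SVM. The texts and labels are placeholders, not the paper's dataset.

    # Sketch: TF-IDF features fed to a linear SVM for emotion classification.
    # Texts and labels are placeholders, not the paper's data.
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    train_texts = ["أنا سعيد جدا اليوم", "هذا الخبر محزن للغاية"]  # placeholder Arabic texts
    train_labels = ["happiness", "sadness"]

    clf = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1)),  # word uni/bigram TF-IDF
        ("svm", LinearSVC()),                                      # linear SVM classifier
    ])
    clf.fit(train_texts, train_labels)

    print(clf.predict(["خبر سعيد"]))
    # On a held-out test set, macro-averaged F1 could be computed with
    # sklearn.metrics.f1_score(y_true, y_pred, average="macro").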


Introduction

The basic building block of language is words; in natural language processing (NLP), we need to convert words into a numerical format to compose a suitable representation that can help machines understand language. Language consists of different knowledge blocks: phonemes (speech and sound), morphology (words: morphemes and lexemes), syntax (phrases and sentences), and semantics (meaning and context). Words are composed of different morphemes and lexemes, are used to compose phrases and sentences, and have different meanings according to the context they appear in. All these knowledge blocks have to be considered when we convert words into a numerical format, so that the resulting representation can help ML models understand language and perform better on the various NLP tasks and applications. The different approaches for constructing these vectors are called language representation models, or language models (LMs).
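The following is a minimal sketch, assuming the gensim library, of one way such a numerical representation can be built: word2vec maps each word to a dense vector, and a sentence can then be represented, for example, by averaging the vectors of its words. The toy corpus is a placeholder, not the paper's data.

    # Sketch: mapping words to dense vectors with word2vec (gensim) and averaging
    # them into a fixed-size sentence representation. Toy corpus, placeholder data.
    import numpy as np
    from gensim.models import Word2Vec

    sentences = [["أنا", "سعيد", "اليوم"], ["هذا", "الخبر", "محزن"]]  # tokenized toy corpus

    w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

    def sentence_vector(tokens, model):
        # Average the vectors of in-vocabulary tokens into one fixed-size vector.
        vecs = [model.wv[t] for t in tokens if t in model.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

    print(sentence_vector(["الخبر", "سعيد"], w2v).shape)  # -> (100,)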
