DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text

Bharathi Raja Chakravarthi, Ruba Priyadharshini, Vigneshwaran Muralidaran, Elizabeth Sherly, John P. McCrae, Shardul Suryawanshi, Navya Jose

doi:10.1007/s10579-022-09583-7

Abstract

This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages, built from social media comments. More than 60,000 YouTube comments were annotated for sentiment analysis and offensive language identification. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and shows high inter-annotator agreement as measured by Krippendorff’s alpha. Because it comprises user-generated content from a multilingual country, the dataset contains all types of code-mixing phenomena. We also present baseline experiments that establish benchmarks on the dataset using machine learning and deep learning methods. The dataset is available on GitHub and Zenodo.
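
As an illustration of the agreement measure named above, the following is a minimal sketch of Krippendorff's alpha for nominal labels, computed from a coincidence matrix. It is not the authors' code: the abstract does not say which variant of alpha was used, so the nominal distance is an assumption, and the function name `krippendorff_alpha_nominal` and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data (illustrative sketch).

    `units` is a list of lists: the labels given to one comment by the
    annotators who rated it. Comments with fewer than two labels are
    skipped, which is how the measure handles missing annotations.
    """
    # Coincidence matrix: each ordered pair of labels within a unit
    # contributes 1 / (m - 1), where m is the number of labels in the unit.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the overall number of pairable values.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more labels")

    # Observed and expected disagreement (up to a shared factor of 1 / n).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Hypothetical toy annotations: three comments, two annotators each.
toy = [["Positive", "Positive"], ["Positive", "Negative"], ["Negative", "Negative"]]
alpha = krippendorff_alpha_nominal(toy)
```

A value of 1 indicates perfect agreement, and values near 0 indicate agreement no better than chance.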

Journal: Language Resources and Evaluation
Publication Date: Feb 4, 2022
Citations: 43
License type: open-access

Similar Papers

DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text
arXiv (Cornell University)
12 May 2021

Overview of the track on Sentiment Analysis for Dravidian Languages in Code-Mixed Text
Bharathi Raja Chakravarthi ... Elizabeth Sherly
16 Dec 2020

Guide for the application of the data augmentation approach on sets of texts in Spanish for sentiment and emotion analysis
Rodrigo Gutiérrez Benítez ... Claudia Martínez-Araneda
PloS one | VOL. 19
01 Jan 2024

Multi-task learning in under-resourced Dravidian languages
Adeep Hande ... Siddhanth U Hegde
Journal of Data, Information and Management | VOL. 4
01 Jun 2022
