Abstract

Text classification (TC) is a subtask of natural language processing (NLP) that categorizes text pieces into predefined classes based on their textual content and thematic aspects. This process typically involves training a Machine Learning (ML) model on a labeled dataset, where each text example is associated with a specific class. Recent progress in Deep Learning (DL) has enabled the development of deep neural transformer models, which surpass traditional ML ones. However, works in the topic classification literature prioritize high-resource languages, particularly English, while research efforts for low-resource ones, such as Greek, are limited. Taking the above into consideration, this paper presents: (i) the first Greek social media topic classification dataset; (ii) a comparative assessment of a series of traditional ML models trained on this dataset, utilizing an array of text vectorization methods, including TF-IDF, classical word embeddings, and transformer-based Greek embeddings; (iii) a fine-tuned GREEK-BERT-based TC model on the same dataset; (iv) key empirical findings demonstrating that transformer-based embeddings significantly increase the performance of traditional ML models, while our fine-tuned DL model outperforms previous ones. The dataset, the best-performing model, and the experimental code are made public, aiming to augment the reproducibility of this work and advance future research in the field.
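As a rough illustration of the TF-IDF vectorization step mentioned above (not the paper's actual pipeline), the following minimal pure-Python sketch computes smoothed TF-IDF vectors for a small corpus; the function name and the smoothed-idf convention `log((1 + N) / (1 + df)) + 1` are assumptions for illustration:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocab, vectors): a sorted vocabulary and one raw-count
    TF-IDF vector per document. Illustrative sketch only."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for toks in tokenized:
        for term in set(toks):
            df[term] += 1
    vocab = sorted(df)
    # Smoothed inverse document frequency (an assumed convention).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts as tf
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors
```

In practice the resulting vectors would feed a traditional ML classifier (e.g. logistic regression or an SVM); transformer-based embeddings simply replace this vectorization step with dense sentence representations.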
