Abstract

Fine-tuning pre-trained language models has significantly advanced the state of the art in a wide range of downstream NLP tasks. Usually, such language models are learned from large and well-formed text corpora drawn from, e.g., encyclopedic resources, books or news. However, a significant amount of the text to be analyzed nowadays is Web data, often from social media. In this paper we consider the research question: How do standard pre-trained language models generalize and capture the peculiarities of the rather short, informal and frequently automatically generated text found in social media? To answer this question, we focus on bot detection in Twitter as our evaluation task and test the performance of fine-tuning approaches based on language models against popular neural architectures such as LSTMs and CNNs combined with pre-trained and contextualized embeddings. Our results also show strong performance variations among the different language model approaches, which suggests further research.

Highlights

  • Transfer learning techniques (Pan and Yang, 2010) based on language models have successfully delivered breakthrough accuracies in all kinds of downstream NLP tasks

  • The resulting language models are fine-tuned for the specific domain and task, continuously advancing the state of the art across the different evaluation tasks and benchmarks commonly used by the NLP community

  • Of the different language models we evaluated, the OpenAI Generative Pre-trained Transformer (GPT) beats BERT and ULMFiT on the bot/no-bot classification task, suggesting, somewhat surprisingly, that a forward, unidirectional language model is better suited to social media messages than other language-modeling architectures


Summary

Introduction

Transfer learning techniques (Pan and Yang, 2010) based on language models have successfully delivered breakthrough accuracies in all kinds of downstream NLP tasks. Common practice for transfer learning in NLP was based on pre-trained, context-independent embeddings. These are learned from large corpora and encode different types of syntactic and semantic relations that can be observed when operating on the vector space. In this paper we explore how well such representations transfer to social media text and empirically study how pre-trained embeddings and language models perform when used to analyze text from social media. Our results indicate that fine-tuned pre-trained language models outperform pre-trained and contextualized embeddings used in conjunction with CNNs or LSTMs for the task at hand. This shows evidence that language models capture much of the peculiarities of social media and bot language, or at least are flexible enough to generalize to such contexts during fine-tuning.
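The CNN-over-embeddings baseline mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the random embedding matrix and filter shapes are hypothetical stand-ins for the pre-trained vectors and learned parameters, and only the forward pass (convolution, max-over-time pooling, logistic output) is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in practice the embedding matrix would hold
# pre-trained vectors (e.g. GloVe) and filters/weights would be learned.
vocab_size, embed_dim, n_filters, window = 1000, 50, 8, 3
embeddings = rng.normal(size=(vocab_size, embed_dim))
filters = rng.normal(scale=0.1, size=(n_filters, window, embed_dim))
weights, bias = rng.normal(scale=0.1, size=n_filters), 0.0

def conv_features(token_ids):
    """Slide each filter over the embedded sequence, then max-pool over time."""
    seq = embeddings[token_ids]                       # (seq_len, embed_dim)
    n_windows = len(token_ids) - window + 1
    feats = []
    for f in filters:                                 # f: (window, embed_dim)
        acts = [np.maximum(0.0, np.sum(seq[i:i + window] * f))
                for i in range(n_windows)]            # ReLU activations
        feats.append(max(acts))                       # max-over-time pooling
    return np.array(feats)

def bot_probability(token_ids):
    """Logistic output: probability that the message was written by a bot."""
    h = conv_features(token_ids)
    return 1.0 / (1.0 + np.exp(-(h @ weights + bias)))

# A toy 20-token "tweet" represented as random vocabulary indices.
p_bot = bot_probability(rng.integers(0, vocab_size, size=20))
```

The fine-tuning approaches evaluated in the paper replace this fixed feature extractor with a full pre-trained language model whose weights are updated on the bot-detection data.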

State of the Art
Experiments
Dataset
Pre-trained embeddings
CNN for text classification
Contextualized embeddings
Combining embeddings
Dynamic and pre-trained embeddings
Bidirectional long short term memory networks
Pre-trained languages models and fine-tuning
Findings
Discussion