Abstract

Fine-tuning pre-trained language models has significantly advanced the state of the art in a wide range of downstream NLP tasks. Usually, such language models are learned from large and well-formed text corpora drawn from, e.g., encyclopedic resources, books or news. However, a significant amount of the text to be analyzed nowadays is Web data, often from social media. In this paper we consider the research question: How do standard pre-trained language models generalize and capture the peculiarities of the rather short, informal and frequently automatically generated text found in social media? To answer this question, we focus on bot detection in Twitter as our evaluation task and test the performance of fine-tuning approaches based on language models against popular neural architectures such as LSTMs and CNNs combined with pre-trained and contextualized embeddings. Our results also show strong performance variations among the different language model approaches, which suggests further research.

Highlights

  • Transfer learning techniques (Pan and Yang, 2010) based on language models have successfully delivered breakthrough accuracies in all kinds of downstream NLP tasks

  • The resulting language models are fine-tuned for the specific domain and task, continuously advancing the state of the art across the different evaluation tasks and benchmarks commonly used by the NLP community

  • Of the different language models we evaluated, the OpenAI Generative Pre-trained Transformer (GPT) beats BERT and ULMFiT on the bot/no-bot classification task, suggesting, somewhat surprisingly, that a forward, unidirectional language model is better suited to social media messages than other language-modeling architectures


Summary

Introduction

Transfer learning techniques (Pan and Yang, 2010) based on language models have successfully delivered breakthrough accuracies in all kinds of downstream NLP tasks. Common practice for transfer learning in NLP was based on pre-trained, context-independent embeddings. These are learned from large corpora and encode different types of syntactic and semantic relations that can be observed when operating on the vector space. In this paper we explore how well such representations transfer to social media text and empirically study how pre-trained embeddings and language models perform when used to analyze text from social media. Our results indicate that fine-tuned pre-trained language models outperform pre-trained and contextualized embeddings used in conjunction with CNNs or LSTMs for the task at hand. This shows evidence that language models capture much of the peculiarities of social media and bot language, or at least are flexible enough to generalize to such contexts during fine-tuning.
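The CNN-over-embeddings baseline mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the random embedding matrix and filter shapes are hypothetical stand-ins for the pre-trained vectors and learned parameters, and only the forward pass (convolution, max-over-time pooling, logistic output) is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in practice the embedding matrix would hold
# pre-trained vectors (e.g. GloVe) and filters/weights would be learned.
vocab_size, embed_dim, n_filters, window = 1000, 50, 8, 3
embeddings = rng.normal(size=(vocab_size, embed_dim))
filters = rng.normal(scale=0.1, size=(n_filters, window, embed_dim))
weights, bias = rng.normal(scale=0.1, size=n_filters), 0.0

def conv_features(token_ids):
    """Slide each filter over the embedded sequence, then max-pool over time."""
    seq = embeddings[token_ids]                       # (seq_len, embed_dim)
    n_windows = len(token_ids) - window + 1
    feats = []
    for f in filters:                                 # f: (window, embed_dim)
        acts = [np.maximum(0.0, np.sum(seq[i:i + window] * f))
                for i in range(n_windows)]            # ReLU activations
        feats.append(max(acts))                       # max-over-time pooling
    return np.array(feats)

def bot_probability(token_ids):
    """Logistic output: probability that the message was written by a bot."""
    h = conv_features(token_ids)
    return 1.0 / (1.0 + np.exp(-(h @ weights + bias)))

# A toy 20-token "tweet" represented as random vocabulary indices.
p_bot = bot_probability(rng.integers(0, vocab_size, size=20))
```

The fine-tuning approaches evaluated in the paper replace this fixed feature extractor with a full pre-trained language model whose weights are updated on the bot-detection data.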

State of the Art
Experiments
Dataset
Pre-trained embeddings
CNN for text classification
Contextualized embeddings
Combining embeddings
Dynamic and pre-trained embeddings
Bidirectional long short term memory networks
Pre-trained languages models and fine-tuning
Findings
Discussion