Abstract
Fine-tuning pre-trained language models has significantly advanced the state of the art in a wide range of downstream NLP tasks. Usually, such language models are learned from large and well-formed text corpora, such as encyclopedic resources, books, or news. However, a significant amount of the text to be analyzed nowadays is Web data, often from social media. In this paper we consider the research question: how do standard pre-trained language models generalize to and capture the peculiarities of the rather short, informal, and frequently automatically generated text found in social media? To answer this question, we focus on bot detection in Twitter as our evaluation task and test the performance of fine-tuning approaches based on language models against popular neural architectures such as LSTMs and CNNs combined with pre-trained and contextualized embeddings. Our results also show strong performance variations among the different language model approaches, which suggests the need for further research.
Highlights
Transfer learning techniques (Pan and Yang, 2010) based on language models have successfully delivered breakthrough accuracies in all kinds of downstream NLP tasks
The resulting language models are fine-tuned for the specific domain and task, continuously advancing the state of the art across the different evaluation tasks and benchmarks commonly used by the NLP community
Of the different language models we evaluated, OpenAI's Generative Pre-trained Transformer (GPT) beats BERT and ULMFiT in the bot/no-bot classification task, suggesting that a forward, unidirectional language model is better suited to social media messages than other language modeling architectures, which is a somewhat surprising result
Summary
Transfer learning techniques (Pan and Yang, 2010) based on language models have successfully delivered breakthrough accuracies in all kinds of downstream NLP tasks. Common practice for transfer learning in NLP was based on pre-trained, context-independent embeddings. These are learned from large corpora and encode different types of syntactic and semantic relations that can be observed when operating on the vector space. In this paper we explore how such pre-trained embeddings and language models perform when used to analyze text from social media. Our results indicate that fine-tuned pre-trained language models outperform pre-trained and contextualized embeddings used in conjunction with CNNs or LSTMs for the task at hand. This provides evidence that language models capture much of the peculiarities of social media and bot language, or are at least flexible enough to generalize to such a context during fine-tuning.
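The sketch below illustrates the kind of fine-tuning setup the summary describes: taking a pre-trained transformer language model and training it end to end as a binary bot/no-bot classifier over tweets. It is a minimal illustration, not the authors' pipeline; the choice of the Hugging Face transformers library, the bert-base-uncased checkpoint, the toy tweets, and the hyperparameters are all assumptions made for the example.

```python
# Minimal sketch (not the authors' code): fine-tuning a pre-trained transformer
# as a binary bot/no-bot classifier for tweets. Assumes the Hugging Face
# transformers library and a BERT checkpoint; the data below is toy data.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"  # any pre-trained checkpoint could be swapped in

class TweetDataset(Dataset):
    """Tokenizes short tweet-like texts with their 0/1 (human/bot) labels."""
    def __init__(self, texts, labels, tokenizer, max_len=64):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=max_len, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        return {k: v[i] for k, v in self.enc.items()}, self.labels[i]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy examples; a real experiment would use a labelled bot-detection corpus.
texts = ["Buy followers now!!! http://spam.example",
         "Had a great coffee with friends today"]
labels = [1, 0]  # 1 = bot, 0 = human
loader = DataLoader(TweetDataset(texts, labels, tokenizer), batch_size=2)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):  # fine-tune the whole network end to end
    for batch, y in loader:
        optimizer.zero_grad()
        out = model(**batch, labels=y)  # cross-entropy loss over the two classes
        out.loss.backward()
        optimizer.step()
```

The baseline approaches mentioned in the summary would instead keep the word embeddings fixed (or contextualized but frozen) and train only a CNN or LSTM classifier on top, whereas here the full pre-trained network is updated for the task.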