Abstract

Transfer learning with deep language models is regarded as one of the most important recent developments in deep learning. Their application to real-time, high-velocity, high-volume user-generated data has remained elusive, however, because the unprecedented size and complexity of these models impose substantial computational overhead. Recent iterations of these architectures have produced significantly distilled models with state-of-the-art performance and reduced resource requirements. We apply deep transformer language models to user-generated data alongside a robust text normalization pipeline, addressing what is widely considered the Achilles heel of deep learning on user-generated text: data normalization. In this paper, we propose a framework for the ingestion, analysis and storage of real-time data streams. A case study in sentiment analysis and offensive/hateful language detection is used to evaluate the framework. We demonstrate inference on a large Twitter dataset using CPU and GPU clusters, highlighting the viability of the fine-tuned distilled language model for high-volume data. The fine-tuned model significantly outperforms the previous state of the art on several benchmark datasets, providing a powerful model that can be utilized for a variety of downstream tasks. To our knowledge, this is the only study demonstrating powerful transformer language models for real-time social media stream analytics in a distributed setting.
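To make the core idea concrete, the following is a minimal sketch of distilled-transformer inference on a batch of user-generated text, assuming a DistilBERT-style sentiment model served through the Hugging Face Transformers pipeline API; the model name, example tweets, and batch size are illustrative assumptions, not the paper's actual configuration.

```python
# Illustrative sketch only: distilled-transformer sentiment inference.
# The model checkpoint and batch size below are assumptions for demonstration.
from transformers import pipeline

# Distilled models trade a small accuracy drop for much lower latency,
# which is what makes high-volume stream inference feasible.
sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=-1,  # -1 = CPU; set to a GPU index when running on a GPU worker
)

# Hypothetical examples standing in for a normalized tweet stream.
tweets = [
    "loving the new update!!! :)",
    "this is the worst service ever",
]

# Batched inference keeps throughput high on both CPU and GPU clusters.
for result in sentiment(tweets, batch_size=32):
    print(result["label"], round(result["score"], 3))
```

In a distributed deployment, each stream-processing worker would hold one such pipeline and consume normalized micro-batches from the ingestion layer, so throughput scales with the number of workers rather than model size alone.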
