Turkish is one of the most widely spoken languages in the world, yet it remains a low-resource language. The wide use of Turkish on social media platforms such as Twitter, Instagram, and TikTok, together with the country's strategic position in world politics, makes it appealing to social network researchers and industry. To address this need, we introduce TurkishBERTweet, the first large-scale pre-trained language model for Turkish social media, built using over 894 million Turkish tweets. The model shares the architecture of the RoBERTa-base model but uses a smaller input length, making TurkishBERTweet lighter than the most widely used Turkish model, BERTurk, and giving it significantly lower inference time. We trained our model using the same pretraining approach as RoBERTa and evaluated it on two tasks: Sentiment Classification and Hate Speech Detection. We demonstrate that TurkishBERTweet outperforms the available alternatives in generalizability, and that its lower inference time provides a significant advantage for processing large-scale datasets. We also show that custom preprocessors for social media can recover information from platform-specific entities. Finally, we compare TurkishBERTweet with commercial solutions such as OpenAI and Gemini models, as well as other available Turkish LLMs, in terms of cost and performance, demonstrating that TurkishBERTweet is scalable and cost-effective.
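To illustrate the kind of platform-specific normalization the abstract refers to, below is a minimal sketch of a social media preprocessor in Python. The placeholder tokens (<user>, <url>, <hashtag>) and the function name are illustrative assumptions, not the paper's exact scheme; the actual TurkishBERTweet preprocessor may use a different token vocabulary.

    import re

    def preprocess_tweet(text: str) -> str:
        """Map platform-specific entities to stable placeholder tokens
        so the tokenizer sees consistent inputs across tweets.
        Placeholder tokens here are assumptions for illustration."""
        text = re.sub(r"https?://\S+|www\.\S+", "<url>", text)      # URLs -> placeholder
        text = re.sub(r"@\w+", "<user>", text)                      # user mentions -> placeholder
        text = re.sub(r"#(\w+)", r"<hashtag> \1 </hashtag>", text)  # keep hashtag word, mark boundaries
        return re.sub(r"\s+", " ", text).strip()                    # collapse whitespace

    print(preprocess_tweet("@ali Harika bir gün! #istanbul https://t.co/xyz"))
    # -> "<user> Harika bir gün! <hashtag> istanbul </hashtag> <url>"

Replacing URLs before mentions matters in this sketch: doing it in the reverse order could corrupt placeholders if they themselves contained entity-like characters. Keeping the hashtag word (rather than deleting it) preserves topical information, which is the sense in which such preprocessors "recover information from platform-specific entities."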