Understanding writing style in social media with a supervised contrastively pre-trained transformer

Javier Huertas-Tato, Alejandro Martín, David Camacho

doi:10.1016/j.knosys.2024.111867

Abstract

We introduce the Style Transformer for Authorship Representations (STAR) to detect and characterize writing style in social media. The model is trained on a large heterogeneous corpus derived from public sources, comprising 4.5⋅10^6 authored texts from 70k authors, using Supervised Contrastive Loss to minimize the distance between texts written by the same individual. This pretext pre-training task yields competitive zero-shot performance on PAN challenges for attribution and clustering. We also attain promising results on PAN verification challenges using STAR as a feature extractor. Finally, we present results from our test partition on Reddit, where, using a support base of 8 documents of 512 tokens each, we can discern authors from sets of up to 1616 authors with at least 80% accuracy. We share our pre-trained model on Hugging Face at AIDA-UPM/star, and our code is available at jahuerta92/star.
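To make the training objective concrete, the following is a minimal sketch of a supervised contrastive loss in its common formulation, where texts by the same author act as positives for each other and all other texts in the batch act as negatives. It is an illustration only, not the authors' implementation: the function name, the temperature value of 0.1, and the toy two-dimensional embeddings are assumptions made for this example.

```python
import math


def supervised_contrastive_loss(embeddings, authors, temperature=0.1):
    """Mean loss over anchors that have at least one same-author text.

    Hypothetical helper for illustration; temperature=0.1 is an assumed value.
    """
    # L2-normalise so that dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])

    n = len(normed)
    losses = []
    for i in range(n):
        # Temperature-scaled similarity of anchor i to every other text.
        sims = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }
        positives = [j for j in sims if authors[j] == authors[i]]
        if not positives:
            continue  # no same-author text in the batch: nothing to pull closer
        # Log-sum-exp over all non-anchor texts, stabilised by the maximum.
        top = max(sims.values())
        log_denom = top + math.log(sum(math.exp(s - top) for s in sims.values()))
        # Average negative log-probability assigned to same-author texts.
        losses.append(-sum(sims[j] - log_denom for j in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


# Toy embeddings (assumed data): two texts near each axis.
texts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
grouped = supervised_contrastive_loss(texts, ["a", "a", "b", "b"])
mixed = supervised_contrastive_loss(texts, ["a", "b", "a", "b"])
```

In this toy batch, `grouped` labels nearby embeddings as the same author and `mixed` labels distant ones as the same author, so the loss is lower for `grouped`. Minimizing it therefore pushes an encoder to place same-author texts close together.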

Full Text