Abstract

We introduce the Style Transformer for Authorship Representations (STAR) to detect and characterize writing style in social media. The model is trained on a large, heterogeneous corpus derived from public sources, comprising 4.5 × 10^6 authored texts from 70k authors, and uses a Supervised Contrastive Loss to minimize the distance between texts written by the same individual. This pretext pre-training task yields competitive zero-shot performance on the PAN attribution and clustering challenges. Using STAR as a feature extractor, we also attain promising results on the PAN verification challenges. Finally, we present results on our Reddit test partition, where, using a support base of 8 documents of 512 tokens, we can discern authors from sets of up to 1616 authors with at least 80% accuracy. Our pre-trained model is available on Hugging Face at AIDA-UPM/star, and our code is available at jahuerta92/star.
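
To make the training objective concrete, the following is a minimal sketch of a supervised contrastive loss over a batch of text embeddings, where texts sharing an author label act as positives for one another. It is not the authors' implementation (their code is at jahuerta92/star): the function names, the temperature value of 0.1, and the use of plain Python lists instead of tensors are all assumptions made for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def supervised_contrastive_loss(embeddings, author_ids, temperature=0.1):
    """Illustrative supervised contrastive loss (hypothetical helper).

    embeddings:  list of equal-length float lists, one per text
    author_ids:  list of labels; texts with equal labels are positives
    temperature: assumed value, not taken from the paper
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    per_anchor = []
    for i in range(n):
        # Temperature-scaled cosine similarity to every other text in the batch.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        positives = [j for j in sims if author_ids[j] == author_ids[i]]
        if not positives:
            continue  # an anchor with no same-author text contributes nothing
        # Numerically stable log of the softmax denominator.
        max_sim = max(sims.values())
        log_denominator = max_sim + math.log(
            sum(math.exp(s - max_sim) for s in sims.values())
        )
        # Average negative log-probability of picking a same-author text.
        per_anchor.append(
            -sum(sims[p] - log_denominator for p in positives) / len(positives)
        )
    return sum(per_anchor) / len(per_anchor) if per_anchor else 0.0


# Toy batch: two texts near each of two directions in embedding space.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned = supervised_contrastive_loss(toy_embeddings, ["a", "a", "b", "b"])
shuffled = supervised_contrastive_loss(toy_embeddings, ["a", "b", "a", "b"])
```

In this toy batch the loss is lower when the author labels agree with the geometry of the embeddings (`aligned`) than when they cut across it (`shuffled`), which is the behaviour the objective rewards during pre-training.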
