Abstract

This study proposes an integrated framework that considers letter-pair frequencies/combinations along with the lexical features of documents as a means of identifying the authorship of short texts posted anonymously on social media. Taking a quantitative morpho-lexical approach, the study tests the hypothesis that letter information, or mapping, can identify unique stylistic features, so that stable word combinations and morphological patterns can be used successfully for authorship detection in very short texts. This method offers significant potential in the fight against online hate speech, which is often posted anonymously and whose authorship is difficult to identify. The data come from a corpus of 12,240 tweets drawn from 87 Twitter accounts. A self-organizing map (SOM) model was used to classify input patterns in the tweets that shared common features. Tweets grouped in a particular class displayed features suggesting that they were written by a particular author. The results indicate that the classification accuracy of the proposed system was around 76%. Up to 22% of this accuracy was lost, however, when only distinctive words were used, and 26% was lost when the classification procedure was based solely on letter combinations and morphological patterns. The integration of letter-pairs and morphological patterns had the advantage of improving accuracy when determining the author of a given tweet. This indicates that combining different linguistic variables in a single system leads to better performance in classifying very short texts. The results also suggest that the SOM led to better clustering performance because of its capacity to integrate two different linguistic levels for each author profile.
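To make the idea of combining letter-pair and lexical information concrete, the following is a minimal sketch and not the study's implementation. It assumes simple relative frequencies as features and a tiny hand-rolled SOM; the function names (`letter_pairs`, `word_counts`, `build_vectors`, `train_som`, `best_matching_unit`), the grid size, and the learning schedule are all hypothetical choices, since the abstract does not specify them.

```python
import math
import random
from collections import Counter


def letter_pairs(text):
    """Count adjacent letter pairs inside each word (case-folded)."""
    pairs = Counter()
    for word in text.lower().split():
        letters = [c for c in word if c.isalpha()]
        pairs.update(a + b for a, b in zip(letters, letters[1:]))
    return pairs


def word_counts(text):
    """Count words after stripping non-letter characters."""
    cleaned = ("".join(c for c in w if c.isalpha()) for w in text.lower().split())
    return Counter(w for w in cleaned if w)


def build_vectors(texts):
    """Concatenate relative letter-pair and word frequencies for each text."""
    pair_c = [letter_pairs(t) for t in texts]
    word_c = [word_counts(t) for t in texts]
    pair_vocab = sorted(set().union(*pair_c))
    word_vocab = sorted(set().union(*word_c))
    vectors = []
    for pc, wc in zip(pair_c, word_c):
        p_total = sum(pc.values()) or 1  # avoid division by zero
        w_total = sum(wc.values()) or 1
        vectors.append([pc[p] / p_total for p in pair_vocab]
                       + [wc[w] / w_total for w in word_vocab])
    return vectors


def best_matching_unit(nodes, v):
    """Return the grid position whose weights are closest to v."""
    return min(nodes, key=lambda pos: sum((a - b) ** 2 for a, b in zip(nodes[pos], v)))


def train_som(vectors, rows=2, cols=2, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a small SOM; settings are illustrative assumptions."""
    rng = random.Random(seed)
    # Initialise each node from a randomly chosen input vector.
    nodes = {(r, c): list(rng.choice(vectors))
             for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # decaying learning rate
        sigma = max(sigma0 * (1 - frac), 0.01)   # shrinking neighbourhood
        for v in vectors:
            bmu = best_matching_unit(nodes, v)
            for pos, w in nodes.items():
                d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2 * sigma ** 2))
                for i in range(len(w)):
                    w[i] += lr * h * (v[i] - w[i])
    return nodes


texts = ["hello world", "hello world", "zzz qqq"]  # toy stand-ins for tweets
vectors = build_vectors(texts)
som = train_som(vectors)
cells = [best_matching_unit(som, v) for v in vectors]
```

Texts that land in the same grid cell share similar letter-pair and word profiles, which is the intuition behind grouping tweets by likely author. A real system would need far more data per author and tuned SOM settings.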

Highlights

  • With the world-wide impact of computer and internet services on modern life, unprecedented problems and crimes have come to the surface with many negative repercussions

  • To address the limitations of current quantitative linguistic approaches to authorship detection in very short texts, this paper proposes a new method that considers letter-pair frequencies/combinations along with the lexical features of documents

  • Given the uniqueness of language in social media, it is believed that letter information or mapping carries unique stylistic features that can be used alongside analysis of lexical features to enhance authorship detection in relation to very short texts

Summary

Introduction

With the world-wide impact of computer and internet services on modern life, unprecedented problems and crimes have come to the surface with many negative repercussions. Challenges relating to the practical applications of authorship detection remain. One of these challenges is how to identify the authors of very short texts, especially in social media applications. Forensic text types are usually very short and have minimal linguistic features. It is therefore difficult for forensic linguists to develop robust evidence of authorship, because too little linguistic data is available to them.

