Abstract

Cybercrime can be associated with undisclosed social media accounts deliberately used to conduct unethical or illegal activities such as cyberbullying, fraudulent transactions, and human trafficking. The objective of this paper is to identify whether two social media accounts belong to the same person by examining the accounts' writing, i.e. comments and posts. To that end, this preliminary study introduces a new algorithm, ChunkedHCs, designed specifically for the authorship verification task, which decides whether a pair of texts was written by the same person. Previous machine learning and deep learning approaches to authorship verification often involve complex feature selection or sophisticated pre-processing steps to cope with topic heterogeneity. These limitations motivate the search for a simpler yet more robust approach that offers competitive verification ability. ChunkedHCs is based on the Higher Criticism statistical test (Donoho and Jin, 2004) and the HC-based similarity algorithm (Kipnis, 2020a, 2020b; Kestemont et al., 2020). On Reddit users’ data, ChunkedHCs shows promising performance, with an accuracy of 0.94 and an F1 score of 0.9381 for texts between 29,000 and 30,000 characters long. It is speculated that the algorithm could also be applied to identify whether two accounts are used by the same person on other social media platforms such as Facebook and Twitter, and even on dark web forums. Various avenues of further research on ChunkedHCs are also proposed.
