Abstract

The automatic classification of posts from hacking-related online forums is of potential value for understanding user behaviour in social networks relating to cybercrime. We designed an annotation schema to label forum posts for three properties: post type, author intent, and addressee. The post type indicates whether the text is a question, a comment, and so on. The author’s intent in writing the post could be positive, negative, moderating discussion, showing gratitude to another user, etc. The addressee of a post tends to be a general audience (e.g. other forum users) or individual users who have already contributed to a threaded discussion. We manually annotated a sample of posts and obtained substantial agreement for post type and addressee, and fair agreement for author intent. We trained rule-based (logical) and machine learning (statistical) classification models to predict these labels automatically, and found that a hybrid logical–statistical model performs best for post type and author intent, whereas a purely statistical model is best for addressee. We discuss potential applications for this data, including the analysis of thread conversations in forum data and the identification of key actors within social networks.

Highlights

  • Underground communities attract actors interested in illicit and black hat articles

  • We find that post type and author intent are best served by a hybrid logical–statistical approach, while addressee is predicted most accurately by a statistical model

  • We propose that, for post type and author intent, the statistical models do not perform well enough to justify discarding the heuristics from our logical models: the baseline decision lists (B2) outperform the statistical models for these annotation types. One problem is that too many predictions are shifted back to the label most frequently found in training, in other words the B1 mode
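
As a rough illustration of the hybrid idea in the highlights above, the following Python sketch tries an ordered decision list first and falls back to a statistical model only when no rule fires. It is not the authors' implementation: the single rule, the function names (`mode_label`, `decision_list`, `hybrid_predict`) and the toy training labels are all assumptions made for the example, and the "statistical model" here is simply a stand-in that returns the most frequent training label, in the spirit of the B1 mode baseline.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# A rule pairs a predicate over the post text with the label it assigns.
Rule = Tuple[Callable[[str], bool], str]


def mode_label(train_labels: List[str]) -> str:
    """Mode baseline: return the most frequent label seen in training."""
    return Counter(train_labels).most_common(1)[0][0]


def decision_list(text: str, rules: List[Rule]) -> Optional[str]:
    """Return the label of the first matching rule, or None if none match."""
    for matches, label in rules:
        if matches(text):
            return label
    return None


def hybrid_predict(
    text: str, rules: List[Rule], statistical_model: Callable[[str], str]
) -> str:
    """Prefer the rule-based label; otherwise defer to the statistical model."""
    label = decision_list(text, rules)
    return label if label is not None else statistical_model(text)


# Hypothetical rule for post type: a post ending in "?" is a question.
RULES: List[Rule] = [(lambda t: t.rstrip().endswith("?"), "question")]

# Toy training labels; the fallback stands in for a trained classifier.
train_labels = ["comment", "comment", "question"]
fallback = lambda text: mode_label(train_labels)

print(hybrid_predict("How do I set this up?", RULES, fallback))  # question
print(hybrid_predict("Nice work on this.", RULES, fallback))     # comment
```

In a real setting the fallback would be a trained classifier rather than the mode label; the point of the sketch is only the ordering, in which hand-written heuristics take precedence and the statistical model handles the remainder.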

Summary

Introduction

Underground communities attract actors interested in illicit and black hat articles. Web forums are used for the exchange of knowledge and the trading of illegal tools and services, such as malware, services to perform denial-of-service attacks, or zero-day exploits. Understanding the social relationships and evolution of actors in these forums is of potential interest for designing early intervention approaches or effective countermeasures. The analysis of these forums is challenging for various reasons. The large volume of data requires automatic tools for extracting knowledge (see the overview in the "Related work" section). The use of nonstandard language, including specific jargon and frequent spelling and grammatical errors, makes the use of standard language processing tools infeasible.
