Abstract

Methods and tools to conduct authorship analysis of web contents are of growing interest to researchers and practitioners in various security-focused disciplines, including cybersecurity, counter-terrorism, and other fields in which authorship of text may at times be uncertain or obfuscated. Here we demonstrate an automated approach for authorship analysis of web contents. Analysis is conducted through the use of machine learning methodologies, an expansive stylometric feature set, and a series of visualizations intended to facilitate authorship analysis at the author, message, and feature levels. To operationalize this, we use a testbed containing 506,554 forum messages in English and Arabic, sourced from 14,901 authors who participated in an online web forum. A prototype portal system providing authorship comparisons and visualizations was then designed and constructed to support analysis of the feasibility and real-world value of the automated authorship analysis approach. A preliminary user evaluation was performed to assess the efficacy of the visualizations, with results demonstrating that task performance accuracy and efficiency improved through use of the portal.

Highlights

  • Authorship analysis is useful in any application context where authorship attribution is uncertain, unknown, or otherwise obfuscated

  • We demonstrate various case studies on how the system is of use, and present results of a preliminary user evaluation of the portal’s text visualization function

  • A review of recent improvements to authorship analysis on web contents reveals that improvements have been largely grounded in the development and use of writing style markers of electronic text, and in machine learning classification techniques adopted for authorship identification and similarity comparisons

Summary

Introduction

Authorship analysis is useful in any application context where authorship attribution is uncertain, unknown, or otherwise obfuscated. A review of recent improvements to authorship analysis on web contents reveals that improvements have been largely grounded in the development and use of writing style markers (features) of electronic text, and in machine learning classification techniques adopted for authorship identification and similarity comparisons. Recent years have seen growing use of statistical machine learning-based text analysis techniques in authorship analysis studies [4,5,7,12]. Such techniques provide scalability and performance that are helpful when conducting analyses on web forum messages. The authorship analysis methods employed were similar to those used in prior studies: supervised machine learning classifiers, such as a multi-class decision tree, combined with established stylometric identification feature sets encompassing lexical, syntactic, structural, and content-specific attributes [4,5,6,7]. The overall goal of the experiment was to evaluate the performance of the portal’s visualization functionalities, including feature highlighting on the message-level, the stylometric feature radar chart for author-level comparisons, and the stylometric heatmap found within the author-perspective
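To make the notion of a stylometric feature set concrete, the following is a minimal sketch of how a handful of lexical, syntactic, and structural features might be computed for a single forum message. It is not the feature set used in this work: the function name `stylometric_features`, the short `FUNCTION_WORDS` list, and the specific features chosen are all assumptions made for illustration, and content-specific features are omitted.

```python
import re
import string
from collections import Counter

# Hypothetical, deliberately tiny function-word list (assumption for this sketch);
# established stylometric feature sets use far larger lists.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "is", "it")


def stylometric_features(message: str) -> dict:
    """Return a few illustrative stylometric features for one message."""
    # Runs of Unicode letters, so non-Latin scripts such as Arabic also tokenize.
    words = re.findall(r"[^\W\d_]+", message.lower())
    n_words = len(words)
    n_chars = len(message)
    counts = Counter(words)
    lines = message.splitlines()

    def ratio(numerator: float, denominator: float) -> float:
        # Guard against empty messages.
        return numerator / denominator if denominator else 0.0

    features = {
        # Lexical: word- and character-level statistics.
        "avg_word_length": ratio(sum(len(w) for w in words), n_words),
        "vocabulary_richness": ratio(len(counts), n_words),
        "digit_ratio": ratio(sum(c.isdigit() for c in message), n_chars),
        # Syntactic: punctuation usage (ASCII punctuation only in this sketch).
        "punctuation_ratio": ratio(
            sum(c in string.punctuation for c in message), n_chars
        ),
        # Structural: layout of the message.
        "line_count": float(len(lines)),
        "blank_line_count": float(sum(1 for ln in lines if not ln.strip())),
    }
    # Syntactic: relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = ratio(counts[fw], n_words)
    return features


if __name__ == "__main__":
    sample = "The cat sat.\n\nThe dog ran to the park in 2020!"
    for name, value in stylometric_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Feature vectors of this kind, computed per message and labeled by author, are the sort of input a supervised classifier such as a decision tree could be trained on; the same per-feature values could also feed author-level comparisons.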

Experimental Setup
Performance Measures
Hypothesis Testing
Experimental Results
Discussion and implication
Findings
Conclusions and future work