Abstract

Cyber forensics is an emerging field whose scope can benefit from further contribution. Criminals and terrorists tend to use online messaging services and various websites to commit illegal acts or acts that put people's lives at risk. Online anonymity allows any internet user to send a message that cannot be traced back to them, especially if they registered under a false name and with other false information. This paper proposes establishing a database for identifying anonymous online message senders based on a representation of aspects of their writing. The proposed database can be used in conjunction with other forensic tools to support a digital forensic investigator by generating ideas and hypotheses about anonymous online message senders. It is hoped that developing such a database will not only help identify anonymous online message senders so that they can be traced or prosecuted, but will also yield information that may be useful in solving other online criminal activities. To test the applicability of the proposed database, the author built a simple database of 221 participants under extreme conditions: short samples of about 2 to 3 lines (35-40 words), data sets restricted to a single topic and genre, and a large number of participants. The author used syntactic, word-based, and character-based identifiers to represent and define the linguistic profile of each participant. The author also experimented with various data analysis and adjustment methods, including length adjustment, standardization, dimensionality reduction, and clustering models. To select style identifiers, the author applied the term frequency-inverse document frequency (TF-IDF) technique, except when centroid analysis was used.
Further, the author validated the test results at each stage of the analysis and found that the stylometric test identified the authors of anonymous online messages with an accuracy of 60% for function word usage and 50% for part-of-speech frequencies. Although these results do not provide persuasive support for the proposed database, more thorough testing is needed with expanded profiles of more than 100 words per participant.
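
As an illustration only, the kind of pipeline described above (function-word identifiers, length adjustment, standardization, and a centroid comparison) might be sketched as follows. This is not the author's implementation: the function-word list, the Euclidean distance, the nearest-centroid decision rule, and the names `profile`, `standardize`, and `attribute` are all assumptions made for the sake of a small, runnable example.

```python
import math
import re
from collections import Counter

# Hypothetical, very small function-word list (an assumption; the paper's
# actual set of style identifiers is not given in the abstract).
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "i", "you", "is", "it"]


def profile(text):
    """Length-adjusted profile: each function-word count divided by token count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n = max(len(tokens), 1)  # avoid division by zero on empty input
    return [counts[w] / n for w in FUNCTION_WORDS]


def standardize(vectors):
    """Z-score every feature across all vectors; returns (scaled, means, stds)."""
    dims = len(vectors[0])
    means = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((v[d] - means[d]) ** 2 for v in vectors) / len(vectors)
        stds.append(math.sqrt(var) or 1.0)  # constant features keep a scale of 1
    scaled = [[(v[d] - means[d]) / stds[d] for d in range(dims)] for v in vectors]
    return scaled, means, stds


def attribute(unknown_text, known_texts):
    """Return the author whose centroid is nearest to the unknown message.

    known_texts maps an author label to a list of that author's sample texts.
    """
    authors, vectors = [], []
    for author, texts in known_texts.items():
        for t in texts:
            authors.append(author)
            vectors.append(profile(t))
    scaled, means, stds = standardize(vectors)

    # Group the standardized vectors by author and average them into centroids.
    grouped = {}
    for author, vec in zip(authors, scaled):
        grouped.setdefault(author, []).append(vec)
    centroids = {
        a: [sum(col) / len(vs) for col in zip(*vs)] for a, vs in grouped.items()
    }

    # Scale the unknown message with the same means and standard deviations.
    query = [(x - m) / s for x, m, s in zip(profile(unknown_text), means, stds)]
    return min(centroids, key=lambda a: math.dist(query, centroids[a]))


# Toy usage with made-up samples (not data from the study).
known = {
    "alice": ["the end of the road of the town", "the king of the hill of the land"],
    "bob": ["i think you and i know you", "i hope you see what i mean to you"],
}
print(attribute("the top of the tree of the park", known))
```

With samples as short as those in the study (35-40 words), most function-word frequencies are zero or close to it, which is one plausible reason why longer profiles are expected to improve accuracy.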
