Abstract

The World Wide Web is growing rapidly, with massive amounts of textual content generated primarily through social media sites. Most users are reluctant to provide genuine personal details alongside the text they post on these sites. To recover accurate information about authors, researchers established a research area known as Authorship Analysis, which infers details about authors by examining their text. Authorship Profiling is one type of Authorship Analysis; it predicts demographic characteristics of authors, such as age, gender, location, educational background, native language and personality traits, by examining the writing style of their text. Stylometry is a research area that defines a set of stylometric features, namely word-based, character-based, syntactic, structural and content-based features, for differentiating authors' writing styles. In this work, experiments were conducted with various stylistic features, N-grams and content-based features for gender prediction. These features are used to represent documents as vectors, and the classification algorithms build a model by processing these vectors. Two classification algorithms, Random Forest and Naïve Bayes Multinomial, were used for classification. We concentrated on predicting gender on the PAN 2019 competition Twitter dataset. Our approach achieved higher accuracy than many existing Authorship Profiling approaches.
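To make the pipeline described above concrete (features, then document vectors, then a classifier), the following is a minimal sketch of a Multinomial Naïve Bayes classifier over word and character N-gram counts. It is an illustration under stated assumptions, not the authors' implementation: the helper `extract_features`, the class `MultinomialNB`, the trigram size, the add-one smoothing, the toy documents and the placeholder labels `class_a` / `class_b` (standing in for gender labels) are all hypothetical and do not come from the PAN 2019 dataset.

```python
import math
from collections import Counter, defaultdict


def extract_features(text, n=3):
    """Return a bag of word tokens and character n-grams for one document.

    Hypothetical feature set: the abstract names word-based, character-based
    and N-gram features, but does not specify the exact definitions.
    """
    lowered = text.lower()
    features = Counter("w=" + tok for tok in lowered.split())
    features.update("c=" + lowered[i:i + n] for i in range(len(lowered) - n + 1))
    return features


class MultinomialNB:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.feature_counts = defaultdict(Counter)   # feature counts per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            feats = extract_features(doc)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        self.totals = {c: sum(fc.values()) for c, fc in self.feature_counts.items()}
        return self

    def predict(self, doc):
        feats = extract_features(doc)
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / n_docs)  # log prior
            for feat, freq in feats.items():
                if feat not in self.vocab:
                    continue  # ignore features never seen in training
                p = (self.feature_counts[label][feat] + 1) / (self.totals[label] + vocab_size)
                score += freq * math.log(p)  # log likelihood, weighted by count
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up training data; labels are placeholders for author classes.
train_docs = [
    "loved the football match tonight",
    "football season starts again",
    "baking bread with my sister",
    "new bread recipe turned out great",
]
train_labels = ["class_a", "class_a", "class_b", "class_b"]

model = MultinomialNB().fit(train_docs, train_labels)
print(model.predict("football tonight"))
print(model.predict("bread recipe"))
```

In practice, a study of this kind would more likely rely on an established toolkit for both Random Forest and Multinomial Naïve Bayes; the from-scratch version here only shows how count-based document vectors feed the classifier.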
