An Optimal Feature Set for Stylometry-based Style Change Detection at Document and Sentence Level

Vivian Oloo, Calvins Otieno, Lilian D Wanzare

doi:10.32628/cseit228617

Abstract

Writing style change detection models aim to determine the number of authors of a document, whether or not its authors are known. Determining the exact number of authors who contributed to a document remains challenging, particularly when each author contributes text as short as a single sentence, because there is no standardized feature set that discriminates between the works of different authors. Identifying the best feature set for all writing style change detection tasks therefore remains an important problem. This paper sought to determine the best feature set for two such tasks: separating documents with several style changes (multi-authorship) from documents without any style changes (single-authorship), and determining the number and location of style changes in multi-authored documents. We performed exploratory research on existing stylometric features to determine the best document level and sentence level features. Document level features were extracted and used to separate single-authored from multi-authored documents, while sentence level features were used to determine the number of style changes. To this end, we trained a random forest classifier to rank document level features and sentence level features separately, and applied an ablation test to the top 15 sentence level features using the k-means clustering algorithm to confirm the effect of these features on model performance.
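The ablation step described above can be illustrated with a minimal, hedged sketch. It is not the authors' implementation: the k-means routine is a plain Lloyd's algorithm with a deterministic farthest-first initialisation, the pairwise agreement score (Rand index) is an assumed stand-in for the unspecified performance measure, and the feature names `informative_feature` and `noisy_feature` and all values are invented toy data.

```python
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Lloyd's k-means; farthest-first initialisation keeps it deterministic."""
    points = [tuple(p) for p in points]
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centre for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: mean of each cluster (old centre kept if a cluster is empty).
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def rand_index(labels, truth):
    """Fraction of sentence pairs on which clustering and true authorship agree."""
    pairs = list(combinations(range(len(labels)), 2))
    agree = sum((labels[i] == labels[j]) == (truth[i] == truth[j]) for i, j in pairs)
    return agree / len(pairs)


def ablation(rows, names, truth, k):
    """Drop one feature at a time and report the change in score vs. the baseline."""
    baseline = rand_index(kmeans(rows, k), truth)
    deltas = {}
    for drop, name in enumerate(names):
        reduced = [tuple(v for i, v in enumerate(r) if i != drop) for r in rows]
        deltas[name] = rand_index(kmeans(reduced, k), truth) - baseline
    return baseline, deltas


# Invented toy data: one row per sentence, two hypothetical authors (0 and 1).
names = ["informative_feature", "noisy_feature"]
rows = [(4, 10), (5, 90), (4, 12), (5, 88), (0, 11), (1, 89), (0, 9), (1, 91)]
authors = [0, 0, 0, 0, 1, 1, 1, 1]

baseline, deltas = ablation(rows, names, authors, k=2)
print(f"baseline score: {baseline:.3f}")
for name, delta in deltas.items():
    print(f"without {name}: change of {delta:+.3f}")
```

In this toy setting, a positive change after dropping a feature suggests the feature was hurting the clustering, while a change of zero or below suggests it was neutral or useful. The features are left unscaled here for brevity; a real experiment would likely normalise them first.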
The study found that the best document level feature set for separating documents with and without style changes was an ensemble of the top ranked features in experiment one: number of sentence repetitions (num_sentence_repetitions), the most determinant feature, followed by 5-grams, 4-grams, Special_character, sentence_begin_lower, sentence_begin_upper, diversity, automated_readability_index, parenthesis_count, first_word_uppercase, lensear_write_formula, dale_chall_readability, difficult_words and type_token_ratio. The top fifteen sentence level features, as ranked by the random forest classifier, were diversity, dale_chall_readability grade, check_available_vowel, flesch_kincaid grade, parenthesis_count, colon_count, verbs, bigrams, alphabets, personal pronouns, coordinating conjunctions, interjections, modals, type_token_ratio and punctuations_count. The optimal feature set for determining the number of style changes in documents was then selected from these fifteen on the basis of the ablation study, and comprised personal pronouns, check_available_vowel, punctuations_count, parenthesis_count, coordinating conjunctions and colon_count.
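As a concrete illustration of what sentence level features of this kind might look like, the following sketch computes four of the named features for a single sentence. The definitions are assumptions (the abstract does not specify how each feature is computed), and the helper name `sentence_features` and the example sentence are hypothetical.

```python
import re
import string


def sentence_features(sentence):
    """Compute a few sentence level stylometric features (assumed definitions)."""
    # Lowercased word tokens; the tokenisation rule is a simplifying assumption.
    tokens = re.findall(r"[A-Za-z0-9']+", sentence.lower())
    return {
        # Distinct tokens divided by total tokens.
        "type_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
        # Number of ASCII punctuation characters.
        "punctuations_count": sum(ch in string.punctuation for ch in sentence),
        # Opening and closing parentheses counted separately.
        "parenthesis_count": sentence.count("(") + sentence.count(")"),
        "colon_count": sentence.count(":"),
    }


example = "The model (a random forest) ranks features: the best features win."
feats = sentence_features(example)
for name, value in feats.items():
    print(name, value)
```

Applying such a function to every sentence of a document yields one feature vector per sentence, which is the kind of input that a ranking or clustering step would operate on.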

Journal: International Journal of Scientific Research in Computer Science, Engineering and Information Technology
Publication Date: Nov 15, 2022
License type: cc-by
