Abstract

Stylometric analysis of texts relies on learning the characteristic traits of authors' writing styles. Once these patterns are discovered, they can be compared with those present in other text samples in order to recognise their authorship. This recognition can be compromised if input datasets are prepared without considering possible stratification of the input space, which leads to specific groupings of datapoints, or sub-classes within the distinguished classes. This paper presents research on constructing input datasets with various structures, and on combining such structures across train and test sets. The influence of different forms of stratification on the performance of selected popular classification systems was observed. To minimise the number of influencing factors, the authorship attribution task was performed as binary classification with balanced classes. The stylometric descriptors used belonged to the lexical and syntactic groups, giving frequencies of occurrence for chosen style-markers. This yielded real-valued attributes, which were explored without discretisation, in order to avoid any bias that this procedure might introduce into the observations.
