Abstract

Word length is a feature extracted from texts and used to characterize authorial style; its use was demonstrated quantitatively by Mendenhall (Mendenhall, T. C., 1887, The characteristic curves of composition. Science, IX: 237–49). Many similar features for describing authorial style have since been proposed; however, research indicates that word length identifies authors less accurately than these other features. This study proposes a feature, referred to as c-wordL, that aims to improve the accuracy of authorship attribution by classifying words into several types according to their part-of-speech (POS) tags and combining these types with word length data. The proposed method was tested on 200 literary texts from ten different authors in Japanese, English, and Chinese. The results indicated that c-wordL was more accurate than existing word length-based features and captured useful information that word unigrams and POS tag bigrams could not measure. The ease of interpreting the different types of features was also discussed. In summary, c-wordL outperformed existing high-performing features both in explaining distinct writing styles and in identifying authors.
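The idea of combining word types with word length can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not list the word types used, so the `COARSE_TYPE` mapping, the `max_len` cap, and the function name `c_wordl_features` are all assumptions made for illustration. The input is assumed to be already tokenized and POS-tagged.

```python
from collections import Counter

# Hypothetical mapping from POS tags to coarse word types.
# The abstract does not specify the actual categories; these are assumptions.
COARSE_TYPE = {
    "NOUN": "content", "VERB": "content", "ADJ": "content", "ADV": "content",
    "DET": "function", "ADP": "function", "PRON": "function", "CCONJ": "function",
}


def c_wordl_features(tagged_tokens, max_len=10):
    """Return the relative frequency of each (word type, word length) pair.

    tagged_tokens: iterable of (word, pos_tag) pairs.
    max_len: lengths above this value share one bin (an assumed design choice).
    """
    counts = Counter()
    for word, pos in tagged_tokens:
        word_type = COARSE_TYPE.get(pos, "other")  # unknown tags fall into "other"
        length = min(len(word), max_len)
        counts[(word_type, length)] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: n / total for key, n in counts.items()}


# Toy example with hand-assigned tags.
sample = [
    ("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
    ("on", "ADP"), ("the", "DET"), ("mat", "NOUN"),
]
features = c_wordl_features(sample)
for key in sorted(features):
    print(key, round(features[key], 3))
```

Each text then becomes a vector of such relative frequencies, which could be passed to any standard classifier. Plain word length would merge the three-letter function and content words above into a single bin, whereas the combined feature keeps them apart.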
