Abstract

In this work, we use a logistic regression classifier to determine whether a given document belongs to the fiction or non-fiction genre. For genre identification, previous work has proposed three classes of features, viz., low-level (character-level and token counts), high-level (lexical and syntactic information) and derived features (type-token ratio, average word length or average sentence length). Using the recursive feature elimination with cross-validation (RFECV) algorithm, we perform feature selection experiments on an exhaustive set of nineteen features (belonging to all the classes mentioned above) extracted from the Brown corpus. As a result, two simple features, viz., the ratio of the number of adverbs to adjectives and the ratio of the number of adjectives to pronouns, turn out to be the most significant. Subsequently, our classification experiments on genre identification of documents from the Brown and Baby BNC corpora demonstrate that the performance of a classifier containing just these two features is on par with that of a classifier containing the exhaustive feature set.
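
To make the two selected features concrete, the following is a minimal sketch of how they might be computed for one document. It assumes the document has already been part-of-speech tagged and that the tags are the coarse labels `ADJ`, `ADV` and `PRON`; the function name `pos_ratio_features`, the tag labels and the zero-denominator fallback are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def pos_ratio_features(tagged_tokens):
    """Return (adverb/adjective ratio, adjective/pronoun ratio) for one document.

    tagged_tokens is a list of (word, tag) pairs. The coarse tag labels
    "ADV", "ADJ" and "PRON" are an assumption made for this sketch.
    """
    counts = Counter(tag for _, tag in tagged_tokens)
    adv, adj, pron = counts["ADV"], counts["ADJ"], counts["PRON"]
    # Falling back to 0.0 when a denominator is zero is an illustrative
    # choice, not something specified in the abstract.
    adv_adj = adv / adj if adj else 0.0
    adj_pron = adj / pron if pron else 0.0
    return adv_adj, adj_pron


# Hypothetical pre-tagged sentence: 1 adverb, 2 adjectives, 1 pronoun.
doc = [
    ("She", "PRON"), ("quickly", "ADV"), ("opened", "VERB"),
    ("the", "DET"), ("old", "ADJ"), ("wooden", "ADJ"), ("door", "NOUN"),
]
features = pos_ratio_features(doc)
print(features)
```

The resulting pair of numbers could then serve as the input vector to a classifier such as logistic regression, which is the model named in the abstract.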

Highlights

  • Texts written in any human language can be classified in various ways, one of them being fiction and non-fiction genres

  • We associate fiction writing with literary perspectives, i.e., an imaginative form of writing which has its own purpose of communication, whereas non-fiction is written in a matter-of-fact manner, although its contents may or may not refer to real-life incidents (Lee, 2001)

  • One could use software to identify news articles that are expected to be written in a matter-of-fact manner but instead use an imaginative writing style to unfairly influence the reader


Summary

Introduction

Texts written in any human language can be classified in various ways, one of them being into fiction and non-fiction genres. These categories can refer either to the actual content of a piece of writing or to the writing style used; in this paper, we use the latter meaning. One could use software to identify news articles that are expected to be written in a matter-of-fact manner but instead use an imaginative writing style to unfairly influence the reader. Another application of such software is in publishing houses, which could use it to automatically filter out article or novel submissions that do not meet certain expected aspects of fiction writing style.

