Abstract

Part-of-speech (POS) tags have been employed in automatic genre classification because they do not ‘reflect the topic of the document, but rather the type of text used in the document’ and because their distribution has been observed to vary across genres. The current study introduces a new set of linguistically fine-grained POS tags, generated by AUTASYS, for automatic genre classification. The experiment was designed to investigate the impact of the proposed feature set in comparison with two alternatives: word unigrams used as a bag of words (BOW) and an impoverished POS tag set. Machine-learning tools were used to evaluate classification performance in terms of F-score. The British component of the International Corpus of English (ICE-GB) was employed as a source of texts from different genres. Ten genre classification tasks were defined on the basis of the existing ICE-GB categories, grouped at different levels of granularity. As our results will show, using linguistically rich POS tags as discriminative features yields better classification performance than BOW for fine-grained genre classification. Our results will further demonstrate that this advantage is attributable to the rich linguistic information, since an impoverished tag set yielded worse classification results.
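To make the feature comparison concrete, the following is a minimal sketch of how a BOW representation and a POS-tag representation might be derived from the same tagged text, together with a per-genre F-score. It is an illustration only: the tag labels (`ART`, `N`, `V`), the function names, and the toy data are hypothetical and do not reproduce the AUTASYS tag set, the ICE-GB categories, or the machine-learning tools used in the study.

```python
from collections import Counter


def bow_features(tagged_doc):
    """Relative frequency of each lower-cased word (bag of words)."""
    counts = Counter(word.lower() for word, _ in tagged_doc)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def pos_features(tagged_doc):
    """Relative frequency of each POS tag, ignoring the words themselves."""
    counts = Counter(tag for _, tag in tagged_doc)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def f_score(gold, predicted, label):
    """Harmonic mean of precision and recall for a single genre label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tagged sentence: (word, tag) pairs with made-up tag labels.
doc = [("The", "ART"), ("results", "N"), ("show", "V"),
       ("the", "ART"), ("effect", "N")]

bow = bow_features(doc)   # keyed by word, so it reflects topic vocabulary
pos = pos_features(doc)   # keyed by tag, so it reflects the type of text
```

Either feature dictionary could then be passed to a classifier of choice; a finer-grained tag set would simply yield more keys in `pos_features`, while an impoverished one would collapse them into fewer.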
