Learning to Classify Documents According to Formal and Informal Style

Fadi Abu Sheikha, Diana Inkpen

doi:10.33011/lilt.v8i.1305

Abstract

This paper discusses an important issue in computational linguistics: classifying texts as formal or informal in style. Our work describes a genre-independent methodology for building classifiers for formal and informal texts. We used machine learning techniques to perform the automatic classification, and ran the classification experiments at both the document level and the sentence level. First, we studied the main characteristics of each style, in order to train a system that can distinguish between them. We then built two datasets: the first represents general-domain documents of formal and informal style, and the second represents medical texts. The datasets were built by collecting documents for both styles from different sources. After collecting the data, we extracted features from each text. The features that we designed represent the main characteristics of both styles. We tested on the second dataset at the document level to determine whether our model is sufficiently general to work on any type of text. Finally, we tested several classification algorithms, namely Decision Trees, Naïve Bayes, and Support Vector Machines, in order to choose the classifier that generates the best classification results.

Highlights

The need to identify and interpret possible differences in the linguistic style of texts, such as formal or informal, is increasing, as more people use the Internet as their main research resource
The results indicate that the best classifier of the three algorithms is the Decision Tree
The results show that Support Vector Machines (SVM) achieved the highest performance, and was the best classifier for our model

Summary

Introduction

The need to identify and interpret possible differences in the linguistic style of texts, such as formal or informal, is increasing, as more people use the Internet as their main research resource. The formal style is used in most writing and business situations, and when speaking to people with whom we do not have close relationships. Some characteristics of this style are long words and use of the passive voice. Informal style is mainly for casual conversation, such as at home between family members, and is used in writing only when there is a personal or close relationship, such as that of friends and family. Some characteristics of this style are word contractions such as “won’t”, abbreviations like “phone”, and short words (Park, 2007). A brief explanation of supervised learning for automatic classification follows.
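As an illustration of the kind of input a supervised classifier could be trained on, the style characteristics listed above (word length, contractions, passive voice) can be turned into numeric features. The following is a minimal sketch only, not the feature set used in this work: the function name `style_features`, the regular expressions, and the three-letter threshold for "short" words are all assumptions made for the example, and the passive-voice pattern is a crude heuristic that will produce false positives.

```python
import re
from typing import Dict

# Hypothetical heuristics for illustration; not the paper's actual features.
# Contractions: a word followed by an apostrophe and a common clitic
# (this also matches possessives such as "John's").
CONTRACTION = re.compile(r"\b\w+['’](?:t|s|re|ve|ll|d|m)\b", re.IGNORECASE)
# Passive voice: a form of "be" followed by a word ending in -ed or -en.
PASSIVE = re.compile(
    r"\b(?:is|are|was|were|be|been|being)\s+\w+(?:ed|en)\b", re.IGNORECASE
)
# Words: alphabetic runs, optionally with one internal apostrophe.
WORD = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)?")

SHORT_WORD_MAX_LEN = 3  # assumed threshold for a "short" word


def style_features(text: str) -> Dict[str, float]:
    """Return simple per-word style ratios for a text."""
    words = WORD.findall(text)
    n = len(words) or 1  # avoid division by zero on empty input
    return {
        "avg_word_length": sum(len(w) for w in words) / n,
        "short_word_ratio": sum(len(w) <= SHORT_WORD_MAX_LEN for w in words) / n,
        "contraction_ratio": len(CONTRACTION.findall(text)) / n,
        "passive_ratio": len(PASSIVE.findall(text)) / n,
    }


if __name__ == "__main__":
    print(style_features("I won't go, it's too far."))
    print(style_features("The proposal was rejected by the committee."))
```

Each text is reduced to a small dictionary of ratios, which could then be passed as a feature vector to any of the classifiers named in the abstract. Normalizing by word count keeps the features comparable between short sentences and long documents.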

Methods

Results

Discussion

Conclusion