Abstract

This paper addresses an important problem in computational linguistics: classifying texts as formal or informal in style. We describe a genre-independent methodology for building classifiers for formal and informal texts. We used machine learning techniques for the automatic classification and ran the experiments at both the document level and the sentence level. First, we studied the main characteristics of each style in order to train a system that can distinguish between them. We then built two datasets by collecting documents of both styles from different sources: the first contains general-domain documents of formal and informal style, and the second contains medical texts. We tested on the second dataset at the document level to determine whether our model is sufficiently general to work on any type of text. After collecting the data, we extracted features from each text; the features we designed represent the main characteristics of both styles. Finally, we tested several classification algorithms, namely Decision Trees, Naïve Bayes, and Support Vector Machines, in order to choose the classifier that produces the best classification results.

Highlights

  • The need to identify and interpret possible differences in the linguistic style of texts, such as formal or informal, is increasing as more people use the Internet as their main research resource.

  • They indicate that the best classifier of the three algorithms is the Decision Tree.

  • The results show that Support Vector Machines (SVM) achieved the highest performance and was the best classifier for our model.


Summary

Introduction

The need to identify and interpret possible differences in the linguistic style of texts, such as formal or informal, is increasing as more people use the Internet as their main research resource. The formal style is used in most writing and business situations, and when speaking to people with whom we do not have close relationships. Some characteristics of this style are long words and use of the passive voice. The informal style is mainly used in casual conversation, for example at home between family members, and appears in writing only when there is a personal or close relationship, such as that between friends and family. Some characteristics of this style are word contractions such as “won’t”, abbreviations like “phone”, and short words (Park, 2007). A brief explanation of supervised learning for automatic classification follows.
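The style characteristics listed above (long words, passive voice, contractions, short words) can be turned into numeric features for a classifier. The following is only a minimal sketch of that idea, not the feature set used in the paper: the function name `style_features`, the feature names, the three-letter cutoff for "short" words, and the regular expressions are all illustrative assumptions. In particular, the passive-voice pattern is a crude heuristic that misses irregular participles, and the contraction pattern also counts possessives such as "John's".

```python
import re

# All names and thresholds below are hypothetical, chosen for illustration only.

# A word, optionally with an internal apostrophe (straight or curly), e.g. "won't".
WORD_RE = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)?")

# Common English contraction endings; also matches possessive 's (known limitation).
CONTRACTION_RE = re.compile(r"\b\w+['’](?:t|s|re|ve|ll|d|m)\b", re.IGNORECASE)

# Crude passive-voice heuristic: a form of "to be" followed by a word ending in -ed or -en.
PASSIVE_RE = re.compile(
    r"\b(?:is|are|was|were|be|been|being)\s+\w+(?:ed|en)\b", re.IGNORECASE
)


def style_features(text: str) -> dict:
    """Return a small dictionary of surface features for one text."""
    words = WORD_RE.findall(text)
    n = len(words) or 1  # avoid division by zero on empty input
    return {
        # Formal texts are expected to use longer words on average.
        "avg_word_length": sum(len(w) for w in words) / n,
        # Share of words that look like contractions (informal marker).
        "contraction_ratio": len(CONTRACTION_RE.findall(text)) / n,
        # Share of words with at most three letters (assumed cutoff).
        "short_word_ratio": sum(1 for w in words if len(w) <= 3) / n,
        # Number of passive-looking constructions (formal marker).
        "passive_count": float(len(PASSIVE_RE.findall(text))),
    }


if __name__ == "__main__":
    print(style_features("I won't call, can't talk now."))
    print(style_features("The proposal was reviewed by the committee."))
```

Feature dictionaries of this kind could then be passed to any supervised learner, such as the Decision Tree, Naïve Bayes, or SVM classifiers named in the abstract.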

