Classifying Documents based on Formal and Informal Writing Styles using Machine Learning Algorithms

K M G S Karunarathna, B T G S Kumara, R A H M Rupasingha

doi:10.1109/icarc54489.2022.9753774

Abstract

With advances in the field of education, much of the information needed for education and other important purposes has become available on the internet. Students and scholars need a variety of formal and informal documents for educational purposes, but the large amount of data makes it difficult to filter useful information from the internet. These documents therefore need labels so that students and scholars can use them efficiently, and document classification can assign such labels to formal and informal documents. Classifying text by formal and informal style with good accuracy is challenging because the linguistic differences are rich. This article proposes a document classification method based on formal and informal writing styles. The experiment used 200 text documents collected from the web in two main categories: formal documents and informal documents. After preprocessing, feature vectors are extracted from the text documents using the Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer, which converts them into a numerical representation suitable for training the model. Classification was performed with Decision Tree (J48), Random Forest, Multilayer Perceptron (MLP), and Support Vector Machine (SVM) classifiers and tested with 5-fold cross-validation. The experimental results for the four classification algorithms indicate that the proposed approach using the Random Forest algorithm can classify the data with 94.97% accuracy, with high precision, recall, and F-measure values and the lowest error compared with the other algorithms.
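The two mechanical steps named in the abstract, TF-IDF weighting and 5-fold cross-validation, can be illustrated with a minimal sketch. It assumes the textbook weighting tf × log(N / df), which may differ from the weighting used in the experiment, and the function names `tf_idf` and `k_fold_indices` and the toy sentences are hypothetical, not taken from the paper. The sketch does not train any of the four classifiers.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: (term count / document length) * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


# Toy, made-up documents for illustration only (not the paper's data).
docs = [
    "we hereby request your approval".split(),
    "hey can you ok this".split(),
    "we request your reply".split(),
]
vectors = tf_idf(docs)

# 200 documents split into 5 folds: each fold serves once as the test set
# while the remaining four folds are used for training.
folds = k_fold_indices(200, k=5)
```

With 200 documents and 5 folds, each fold holds 40 documents, so every model is trained on 160 documents and tested on the remaining 40 in each round. In practice the documents would be shuffled (and ideally stratified by class) before splitting; the contiguous split here is kept only for simplicity.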

Full Text