Media plays a vital role in information dissemination: a huge amount of information spreads quickly through social media and news platforms. The number of online news outlets and the rate at which headlines are produced demand automatic classification tools. Researchers in many countries have explored classification algorithms for their own news. However, very little research has addressed the classification of Pakistani news headlines, and a benchmark dataset of Pakistani news headlines is also lacking. This research aims to find a suitable model for classifying Pakistani news headlines automatically. To that end, we built a benchmark dataset, NCE-2D (News Classification and Emotion Detection Dataset), of 2429 Pakistani news headlines in five categories, extracted from different news websites using ParseHub. First, to reduce noise, the dataset undergoes pre-processing steps: removal of punctuation, stopwords, null entries, and duplicates. Next, features are extracted with two methods, Count-Vectorizer (CV) and TF-IDF Vectorizer, and a set of 10 learning algorithms is applied to each type of feature. Across both combinations, the best-performing algorithms were Support Vector Machine (SVM) on the TF-IDF features and Multi-Layer Perceptron (MLP) on the CV features. These two were then compared to select the more suitable one. SVM with TF-IDF features proved the most appropriate classifier, with 82.16% accuracy against 81.75% for MLP.
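The pre-processing and the two feature-extraction methods named in the abstract can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the stopword list, the sample headlines, and the function names `preprocess`, `count_features`, and `tfidf_features` are all assumptions made for illustration, and the TF-IDF weighting uses a smoothed IDF with L2 row normalization, which is one common variant and may differ from the paper's setup.

```python
import math
import string
from collections import Counter

# Tiny illustrative stopword list (an assumption; the abstract does not give one).
STOPWORDS = {"the", "a", "an", "in", "of", "to", "for", "on", "and", "is"}


def preprocess(headlines):
    """Drop null/empty entries, strip punctuation and stopwords, remove duplicates."""
    strip_punct = str.maketrans("", "", string.punctuation)
    seen, cleaned = set(), []
    for headline in headlines:
        if headline is None or not headline.strip():
            continue  # null or empty entry
        tokens = [t for t in headline.lower().translate(strip_punct).split()
                  if t not in STOPWORDS]
        key = " ".join(tokens)
        if key and key not in seen:  # skip duplicates after cleaning
            seen.add(key)
            cleaned.append(tokens)
    return cleaned


def count_features(docs):
    """Bag-of-words counts: one row per headline, one column per vocabulary term."""
    vocab = sorted({t for doc in docs for t in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[t] for t in vocab])
    return vocab, rows


def tfidf_features(docs):
    """TF-IDF with smoothed IDF, ln((1 + n) / (1 + df)) + 1, and L2-normalized rows."""
    vocab, counts = count_features(docs)
    n = len(docs)
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n) / (1 + d)) + 1 for d in df]
    rows = []
    for row in counts:
        weights = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weights)) or 1.0
        rows.append([x / norm for x in weights])
    return vocab, rows


# Invented sample headlines, not taken from NCE-2D.
headlines = [
    "PM announces new budget for education!",
    None,
    "Pakistan wins the cricket series",
    "PM announces new budget for education!",
    "",
]
docs = preprocess(headlines)
vocab, counts = count_features(docs)
_, tfidf = tfidf_features(docs)
```

Either feature matrix would then be passed to a classifier such as an SVM or an MLP. Count features keep raw term frequencies, while TF-IDF down-weights terms that occur in many headlines.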