Abstract

This study compares machine learning (ML) and deep learning (DL) approaches to Chinese news classification. For ML, we chose Support Vector Machine (SVM) and Naive Bayes (NB) and built three models: Word2Vec-SVM, TFIDF-SVM, and TFIDF-NB. Because NB assumes that words are independent, which conflicts with the distributional word representations of Word2Vec, the Word2Vec-NB combination is excluded. For DL, we adopted Bidirectional Long Short-Term Memory (Bi-LSTM), Long Short-Term Memory (LSTM), and Convolutional Neural Network (CNN), using Word2Vec for word embedding. Experimental results show that with proper word preprocessing, the difference in classification accuracy between the ML and DL models is very small. Although Bi-LSTM achieves the highest accuracy and the lowest loss among the DL techniques, it is also the most time-consuming to train. The study also confirms the strong performance of CNN, although its loss is the highest among the DL models. We further found that Word2Vec-SVM is superior to TFIDF-SVM in efficiency, but its accuracy is not as good as expected. In summary, the classification accuracies of Bi-LSTM, LSTM, CNN, Word2Vec-SVM, TFIDF-SVM, and TFIDF-NB are 89.3%, 88%, 87.54%, 85.32%, 87.35%, and 86.56%, respectively.
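To make the compared pipelines concrete, the following is a minimal sketch of the TFIDF-SVM baseline and a Word2Vec-style Bi-LSTM classifier. It assumes scikit-learn and TensorFlow/Keras, whitespace-pre-tokenized documents (e.g. segmented with jieba), and illustrative hyperparameters; the abstract does not specify any of these details.

```python
# Sketch of the TFIDF-SVM baseline (assumed scikit-learn implementation).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy pre-segmented documents standing in for the Chinese news corpus.
docs = ["股市 上涨 投资者 乐观", "球队 赢得 冠军 比赛", "央行 调整 利率 政策", "球员 转会 引发 热议"]
labels = ["finance", "sports", "finance", "sports"]

tfidf_svm = Pipeline([
    ("tfidf", TfidfVectorizer(token_pattern=r"(?u)\S+")),  # treat whitespace-delimited tokens as words
    ("svm", LinearSVC(C=1.0)),                             # linear kernel is an assumption
])
tfidf_svm.fit(docs, labels)
print(tfidf_svm.predict(["利率 政策 调整"]))  # expected: ['finance']

# Sketch of a Bi-LSTM classifier over word embeddings (assumed Keras implementation).
# In the study, the Embedding weights would be initialized from Word2Vec vectors.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense

num_classes = 2  # hypothetical number of news categories
bilstm = Sequential([
    Embedding(input_dim=50000, output_dim=300),  # vocab size and dimension are placeholders
    Bidirectional(LSTM(128)),
    Dense(num_classes, activation="softmax"),
])
bilstm.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
bilstm.summary()
```

An LSTM or CNN variant differs only in the layer stack between the embedding and the softmax output, which is why, with the same Word2Vec embeddings and preprocessing, their accuracies end up close to one another.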
