The Impact of Data Preparation and Model Complexity on the Natural Language Classification of Chinese News Headlines

Torrey Wagner, Dennis Guhl, Brent Langhals

doi:10.3390/a17040132

Open Access

Journal: Algorithms
Publication Date: Mar 22, 2024
Citations: 2
License type: CC BY 4.0

Abstract

Given the emergence of China as a political and economic power in the 21st century, there is increased interest in analyzing Chinese news articles to better understand developing trends in China. Because of the volume of material, automatically categorizing Chinese-language news articles by their headlines can be an effective way to sort them for efficient review. A 383,000-headline dataset from the Toutiao website, labeled with 15 categories, was evaluated via natural language processing to predict topic categories. The influence of six data preparation variations on the predictive accuracy of four algorithms was studied. The simplest model (Naïve Bayes) achieved 85.1% accuracy on a holdout dataset, while the most complex model (Neural Network using BERT) demonstrated 89.3% accuracy. The most useful data preparation steps were identified. A further goal was to examine the underlying complexity and computational costs of automating the categorization process. The BERT model required 170x more time to train, was slower to predict by a factor of 18,600, and required 27x more disk space to save, indicating it may be the best choice for low-volume applications when the highest accuracy is needed. However, for larger-scale operations where a slight performance degradation is tolerated, the Naïve Bayes algorithm could be the best choice. Nearly one in four records in the Toutiao dataset are duplicates, and this is the first published analysis with duplicates removed.
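
To make the simplest approach concrete, the following is a minimal sketch of duplicate removal followed by a multinomial Naïve Bayes headline classifier. It is not the authors' pipeline: the function and class names (`deduplicate`, `tokenize`, `NaiveBayes`) are hypothetical, the character-level tokenization and add-one smoothing are assumptions, and the four toy headlines and two labels are invented for illustration rather than taken from the Toutiao dataset.

```python
import math
from collections import Counter, defaultdict


def deduplicate(records):
    """Keep the first occurrence of each headline and drop later repeats."""
    seen = set()
    unique = []
    for headline, label in records:
        if headline not in seen:
            seen.add(headline)
            unique.append((headline, label))
    return unique


def tokenize(headline):
    # Assumption: one token per non-space character. The abstract does not
    # state which tokenization the study used.
    return [ch for ch in headline if not ch.isspace()]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, records):
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for headline, label in records:
            self.class_counts[label] += 1
            for tok in tokenize(headline):
                self.token_counts[label][tok] += 1
                self.vocab.add(tok)
        self.total = sum(self.class_counts.values())
        return self

    def predict(self, headline):
        tokens = tokenize(headline)
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus summed log likelihoods of each token.
            score = math.log(n / self.total)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy records (not from the Toutiao dataset); the last one is a duplicate.
records = [
    ("足球比赛结果", "sports"),
    ("篮球比赛直播", "sports"),
    ("股票市场上涨", "finance"),
    ("银行利率下调", "finance"),
    ("足球比赛结果", "sports"),
]
unique = deduplicate(records)
model = NaiveBayes().fit(unique)
print(len(records), "records,", len(unique), "after deduplication")
print(model.predict("足球直播"), model.predict("股票上涨"))
```

Removing duplicates before splitting the data matters because a headline that appears in both the training and holdout sets would inflate the measured holdout accuracy. A model of this kind stores only token counts per class, which is consistent with the small training, prediction, and storage costs the abstract attributes to Naïve Bayes.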

Full Text