Text classification in low-resource languages like Khmer remains challenging due to linguistic complexity, limited annotated data, and noise from real-world applications. This study addresses these challenges by systematically comparing text embedding techniques for Khmer news classification. We evaluate traditional methods (TF-IDF with SVM) against state-of-the-art multilingual transformers (XLM-RoBERTa, LaBSE) using a self-collected dataset of 7,344 Khmer news articles across six categories—political, economic, entertainment, sport, technology, and life. The dataset intentionally retains noise (e.g., mixed-language text, unstructured formatting) to reflect practical scenarios. To address Khmer's lack of word boundaries, we employ word segmentation via khmer-nltk for traditional models, while transformer models leverage their inherent subword tokenization. Experiments reveal that transformer-based embeddings achieve superior performance, with XLM-RoBERTa and LaBSE attaining F1 scores of 94.2% and 94.3%, respectively, outperforming TF-IDF (93.3%). However, the "life" category proves challenging across all models (F1: 85.5–88.1%), likely due to semantic overlap with other categories. Our findings underscore the effectiveness of transformer architectures in capturing contextual nuances for low-resource languages, even with noisy data. This work offers insights for NLP researchers and practitioners, emphasizing the need for domain-specific adaptations and expanded datasets to improve performance in underrepresented languages.
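The abstract outlines a TF-IDF + SVM baseline that relies on explicit Khmer word segmentation. Below is a minimal sketch of such a baseline, assuming khmer-nltk's word_tokenize for segmentation and hypothetical `texts`/`labels` lists standing in for the news dataset; the study's actual preprocessing, hyperparameters, and evaluation protocol may differ.

```python
# Sketch of a TF-IDF + linear SVM baseline for Khmer news classification.
# Assumptions: `texts` is a list of raw Khmer article strings and `labels`
# their category names (hypothetical placeholders, not the paper's data);
# khmer-nltk's word_tokenize is assumed to return a list of word tokens.
from khmernltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def khmer_tokenizer(text):
    # Khmer script has no whitespace word boundaries, so segment explicitly.
    return word_tokenize(text)

pipeline = Pipeline([
    # token_pattern=None silences the warning when a custom tokenizer is used.
    ("tfidf", TfidfVectorizer(tokenizer=khmer_tokenizer, token_pattern=None)),
    ("svm", LinearSVC()),
])

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
pipeline.fit(X_train, y_train)
print("weighted F1:", f1_score(y_test, pipeline.predict(X_test), average="weighted"))
```

The transformer models (XLM-RoBERTa, LaBSE) would skip the explicit segmentation step, since their subword tokenizers operate directly on the raw text.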