A Study on Automatic Topic Classification of Korean Language Learners' Writing Using Machine Learning and Deep Learning Language Models

Jin Lee, Hansaem Kim

doi:10.17296/korbil.2024..96.163

Abstract

This study explores whether the topics of Korean language learners’ writings can be classified automatically using machine learning and deep learning. The Random Forest model, serving as the baseline, achieved an accuracy of 96.5%. Of the two deep learning models, KoBERT fell well below the baseline at 64.25% accuracy, while KoELECTRA slightly outperformed it at 97.25%. A comparison of the three models' topic predictions showed that KoBERT, consistent with its low accuracy, produced predictions that deviated from human intuition and failed on topics that the other two models identified correctly. Random Forest and KoELECTRA showed similar error patterns, with no significant difference in performance between them. All three models had difficulty classifying writings that used general vocabulary instead of topic-specific terms, and they often mispredicted the topic when the content included vocabulary related to other topics. Improving performance will require a detailed analysis of various writing genres and continued experimentation with new data and methodologies.

Full Text