Abstract

This study aimed to determine the number of documents suitable for LDA topic modeling analysis. Sample data were drawn from 7,115 news articles about the high school credit system, published from the announcement of its introduction through 2022. Four methods were employed for analysis. First, a total of 120 samples (6 types, 20 samples each) were created, and the topics extracted from them were examined for concordance with those of the full document set. Second, the AUC of the ROC curve was used to investigate discriminative power between the full document set and the same topic variables analyzed at different document counts. Third, the frequency with which the topics of the full document set reappeared was analyzed in relation to the number of documents. Fourth, after the topics of the full document set were analyzed, samples reflecting each topic's document ratio and weight were created and compared across document counts. The findings indicate that a minimum of approximately 700 documents is required for robust LDA topic modeling analysis. Moreover, the analysis suggests that collecting over 2,000 documents provides sufficient data for reliable results.
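
The first method's sampling design (6 sample types, 20 samples each, 120 in total) can be illustrated with a minimal sketch. The abstract does not state the six sample sizes or how the samples were drawn, so the sizes listed below and the use of simple random sampling without replacement are assumptions made purely for illustration; `draw_subsamples` is a hypothetical helper, not the authors' code.

```python
import random


def draw_subsamples(doc_ids, sizes, repeats, seed=0):
    """Draw `repeats` random subsamples (without replacement) at each size.

    Returns a dict mapping each sample size to a list of subsamples,
    where each subsample is a list of document ids.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return {
        size: [rng.sample(doc_ids, size) for _ in range(repeats)]
        for size in sizes
    }


# 7,115 articles, as reported in the abstract; ids stand in for the texts.
corpus_ids = list(range(7115))

# Hypothetical sample sizes: the abstract only says there were 6 types.
sizes = [100, 300, 500, 700, 1000, 2000]

samples = draw_subsamples(corpus_ids, sizes, repeats=20)
total = sum(len(group) for group in samples.values())
print(total)  # 6 sizes x 20 repeats
```

Each subsample could then be passed to an LDA implementation and its topics compared with those of the full document set.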
