This article presents a new Human-steerable Topic Modeling (HSTM) technique. Unlike existing techniques, which commonly rely on matrix decomposition-based topic models, we extend Latent Dirichlet Allocation (LDA) as the fundamental component for extracting topics. LDA's popularity and its technical characteristics, such as better topic quality and no need to cherry-pick terms when constructing the document-term matrix, make it more broadly applicable. Our research addresses two inherent limitations of LDA. First, LDA's underlying principle is complex, and its computation is stochastic and difficult to control. We therefore propose a weighting method that incorporates users' refinements into Gibbs sampling to steer LDA. Second, LDA often runs on a corpus with a massive number of terms and documents, which forms a vast search space in which users must find semantically relevant or irrelevant objects. We therefore design a visual editing framework based on the coherence metric, which has been shown to be the most consistent with human perception in assessing topic quality, to guide users' interactive refinements. Case studies on two open real-world datasets, participants' performance in a user study, and quantitative experimental results demonstrate the usability and effectiveness of the proposed technique.