Abstract
Background. Topic modelling is a method for the automated probabilistic detection of topics in a text collection. Applying topic modelling to short texts, e.g. tweets or search engine queries, is complicated by their brevity and grammatical flaws, including broken word order, abbreviations, and the mixing of different languages. At the same time, as our research shows, human coding cannot be treated as a baseline for topic quality assessment. Objectives. We use the biterm topic model (BTM) to test how two topic quality metrics that are independent of topic coherence relate to human topic interpretability. Topic modelling is applied to three cases of conflictual Twitter discussions in three different languages, namely the Charlie Hebdo shooting (France), the Ferguson unrest (the USA), and the anti-immigrant bashings in Biryulevo (Russia), which represent, respectively, a global multilingual, a large monolingual, and a mid-range monolingual type of discussion. Method. First, we evaluate the human coding baseline by providing evidence from the Russian case, coded by two pairs of coders with varying levels of knowledge of the case. We then measure the quality of modelling at the level of topics by looking at topic interpretability (as assessed by experienced coders), topic robustness, and topic saliency. Results. The results of the experiment show that: 1) the idea of human coding as a baseline needs to be rejected; 2) topic interpretability, robustness, and saliency can be interrelated; 3) the multilingual discussion performs better than the monolingual ones in terms of the interdependence of the metrics. Conclusion. We formulate the idea of an ‘ideal topic’ that shifts the goal of topic modelling towards finding a smaller number of good topics rather than maximizing the number of interpretable topics.