New metrics and tests for subject prevalence in documents based on topic modeling

Louisa Kontoghiorghes, Ana Colubi

doi:10.1016/j.ijar.2023.02.009

Abstract

The aim is to introduce a metric that quantifies the relevance of specific subjects within a text and to develop a methodology for testing whether this relevance is the same across different written documents. The proposed metric can be used to track the evolution of a subject in a series of documents or to measure the impact of a given text on related literature. To this end, text mining tools are combined with Bayesian and frequentist statistical methods in an innovative way. First, topic modeling based on state-of-the-art techniques is used to identify relevant topics. The derived models are then used, by means of Bayesian techniques, to quantify the relative importance of a subject defined through a given set of terms, or keywords. Next, a two-sample test statistic is proposed to compare the prevalence of subjects in two groups of documents. Given the complexity of the parametric distributions involved, a distribution-free bootstrap approach is suggested. The rationale of the approach is established, and the correctness and consistency of the proposed test are analyzed through simulations. The methodology is applied to assess the impact of EU investment through a project on the related scientific production, and to sentiment analysis.
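To make the idea of a distribution-free two-sample bootstrap test concrete, the following minimal sketch compares the mean of a per-document prevalence score between two groups of documents. It is an illustration under stated assumptions, not the test statistic proposed in the paper: the function name `bootstrap_two_sample_test`, the absolute difference of means as the statistic, the centring scheme used to impose the null hypothesis, and the example scores are all hypothetical choices made here for exposition.

```python
import random
from statistics import fmean


def bootstrap_two_sample_test(x, y, n_boot=2000, seed=0):
    """Bootstrap test of H0: both groups have the same mean prevalence.

    x, y: per-document prevalence scores for the two groups (assumed inputs).
    Returns the observed statistic and a bootstrap p-value.
    """
    rng = random.Random(seed)
    mean_x, mean_y = fmean(x), fmean(y)
    # Illustrative statistic: absolute difference of group means.
    observed = abs(mean_x - mean_y)

    # Centre each group at zero so that resamples satisfy the null
    # hypothesis of equal means while keeping each group's own spread.
    xc = [v - mean_x for v in x]
    yc = [v - mean_y for v in y]

    exceed = 0
    for _ in range(n_boot):
        # Resample with replacement within each centred group.
        xb = rng.choices(xc, k=len(xc))
        yb = rng.choices(yc, k=len(yc))
        if abs(fmean(xb) - fmean(yb)) >= observed:
            exceed += 1

    # Add-one correction keeps the p-value strictly positive.
    p_value = (1 + exceed) / (n_boot + 1)
    return observed, p_value


# Hypothetical prevalence scores for two groups of documents.
group_a = [0.10, 0.12, 0.08, 0.15, 0.11, 0.09]
group_b = [0.21, 0.18, 0.25, 0.19, 0.22, 0.20]
stat, p = bootstrap_two_sample_test(group_a, group_b)
print(f"statistic={stat:.3f}, p-value={p:.4f}")
```

Resampling from the centred groups, rather than from the pooled data, only imposes equality of means and lets the two groups differ in shape and variance, which is one common way to avoid parametric assumptions in this kind of comparison.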

Full Text