Abstract

Topic modeling extracts latent topics that reflect market information from massive collections of financial news and is widely used in data mining and economic research. Traditional approaches such as Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) ignore contextual semantics, and short texts suffer from feature sparsity. We develop a topic clustering model based on a BERT-LDA joint embedding that accounts for both contextual semantics and thematic narrative. We cluster the document embeddings with the HDBSCAN algorithm and use a class-based TF-IDF (c-TF-IDF) method to create topic representations. Empirical results show that the BERT-LDA model is competitive with traditional and single-embedding topic models, generating topic words that are coherent within each topic and distinct across topics.
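The c-TF-IDF step mentioned in the abstract can be sketched in pure Python. The sketch below assumes the upstream steps (BERT-LDA joint embedding and HDBSCAN clustering) have already assigned each document to a cluster, and uses the common BERTopic-style weighting tf(t, c) · log(1 + A / f(t)), where A is the average number of words per cluster and f(t) is the term's total frequency; the toy documents and cluster labels are illustrative assumptions, not data from the paper.

```python
# Minimal c-TF-IDF sketch. Assumes documents are already tokenized and
# assigned to clusters (e.g., by HDBSCAN over joint embeddings).
import math
from collections import Counter

def c_tf_idf(clusters):
    """clusters: dict mapping cluster id -> list of tokenized documents.
    Returns dict mapping cluster id -> {term: c-TF-IDF score}."""
    # Merge each cluster's documents into one "class document" and count terms.
    class_counts = {c: Counter(tok for doc in docs for tok in doc)
                    for c, docs in clusters.items()}
    # Frequency of each term across all clusters.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # Average number of words per cluster (the "A" in the weighting formula).
    avg_words = sum(total_counts.values()) / len(class_counts)
    scores = {}
    for c, counts in class_counts.items():
        n_words = sum(counts.values())
        # Term frequency within the cluster, down-weighted by how common
        # the term is across all clusters.
        scores[c] = {t: (f / n_words) * math.log(1 + avg_words / total_counts[t])
                     for t, f in counts.items()}
    return scores

# Toy example: two clusters of tokenized financial headlines (hypothetical).
clusters = {
    0: [["rates", "rise"], ["fed", "rates", "hike"]],
    1: [["oil", "prices", "fall"], ["oil", "supply"]],
}
scores = c_tf_idf(clusters)
top0 = max(scores[0], key=scores[0].get)
top1 = max(scores[1], key=scores[1].get)
print(top0, top1)  # cluster-distinctive terms: "rates" and "oil"
```

Because terms shared across clusters get a lower idf factor, each cluster's top-scoring words are the ones that distinguish it, which is what produces topic representations that are coherent within a topic and dissimilar across topics.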
