Abstract

Topic modeling extracts latent topics that reflect market information from massive collections of financial news and is widely used in data mining and economic research. Traditional approaches such as Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) ignore contextual semantics, and short texts suffer from feature sparsity. We develop a topic clustering model based on a BERT-LDA joint embedding that accounts for both contextual semantics and thematic narrative. We cluster the document embeddings with the HDBSCAN algorithm and use a class-based TF-IDF (c-TF-IDF) method to create topic representations. Empirical results show that the BERT-LDA model is competitive with traditional and single-embedding topic models, generating topic words that are coherent within each topic and distinct across topics.
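The c-TF-IDF step mentioned in the abstract can be sketched in pure Python. The sketch below assumes the upstream steps (BERT-LDA joint embedding and HDBSCAN clustering) have already assigned each document to a cluster, and uses the common BERTopic-style weighting tf(t, c) · log(1 + A / f(t)), where A is the average number of words per cluster and f(t) is the term's total frequency; the toy documents and cluster labels are illustrative assumptions, not data from the paper.

```python
# Minimal c-TF-IDF sketch. Assumes documents are already tokenized and
# assigned to clusters (e.g., by HDBSCAN over joint embeddings).
import math
from collections import Counter

def c_tf_idf(clusters):
    """clusters: dict mapping cluster id -> list of tokenized documents.
    Returns dict mapping cluster id -> {term: c-TF-IDF score}."""
    # Merge each cluster's documents into one "class document" and count terms.
    class_counts = {c: Counter(tok for doc in docs for tok in doc)
                    for c, docs in clusters.items()}
    # Frequency of each term across all clusters.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # Average number of words per cluster (the "A" in the weighting formula).
    avg_words = sum(total_counts.values()) / len(class_counts)
    scores = {}
    for c, counts in class_counts.items():
        n_words = sum(counts.values())
        # Term frequency within the cluster, down-weighted by how common
        # the term is across all clusters.
        scores[c] = {t: (f / n_words) * math.log(1 + avg_words / total_counts[t])
                     for t, f in counts.items()}
    return scores

# Toy example: two clusters of tokenized financial headlines (hypothetical).
clusters = {
    0: [["rates", "rise"], ["fed", "rates", "hike"]],
    1: [["oil", "prices", "fall"], ["oil", "supply"]],
}
scores = c_tf_idf(clusters)
top0 = max(scores[0], key=scores[0].get)
top1 = max(scores[1], key=scores[1].get)
print(top0, top1)  # cluster-distinctive terms: "rates" and "oil"
```

Because terms shared across clusters get a lower idf factor, each cluster's top-scoring words are the ones that distinguish it, which is what produces topic representations that are coherent within a topic and dissimilar across topics.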
