Mining an English-Chinese parallel Dataset of Financial News

Nicolas Turenne, Guitao Fan, Siyuan Wang, Jianlong Li, Ziwei Chen, Jiaqi Zhou, Yiwen Li

doi:10.5334/johd.62

Abstract

Parallel text datasets are valuable for educational purposes, machine translation, and cross-language information retrieval, but few are domain-oriented. We have created a Chinese–English parallel dataset in the domain of finance technology, using the <em>Financial Times</em> website, from which we collected 60,473 news items published between 2007 and 2021. The dataset pairs Chinese and English versions of news in the domain of finance. It is open access in its original state, without transformation. It was built not for machine translation, the usual use of such datasets, but for intelligent mining: we conducted many experiments using up-to-date text mining techniques, including clustering (topic modeling, community detection, <em>k</em>-means), topic prediction (naive Bayes, SVM, LSTM, BERT), and pattern discovery (dictionary-based, time series). We present the use of these techniques as a framework for other studies, covering not only their application but also the interpretation of their results.

Full Text