C-BiLDA extracting cross-lingual topics from non-parallel texts by distinguishing shared from unshared content

Geert Heyman, Ivan Vulić, Marie-Francine Moens

doi:10.1007/s10618-015-0442-x

Abstract

We study the problem of extracting cross-lingual topics from non-parallel multilingual text datasets with partially overlapping thematic content (e.g., aligned Wikipedia articles in two different languages). To this end, we develop a new bilingual probabilistic topic model called comparable bilingual latent Dirichlet allocation (C-BiLDA), which is able to deal with such comparable data, and, unlike the standard bilingual LDA model (BiLDA), does not assume the availability of document pairs with identical topic distributions. We present a full overview of C-BiLDA, and show its utility in the task of cross-lingual knowledge transfer for multi-class document classification on two benchmarking datasets for three language pairs. The proposed model outperforms the baseline LDA model, as well as the standard BiLDA model and two standard low-rank approximation methods (CL-LSI and CL-KCCA) used in previous work on this task.
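The knowledge-transfer setting described in the abstract can be illustrated with a minimal sketch. It assumes that a bilingual topic model has already mapped documents of both languages into one shared topic space, so that a classifier trained on labeled source-language documents can be applied directly to target-language documents. The topic vectors, class labels, and the nearest-centroid classifier below are all hypothetical choices made for illustration; they are not taken from the paper, and the paper's actual classifier and data may differ.

```python
from collections import defaultdict
from math import sqrt


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def train_centroids(topic_vectors, labels):
    """Compute one centroid per class from labeled source-language documents."""
    by_class = defaultdict(list)
    for vector, label in zip(topic_vectors, labels):
        by_class[label].append(vector)
    return {label: centroid(vectors) for label, vectors in by_class.items()}


def classify(topic_vector, centroids):
    """Assign the class whose centroid is most similar to the document."""
    return max(centroids, key=lambda label: cosine(topic_vector, centroids[label]))


# Hypothetical per-document topic distributions over 3 shared topics.
# In practice these would be inferred by a bilingual topic model.
source_docs = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.2, 0.1, 0.7]]
source_labels = ["sports", "sports", "politics", "politics"]

# Unlabeled target-language documents, represented in the same topic space.
target_docs = [[0.75, 0.15, 0.10], [0.05, 0.15, 0.80]]

centroids = train_centroids(source_docs, source_labels)
predictions = [classify(doc, centroids) for doc in target_docs]
print(predictions)
```

Because both languages share one set of topics, no translation of the target documents is needed at classification time; the quality of the transfer then depends on how well the topic model aligns the two languages.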


Journal: Data Mining and Knowledge Discovery
Publication Date: Nov 13, 2015
Citations: 17
License type: mit
