Abstract

Cross-domain text classification aims to automatically train an accurate text classifier for a target domain by using labeled text data from a related source domain. To this end, the distribution gap between the domains has to be reduced. In previous work, a certain number of shared latent features (e.g., latent topics, principal components, etc.) are extracted to represent documents from different domains and thus reduce the distribution gap. However, relying only on the shared latent features as the bridge between domains may limit the amount of knowledge transferred. This limitation is more serious when the distribution gap is so large that only a small number of latent features can be shared between domains. In this paper, we propose a novel approach named Topic Correlation Analysis (TCA), which extracts both the shared and the domain-specific latent features to facilitate effective knowledge transfer. In TCA, all word features are first grouped into shared and domain-specific topics using a joint mixture model. Then the correlations between the two kinds of topics are inferred and used to induce a mapping between the domain-specific topics of the different domains. Finally, both the shared and the mapped domain-specific topics are used to span a new shared feature space in which the supervised knowledge can be effectively transferred. Experimental results on two real-world data sets demonstrate the superiority of the proposed method over state-of-the-art baselines.
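
The mapping step described above can be illustrated with a small sketch. It is not the paper's algorithm: the joint mixture model is skipped entirely, and the correlation measure (cosine similarity), the hard one-to-one-style assignment, and all names (`cosine`, `map_specific_topics`, `to_shared_space`, `src_assoc`, `tgt_assoc`) are assumptions made for illustration. Each domain-specific topic is assumed to come with a vector of association strengths to the shared topics; source-specific topics are then matched to the target-specific topic whose association profile is most similar, and a source document is re-expressed in a space spanned by the shared topics plus the aligned specific topics.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def map_specific_topics(src_assoc, tgt_assoc):
    """For each source-specific topic, return the index of the target-specific
    topic whose association profile over the shared topics is most similar.

    Assumption: rows are domain-specific topics, columns are shared topics.
    """
    mapping = []
    for s in src_assoc:
        sims = [cosine(s, t) for t in tgt_assoc]
        mapping.append(max(range(len(sims)), key=sims.__getitem__))
    return mapping


def to_shared_space(shared_w, specific_w, mapping, n_aligned):
    """Represent a source document as shared-topic weights followed by its
    specific-topic weights moved into the aligned (target-side) slots."""
    aligned = [0.0] * n_aligned
    for k, w in enumerate(specific_w):
        aligned[mapping[k]] += w
    return list(shared_w) + aligned


# Hypothetical association strengths between specific and shared topics.
src_assoc = [[0.9, 0.1], [0.2, 0.8]]  # 2 source-specific topics x 2 shared topics
tgt_assoc = [[0.1, 0.9], [0.8, 0.2]]  # 2 target-specific topics x 2 shared topics

mapping = map_specific_topics(src_assoc, tgt_assoc)
features = to_shared_space([0.5, 0.5], [0.7, 0.3], mapping, n_aligned=2)
print(mapping, features)
```

Target documents would use the identity mapping in this sketch, so that source and target documents end up in the same feature space and a classifier trained on the former can be applied to the latter.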
