Abstract

Cross-lingual text classification (CLTC) is the task of classifying documents written in different languages into the same taxonomy of categories. This paper presents a novel approach to CLTC that builds on model distillation, adapting and extending a framework originally proposed for model compression. Using soft probabilistic predictions for the documents in a label-rich language as the (induced) supervisory labels in a parallel corpus of documents, we successfully train classifiers for new languages in which labeled training data are not available. An adversarial feature adaptation technique is also applied during model training to reduce distribution mismatch. We conducted experiments on two benchmark CLTC datasets, treating English as the source language and German, French, Japanese and Chinese as the unlabeled target languages. The proposed approach achieved advantageous or comparable performance relative to other state-of-the-art methods.
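
The distillation objective described above can be sketched roughly as follows, assuming a PyTorch implementation: the teacher's temperature-scaled predictions on the source-language side of a parallel corpus serve as soft targets for a student classifier trained on the aligned target-language documents. The `teacher`, `student`, and `temperature` names are illustrative assumptions rather than the authors' code, and the adversarial feature-adaptation term is omitted for brevity.

```python
# Minimal sketch of soft-label distillation for cross-lingual transfer (assumed PyTorch).
# `teacher` is a trained source-language classifier; `student` is the target-language
# classifier being trained. Both are assumed to return class logits.
import torch
import torch.nn.functional as F

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # The temperature-softened teacher distribution acts as the induced supervision.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # T^2 scaling keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2

def train_step(teacher, student, src_docs, tgt_docs, optimizer, temperature=2.0):
    # src_docs and tgt_docs are aligned batches from a parallel corpus.
    with torch.no_grad():
        teacher_logits = teacher(src_docs)   # soft probabilistic predictions on the source side
    student_logits = student(tgt_docs)       # student predictions on the aligned target side
    loss = distillation_loss(teacher_logits, student_logits, temperature)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this view the student never sees gold labels in the target language; all supervision comes from the teacher's soft predictions transferred through the parallel corpus.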

Highlights

  • The availability of massive multilingual data on the Internet makes cross-lingual text classification (CLTC) increasingly important

  • Many other languages do not have the rich amounts of labeled data available for a language like English; this leads to an open challenge in CLTC, i.e., how can we effectively leverage trained classifiers in a label-rich source language to help the classification of documents in other label-poor target languages?

  • One branch of CLTC methods uses lexical-level mappings to transfer knowledge from the source language to the target language (see the sketch after this list)
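
For contrast with the distillation approach, the lexical-level mapping idea in the last highlight can be sketched as below: target-language words are mapped to source-language words through a bilingual dictionary, and a classifier trained on source-language bag-of-words features is reused directly. The dictionary entries, vectorizer, and classifier are hypothetical illustrations, not part of the paper.

```python
# Hedged sketch of a lexical-level mapping baseline for CLTC (assumed scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical bilingual dictionary (German -> English) used as the lexical mapping.
bilingual_dict = {"gut": "good", "schlecht": "bad", "film": "movie"}

def map_to_source(target_tokens):
    # Translate each target-language token; drop tokens without a dictionary entry.
    return " ".join(bilingual_dict[t] for t in target_tokens if t in bilingual_dict)

# Train a classifier on labeled source-language (English) documents.
vectorizer = CountVectorizer()
X_src = vectorizer.fit_transform(["good movie", "bad movie"])
classifier = LogisticRegression().fit(X_src, [1, 0])

# Classify a target-language (German) document by mapping it into English words first.
mapped_doc = map_to_source(["gut", "film"])
prediction = classifier.predict(vectorizer.transform([mapped_doc]))
```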

Summary

Introduction

The availability of massive multilingual data on the Internet makes cross-lingual text classification (CLTC) increasingly important. The task is defined as classifying documents in different languages using the same taxonomy of predefined categories. CLTC systems built on supervised machine learning require a sufficient amount of labeled training data for every domain of interest in each language. In reality, labeled data are not evenly distributed among languages and across domains. For example, English is a label-rich language in the domains of news stories, Wikipedia pages and reviews of hotels, products, etc. Many other languages do not necessarily have such rich amounts of labeled data. This leads to an open challenge in CLTC, i.e., how can we effectively leverage the trained classifiers in a label-rich source language to help the classification of documents in other label-poor target languages?
