Abstract
Funnelling (Fun) is a recently proposed method for cross-lingual text classification (CLTC) based on a two-tier learning ensemble for heterogeneous transfer learning (HTL). In this ensemble method, 1st-tier classifiers, each working on a different and language-dependent feature space, return a vector of calibrated posterior probabilities (with one dimension for each class) for each document, and the final classification decision is taken by a meta-classifier that uses this vector as its input. The meta-classifier can thus exploit class-class correlations, and this (among other things) gives Fun an edge over CLTC systems in which these correlations cannot be brought to bear. In this article, we describe Generalized Funnelling (gFun), a generalization of Fun consisting of an HTL architecture in which 1st-tier components can be arbitrary view-generating functions, i.e., language-dependent functions that each produce a language-independent representation ("view") of the (monolingual) document. We describe an instance of gFun in which the meta-classifier receives as input a vector of calibrated posterior probabilities (as in Fun) aggregated to other embedded representations that embody other types of correlations, such as word-class correlations (as encoded by Word-Class Embeddings), word-word correlations (as encoded by Multilingual Unsupervised or Supervised Embeddings), and word-context correlations (as encoded by multilingual BERT). We show that this instance of gFun substantially improves over Fun and over state-of-the-art baselines by reporting experimental results obtained on two large, standard datasets for multilingual multilabel text classification. Our code that implements gFun is publicly available.
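To make the architecture concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of how per-language view-generating functions might map a monolingual document vector into language-independent views that are then aggregated for the meta-classifier. The function names, the softmax stand-in for a calibrated classifier, and the random projection standing in for an embedding view are all illustrative assumptions.

```python
import numpy as np

def posteriors_view(doc_vec, W):
    """1st-tier view: one score per class, as in Fun.
    A softmax over a linear scorer stands in here for a real
    calibrated classifier (an assumption for illustration)."""
    z = doc_vec @ W
    e = np.exp(z - z.max())
    return e / e.sum()

def embedding_view(doc_vec, E):
    """A second view: project the document into a shared embedding
    space (a stand-in for MUSE or Word-Class Embeddings)."""
    return doc_vec @ E

def gfun_representation(doc_vec, W, E):
    """Aggregate the views by concatenation; the meta-classifier
    would be trained on these language-independent vectors."""
    return np.concatenate([posteriors_view(doc_vec, W),
                           embedding_view(doc_vec, E)])

# Toy dimensions and data, purely for illustration.
rng = np.random.default_rng(0)
n_features, n_classes, emb_dim = 5, 3, 4
doc = rng.random(n_features)            # a (monolingual) document vector
W = rng.random((n_features, n_classes))  # per-language 1st-tier weights
E = rng.random((n_features, emb_dim))    # per-language projection matrix

rep = gfun_representation(doc, W, E)
print(rep.shape)  # (n_classes + emb_dim,)
```

In the actual system each language would supply its own `W` and `E` (or learned equivalents), so that documents from every language land in the same meta-classifier input space.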