Transductive transfer learning based Genetic Programming for balanced and unbalanced document classification using different types of features

Wenlong Fu,Bing Xue,Xiaoying Gao,Mengjie Zhang

doi:10.1016/j.asoc.2021.107172

Abstract

Document classification is one of the predominant tasks in Natural Language Processing. However, some document classification tasks do not have ground truth while other similar datasets may have ground truth. Transfer learning can utilize similar datasets with ground truth to train effective classifiers on the dataset without ground truth. This paper introduces a transductive transfer learning method for document classification using two different text feature representations—the term frequency (TF) and the semantic feature doc2vec. It has three main contributions. First, it enables the sharing knowledge in a dataset using TF and a dataset using doc2vec in transductive transfer learning for performance improvement. Second, it demonstrates that the partially learned programs from TFs and from doc2vecs can be alternatively used to “label then learn” and they improve each other. Lastly, it addresses the unbalanced dataset problem by considering the unbalanced distributions on categories for evolving proper Genetic Programming (GP) programs on the target domains. Our experimental results on two popular document datasets show that the proposed technique effectively transfers knowledge from the GP programs evolved from the source domains to the new GP programs on the target domains using TF or doc2vec. There are obviously more than 10 percentages improvement achieved by the GP programs evolved by the proposed method over the GP programs directly evolved from the source domains. Also, the proposed technique effectively utilizes GP programs evolved from unbalanced datasets (on the source and target domains) to evolve new GP programs on the target domains, which balances predictions on different categories.

Full Text