Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec

Donghwa Kim, Deokseong Seo, Suhyoun Cho, Pilsung Kang

doi:10.1016/j.ins.2018.10.006

Abstract

The purpose of document classification is to assign the most appropriate label to a given document. The main challenges in document classification are insufficient label information and the unstructured, sparse format of text. A semi-supervised learning (SSL) approach could be an effective solution to the former problem, whereas considering multiple document representation schemes can resolve the latter. Co-training is a popular SSL method that exploits multiple views of the same example, where each view is a distinct feature subset. In this paper, we propose multi-co-training (MCT) to improve the performance of document classification. To increase the variety of feature sets available for classification, we transform each document using three representation methods: term frequency–inverse document frequency (TF–IDF) based on the bag-of-words scheme, topic distributions based on latent Dirichlet allocation (LDA), and a neural-network-based document embedding known as document to vector (Doc2Vec). The experimental results demonstrate that the proposed MCT is robust to parameter changes and outperforms benchmark methods under various conditions.
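The co-training idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' MCT algorithm, whose details the abstract does not give. It assumes a generic co-training loop in which each view trains its own classifier and contributes its single most confident pseudo-label per round to a shared labeled pool. The nearest-centroid classifier, the distance-margin confidence score, the function names (centroids, predict, co_train), and the toy vectors standing in for TF–IDF, LDA, and Doc2Vec features are all hypothetical choices made for illustration.

```python
from math import dist


def centroids(X, y):
    """Return the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def predict(cents, x):
    """Return (label, confidence) for x using nearest-centroid classification.

    Confidence is the gap between the two smallest centroid distances
    (an assumed, simple stand-in for a classifier's confidence score).
    """
    ranked = sorted((dist(x, c), label) for label, c in cents.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else 0.0
    return ranked[0][1], margin


def co_train(views, labels, rounds=10):
    """Generic multi-view co-training sketch.

    views  : list of feature matrices, one per document representation,
             all aligned by document index.
    labels : dict {doc_index: label} for the initially labeled documents.
    """
    labels = dict(labels)
    n = len(views[0])
    for _ in range(rounds):
        unlabeled = [i for i in range(n) if i not in labels]
        if not unlabeled:
            break
        new = {}
        for X in views:
            idx = sorted(labels)
            cents = centroids([X[i] for i in idx], [labels[i] for i in idx])
            # This view nominates the unlabeled document it is most sure about.
            best = max(unlabeled, key=lambda i: predict(cents, X[i])[1])
            # If two views nominate the same document, the first view wins.
            new.setdefault(best, predict(cents, X[best])[0])
        labels.update(new)
    return labels


# Toy stand-ins for three representations of six documents (made-up numbers).
view_a = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
view_b = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.1, 0.9], [0.2, 0.8], [0.3, 0.7]]
view_c = [[2.0], [1.8], [1.5], [-2.0], [-1.8], [-1.5]]

# Only documents 0 and 3 start with labels; the rest are pseudo-labeled.
result = co_train([view_a, view_b, view_c], {0: "sports", 3: "politics"})
```

In practice the three views would come from real TF–IDF, LDA, and Doc2Vec features, and the per-view classifier and confidence rule would be chosen to suit the data; the loop structure is the only part this sketch is meant to convey.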

Journal: Information Sciences
Publication Date: Oct 11, 2018
Citations: 312