Abstract

Positive and Unlabeled Learning (PUL) uses unlabeled documents and a few positive documents to retrieve a set of documents of interest from a text collection. PUL approaches are usually based on the vector space model. However, for semi-supervised text classification and information retrieval, graph-based approaches have been shown to outperform approaches based on the vector space model. Therefore, this article proposes a graph-based approach for PUL: Label Propagation for Positive and Unlabeled Learning (LP-PUL). The proposed framework consists of three steps: (i) building a similarity graph, (ii) identifying reliable negative documents, and (iii) performing label propagation to classify the remaining unlabeled documents as positive or negative. We carried out experiments to measure the impact of the different choices in each step of the framework. We also demonstrated that the proposal surpasses the classification performance of other PUL algorithms (RC-SVM, PU-LP, and PE-PUC) and one-class learning algorithms (k-NN-based, k-Means-based, and Dense Autoencoder) in terms of F1. Considering the best results achieved by any algorithm in the experimental evaluation, LP-PUL improves classification performance by 2% when using only 1 labeled document and by up to 28% when 30 labeled documents are employed.
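The three-step framework described above can be illustrated with a minimal sketch. The specific choices below (TF-IDF features, a cosine-similarity graph, selecting reliable negatives as the unlabeled documents least similar to the positives, and scikit-learn's LabelSpreading for propagation) are assumptions for illustration, not the exact configuration evaluated in the article.

```python
# Minimal sketch of a three-step LP-PUL-style pipeline (illustrative assumptions).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.semi_supervised import LabelSpreading

def lp_pul_sketch(docs, positive_idx, n_reliable_neg=10, k=15):
    # Represent documents; the vector space is used only to build the graph.
    X = TfidfVectorizer().fit_transform(docs)

    # Step (i): build a similarity graph from pairwise cosine similarities.
    sim = cosine_similarity(X)
    np.fill_diagonal(sim, 0.0)

    # Step (ii): pick reliable negatives as the unlabeled documents with the
    # lowest average similarity to the labeled positive documents.
    pos = np.asarray(positive_idx)
    unlabeled = np.setdiff1d(np.arange(len(docs)), pos)
    avg_sim_to_pos = sim[unlabeled][:, pos].mean(axis=1)
    reliable_neg = unlabeled[np.argsort(avg_sim_to_pos)[:n_reliable_neg]]

    # Step (iii): propagate labels over a k-NN graph; -1 marks unlabeled nodes.
    y = np.full(len(docs), -1)
    y[pos] = 1
    y[reliable_neg] = 0
    model = LabelSpreading(kernel="knn", n_neighbors=k)
    model.fit(X.toarray(), y)
    return model.transduction_  # 1 = positive, 0 = negative
```

In practice, each step admits different choices (graph construction, reliable-negative heuristic, propagation algorithm), which is exactly what the experiments in the article vary.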
