Abstract

Positive and Unlabeled Learning (PUL) uses unlabeled documents and a few labeled positive documents to retrieve a set of documents of interest from a text collection. PUL approaches are usually based on the vector space model. However, in semi-supervised learning for text classification and information retrieval, graph-based approaches have been shown to outperform approaches based on the vector space model. Therefore, this article proposes a graph-based approach for PUL: Label Propagation for Positive and Unlabeled Learning (LP-PUL). The proposed framework consists of three steps: (i) building a similarity graph, (ii) identifying reliable negative documents, and (iii) performing label propagation to classify the remaining unlabeled documents as positive or negative. We carried out experiments to measure the impact of the different choices in each step of the framework. We also demonstrated that the proposal surpasses, in terms of F1, the classification performance of other PUL algorithms (RC-SVM, PU-LP, and PE-PUC) and one-class learning algorithms (k-NN-based, k-Means-based, and Dense Autoencoder). Compared with the best result of any other algorithm in the experimental evaluation, LP-PUL improves classification performance by 2% when only 1 labeled document is used and by up to 28% when 30 labeled documents are employed.
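As a rough illustration of the three-step pipeline summarized above, the sketch below computes cosine similarities over TF-IDF vectors, takes the unlabeled documents least similar to the positives as reliable negatives, and propagates labels with scikit-learn's LabelPropagation. The similarity measure, the reliable-negative heuristic, the function name, and all parameter values are illustrative assumptions, not the exact choices made in the article.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.semi_supervised import LabelPropagation


def lp_pul_sketch(docs, positive_idx, n_reliable_neg=5, n_neighbors=7):
    # Step (i): represent documents and compute pairwise similarities
    # (a stand-in for the similarity graph of the framework).
    X = TfidfVectorizer().fit_transform(docs)
    sim = cosine_similarity(X)

    # Step (ii): take as "reliable negatives" the unlabeled documents with
    # the lowest average similarity to the labeled positives (an assumed
    # heuristic, not necessarily the one used in the article).
    unlabeled_idx = [i for i in range(len(docs)) if i not in positive_idx]
    avg_sim_to_pos = sim[unlabeled_idx][:, list(positive_idx)].mean(axis=1)
    order = np.argsort(avg_sim_to_pos)
    reliable_neg = [unlabeled_idx[i] for i in order[:n_reliable_neg]]

    # Step (iii): propagate labels over a k-NN graph; -1 marks unlabeled.
    y = np.full(len(docs), -1)
    y[list(positive_idx)] = 1
    y[reliable_neg] = 0
    model = LabelPropagation(kernel="knn", n_neighbors=n_neighbors)
    model.fit(X.toarray(), y)
    return model.transduction_  # predicted 0/1 label for every document
```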
