Improving Large-Scale k-Nearest Neighbor Text Categorization with Label Autoencoders

Francisco J Ribadas-Pena, Víctor M Darriba Bilbao, Shuyuan Cao

doi:10.3390/math10162867

Open Access

https://doi.org/10.3390/math10162867

Journal: Mathematics
Publication Date: Aug 11, 2022
License type: CC BY

Affiliation: Universidade de Vigo

Abstract

In this paper, we introduce a multi-label lazy learning approach to automatic semantic indexing of large document collections whose label vocabularies are complex, structured, and highly inter-correlated. The proposed method is an evolution of the traditional k-Nearest Neighbors algorithm that uses a large autoencoder trained to map the large label space to a reduced-size latent space and to regenerate the predicted labels from this latent space. We have evaluated our proposal on a large portion of the MEDLINE biomedical document collection, which uses the Medical Subject Headings (MeSH) thesaurus as a controlled vocabulary. In our experiments, we propose and evaluate several document representation approaches and different label autoencoder configurations.
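The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: the trained autoencoder is replaced by a linear encoder/decoder obtained from a truncated SVD, neighbors are found by cosine similarity over non-negative document vectors, and the neighbors' latent label codes are combined by a similarity-weighted average before decoding. The names `fit_linear_autoencoder` and `knn_predict_labels`, the threshold of 0.5, and the toy data are all hypothetical.

```python
import numpy as np


def fit_linear_autoencoder(Y, latent_dim):
    """Assumed stand-in for the label autoencoder: a linear map from truncated SVD."""
    mean = Y.mean(axis=0)
    # Rows of Vt are the principal directions of the centered label matrix.
    _, _, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    W = Vt[:latent_dim].T  # shape: (n_labels, latent_dim)

    def encode(labels):
        return (labels - mean) @ W

    def decode(z):
        return z @ W.T + mean

    return encode, decode


def knn_predict_labels(x, X_train, Y_train, encode, decode, k=2, threshold=0.5):
    """Average the k nearest neighbors' latent label codes, then decode to label scores."""
    norms = np.linalg.norm(X_train, axis=1) * np.linalg.norm(x) + 1e-12
    sims = (X_train @ x) / norms  # cosine similarity to every training document
    nearest = np.argsort(-sims)[:k]  # indices of the k most similar documents
    weights = np.clip(sims[nearest], 0.0, None)  # ignore negative similarities
    weights = weights / (weights.sum() + 1e-12)
    z = weights @ encode(Y_train[nearest])  # similarity-weighted latent code
    scores = decode(z)  # regenerate scores in the full label space
    return scores >= threshold, scores


# Toy data (hypothetical): two document clusters with correlated label pairs.
X_train = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
Y_train = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]], dtype=float)

encode, decode = fit_linear_autoencoder(Y_train, latent_dim=1)
predicted, scores = knn_predict_labels(np.array([1.0, 0.05]), X_train, Y_train, encode, decode)
print(predicted)
```

Averaging in the latent space rather than over raw label vectors is what lets the decoder exploit inter-label correlation when it regenerates the label set; a non-linear autoencoder, as described in the abstract, would take the place of the SVD-based stand-in used here.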
