Abstract

Topic models and neural networks can discover meaningful low-dimensional latent representations of text corpora; as such, they have become a key technology for document representation. However, such models treat all documents as non-discriminative, making each latent representation dependent on every other document and preventing the models from producing discriminative document representations. To address this problem, we propose a semi-supervised manifold-inspired autoencoder that extracts meaningful latent representations of documents, taking the local perspective that the latent representations of nearby documents should be correlated. We first determine the set of discriminative neighbors using Euclidean distance in the observation space. The autoencoder is then trained by jointly minimizing the Bernoulli cross-entropy error between input and output and the sum of squared errors between the neighbors of the input and the output. Experiments on two widely used corpora show that our method yields at least a 15% improvement in document clustering and a nearly 7% improvement in classification tasks compared with baseline methods. The evidence demonstrates that our method readily captures more discriminative latent representations of new documents. Moreover, meaningful combinations of words can be efficiently discovered from the activating features, which promotes the comprehensibility of the latent representation.

Highlights

  • The performance of document analysis and processing systems based on machine learning methods, such as classification[1][2], clustering[3][4], content analysis[5], textual similarity[6], and statistical machine translation (SMT)[7], is heavily dependent on the level of document representation (DR), as different representations may capture and disentangle different degrees of explanatory ingredients hidden in the documents[8].

  • Supposing that such latent document representations depend strongly on their neighbors, we first represent each document as a word-count vector under the bag-of-words model and select the set of discriminative neighbors using Euclidean distance in the observation space.

  • We propose a semi-supervised manifold-inspired method, namely the locally embedding autoencoder (LEAE), for document representation.
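The neighbor-selection step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `discriminative_neighbors`, the choice of `k`, and the use of labels to restrict the candidate set are assumptions about how "discriminative neighbors" are chosen in a semi-supervised setting.

```python
import numpy as np

def discriminative_neighbors(X, labels, i, k=5):
    """Illustrative sketch (not the paper's code): return the indices of the
    k nearest neighbors of document i, measured by Euclidean distance between
    word-count vectors, restricted to documents sharing document i's label."""
    same = np.flatnonzero(labels == labels[i])
    same = same[same != i]                          # exclude the document itself
    dists = np.linalg.norm(X[same] - X[i], axis=1)  # Euclidean distance in observation space
    return same[np.argsort(dists)[:k]]

# Toy corpus: 4 documents over a 3-word vocabulary (bag-of-words count vectors)
X = np.array([[2, 0, 1],
              [1, 0, 1],
              [0, 3, 0],
              [2, 1, 1]], dtype=float)
labels = np.array([0, 0, 1, 0])
print(discriminative_neighbors(X, labels, 0, k=2))
```

Restricting candidates to same-labeled documents is one plausible reading of "discriminative"; the labeled subset of a semi-supervised corpus would supply those labels.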


Summary

Introduction

The performance of document analysis and processing systems based on machine learning methods, such as classification[1][2], clustering[3][4], content analysis[5], textual similarity[6], and statistical machine translation (SMT)[7], is heavily dependent on the level of document representation (DR), as different representations may capture and disentangle different degrees of explanatory ingredients hidden in the documents[8]. Neural networks can capture meaningful latent document representations (i.e., distributed representations) with deep learning techniques, including autoencoders[16], restricted Boltzmann machines (RBMs)[17], neural topic models (NTMs)[18], and document neural autoregressive distribution estimators (DocNADEs)[19]. These methods use the word-count vector as input and synthesize it through the hidden layers of various deep neural networks. Topic models and neural networks embed latent factors or topics that preserve the salient intra-document statistical structure[19]. Although they represent an improvement in DR, such methods take a global perspective that treats the document space as Euclidean, assuming that all documents are non-discriminative, which makes the latent representation dependent on all other documents. Better representation of latent document semantics depends on modeling the local relationships among documents within a neighborhood.
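The joint objective described in the abstract — Bernoulli cross-entropy between input and reconstruction plus a squared-error term tying the reconstruction to the input's neighbors — can be sketched as below. This is a hedged illustration under stated assumptions: the function name `leae_loss`, the weighting factor `lam`, and the scaling of count vectors to [0, 1] are all assumptions, not details confirmed by the source.

```python
import numpy as np

def leae_loss(x, x_hat, neighbors, lam=0.1):
    """Illustrative joint objective (assumed form, not the paper's code):
    Bernoulli cross-entropy between input x and reconstruction x_hat, plus
    the summed squared error between x_hat and each neighbor of x.
    All vectors are assumed scaled to [0, 1]; lam is a hypothetical weight."""
    eps = 1e-12  # guard against log(0)
    bce = -np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))
    neighbor_sse = sum(np.sum((n - x_hat) ** 2) for n in neighbors)
    return bce + lam * neighbor_sse

x = np.array([1.0, 0.0, 1.0])      # binarized/scaled document vector
x_hat = np.array([0.9, 0.1, 0.8])  # decoder output
print(leae_loss(x, x_hat, [x]))
```

Minimizing the second term pulls the reconstruction toward the document's discriminative neighbors, which is one way to realize the local view that nearby documents should have correlated representations.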
