Abstract

High-level abstractions, such as semantic representations, are vital for document classification and retrieval. However, how to learn document semantic representations remains an open question in information retrieval and natural language processing. In this paper, we propose a new Hybrid Deep Belief Network (HDBN), which uses a Deep Boltzmann Machine (DBM) in the lower layers together with a Deep Belief Network (DBN) in the upper layers. The advantage of the DBM is that its undirected connections allow the states of the nodes in each layer to be sampled more reliably during weight training, and they also provide an effective way to remove noise from different document representation types; the DBN then extracts deeper abstractions of the document, enabling the model to learn a richer semantic representation. At the same time, we explore different input strategies for distributed semantic representation. Experimental results show that our model performs better when using word embeddings instead of single words.

Highlights

  • Semantic representation [1,2,3] is very important in document classification and document retrieval tasks

  • Considering the limitations of the DBN and the DBM for document representation, and taking both training time and model accuracy into account for document classification and retrieval tasks, we propose the Hybrid Deep Belief Network (HDBN), which uses a Deep Boltzmann Machine composed of simple two-layer Restricted Boltzmann Machines (RBMs) in the lower layers and a Deep Belief Network made up of two-layer RBMs in the upper layers

  • We explored the effects of different inputs on our HDBN model for extracting semantic information
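The layered structure described in the highlights can be sketched as a greedy stack of RBMs, each layer's hidden activations feeding the next. The sketch below is illustrative only: the layer sizes, learning rate, and CD-1 training are common defaults assumed here, not values taken from the paper, and the DBM-style versus DBN-style training distinction is reduced to plain layer-wise pretraining.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """A simple Restricted Boltzmann Machine trained with one step of
    contrastive divergence (CD-1). Sizes and hyperparameters are assumed."""
    def __init__(self, n_visible, n_hidden):
        self.W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible biases
        self.b_h = np.zeros(n_hidden)    # hidden biases

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0, lr=0.1):
        # Positive phase: sample hidden units from the data.
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        # Negative phase: one reconstruction step.
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        # CD-1 gradient estimates, averaged over the batch.
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        self.b_v += lr * (v0 - v1).mean(axis=0)
        self.b_h += lr * (h0 - h1).mean(axis=0)

# Illustrative layer stack: bag-of-words input down to a semantic code.
layer_sizes = [2000, 500, 250, 128]
rbms = [RBM(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]

def encode(x):
    """Propagate a document vector up the stack to its semantic code."""
    for rbm in rbms:
        x = rbm.hidden_probs(x)
    return x

# Greedy layer-wise pretraining on a toy batch of document vectors.
batch = rng.random((4, layer_sizes[0]))
for rbm in rbms:
    for _ in range(5):
        rbm.cd1_step(batch)
    batch = rbm.hidden_probs(batch)  # input for the next layer
```

After pretraining, `encode` maps any document vector to a low-dimensional code that can be used for classification or retrieval by nearest-neighbour search.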



Introduction

Semantic representation [1,2,3] is very important in document classification and document retrieval tasks. LSI [4] and pLSI [5] are two dimension-reduction methods; LSI applies Singular Value Decomposition (SVD) to the document-vector matrix and remaps it into a semantic space smaller than the original one. However, these methods can still capture only very limited relations between words. Blei et al. [6] proposed Latent Dirichlet Allocation (LDA), which can extract document topics and has shown superior performance over LSI and pLSI. LDA is popular in the field of topic modeling and is also considered an effective dimension-reduction method. Nevertheless, it has some disadvantages: the semantic features it learns are not sufficient for documents, exact inference in the directed model is intractable [7, 8], and it cannot properly handle documents of different lengths
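The SVD-based remapping that LSI performs can be shown in a few lines. The toy term-document matrix and the choice of k = 2 latent dimensions below are illustrative assumptions, not data from the paper.

```python
import numpy as np

# Toy term-document count matrix: rows are terms, columns are documents.
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 3, 1],
], dtype=float)

# LSI: factor X with SVD and keep only the top-k singular directions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2  # size of the reduced semantic space (assumed)

# Each document is remapped to a k-dimensional semantic vector.
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T
print(doc_vectors.shape)  # (4, 2): 4 documents, 2 latent dimensions
```

Similarity between documents is then computed in this k-dimensional space rather than over raw word counts, which is exactly the kind of limited, linear word-relation capture the paragraph above criticizes.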

