Integrating rich document representations for text classification

Suqi Jiang, Jason Lewris, Michael Voltmer, Hongning Wang

doi:10.1109/sieds.2016.7489319

Abstract

This paper derives high-quality information from unstructured text data by integrating rich document representations to improve machine learning text classification. Previous research has applied Neural Network Language Models (NNLMs) to document classification, and word vector representations have been used to measure semantic relationships among texts. The two have not previously been combined and shown to improve text classification performance. Our belief is that the inference and clustering abilities of word vectors, coupled with the power of a neural network, can produce more accurate classification predictions. The first phase of our work focused on word vector representations for classification purposes. This approach included analyzing two distinct text sources with pre-marked binary outcomes, creating a benchmark metric, and comparing that benchmark against a classifier that uses word vector representations in its feature space. The results showed promise: the word vector approach obtained an area under the curve (AUC) of 0.95, relative to 0.93 for the benchmark case. The second phase of the project focused on an extension of the neural network model used in phase one that represents a document in its entirety rather than word by word. Preliminary results indicated a slight improvement over the baseline model of approximately 2–3 percent.
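
As a rough illustration of the phase-one evaluation described above, the following minimal sketch builds a document feature by averaging word vectors and scores a binary classifier by area under the curve. It is not the authors' pipeline. The toy vectors, the averaging scheme, the stand-in scoring rule, and the names `TOY_VECTORS`, `document_vector`, and `auc` are all assumptions made for illustration, and the AUC is assumed to be the area under the ROC curve.

```python
from typing import Dict, List, Sequence

# Toy 2-dimensional word vectors. Hypothetical values for illustration only;
# real word vectors would be learned from a corpus.
TOY_VECTORS: Dict[str, List[float]] = {
    "great": [0.9, 0.1],
    "good": [0.8, 0.2],
    "bad": [0.1, 0.9],
    "awful": [0.0, 1.0],
    "movie": [0.5, 0.5],
}


def document_vector(tokens: Sequence[str], vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the known tokens; return zeros if none are known."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Four toy documents with pre-marked binary outcomes (1 = positive class).
docs = [
    (["great", "movie"], 1),
    (["good", "movie"], 1),
    (["bad", "movie"], 0),
    (["awful", "movie"], 0),
]
features = [document_vector(tokens, TOY_VECTORS) for tokens, _ in docs]
labels = [label for _, label in docs]

# Stand-in for a trained classifier: score = first dimension minus second.
scores = [f[0] - f[1] for f in features]
toy_auc = auc(labels, scores)
```

In practice the stand-in scoring rule would be replaced by a trained classifier, and the same `auc` function could be applied to the benchmark model's scores to compare the two on equal terms.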

Full Text