Abstract

Text representation is one of the key tasks in the field of natural language processing (NLP). Traditional feature extraction and weighting methods often rely on the bag-of-words (BoW) model, which can lack semantic information and suffers from high dimensionality and high sparsity. A popular way to address these problems is to apply deep learning methods. In this paper, feature weighting, word embedding, and topic models are combined to propose an unsupervised text representation method named the feature, probability, and word embedding method. The main idea is to obtain word vectors with the word embedding technique Word2Vec and then combine them with the TF-IDF feature weighting scheme and the topic model LDA. Compared with traditional feature engineering, the proposed method not only increases the expressive ability of the vector space model but also reduces the dimensionality of the document vector. In addition, it addresses the insufficient semantic information, high dimensionality, and high sparsity of BoW. We apply the proposed method to the task of text categorization and verify its validity.
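The exact combination scheme is detailed in the full text; the following minimal sketch only illustrates the general idea under stated assumptions. It builds a document vector from a TF-IDF-weighted average of Word2Vec word vectors concatenated with the LDA topic distribution, using gensim (version 4 or later). The toy corpus, the hyperparameters, and the concatenation step are illustrative assumptions rather than the authors' exact method.

    # Sketch only: TF-IDF-weighted Word2Vec document vectors concatenated with
    # an LDA topic distribution (assumed combination; requires gensim >= 4).
    import numpy as np
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel, TfidfModel, Word2Vec

    # Toy corpus; in practice these would be preprocessed, tokenized documents.
    docs = [
        "massive text data is created on the internet every day".split(),
        "topic models such as lda extract latent topics from documents".split(),
        "word2vec learns dense word vectors from raw text".split(),
    ]

    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    w2v = Word2Vec(sentences=docs, vector_size=50, min_count=1, seed=42)  # word vectors
    tfidf = TfidfModel(corpus)                                            # feature weighting
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3,
                   random_state=42, passes=10)                            # topic model

    def doc_vector(bow):
        """TF-IDF-weighted average of word vectors, plus LDA topic proportions."""
        weighted = np.zeros(w2v.vector_size)
        total_weight = 0.0
        for term_id, weight in tfidf[bow]:
            token = dictionary[term_id]
            if token in w2v.wv:
                weighted += weight * w2v.wv[token]
                total_weight += weight
        if total_weight > 0:
            weighted /= total_weight
        topics = np.zeros(lda.num_topics)
        for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            topics[topic_id] = prob
        return np.concatenate([weighted, topics])  # dense, low-dimensional document vector

    doc_vectors = np.vstack([doc_vector(bow) for bow in corpus])
    print(doc_vectors.shape)  # (3, 50 + 3): far smaller than a vocabulary-sized BoW vector

Unlike a BoW representation, whose dimensionality equals the vocabulary size, each document here becomes a dense vector of fixed length (embedding size plus number of topics), which is why the method reduces both dimensionality and sparsity.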

Highlights

  • Massive amounts of data are created on the Internet every day

  • Latent Dirichlet allocation (LDA) represents each topic as a distribution over words; a topic can be interpreted by finding the words with the highest probability and printing them out (see the sketch after this list)

  • To focus on the text representation method, this paper adopts a single-label classification experiment
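As a brief illustration of the second highlight, the snippet below continues from the sketch after the abstract (reusing its `lda` model) and prints each topic's highest-probability words; the choice of `topn=5` is arbitrary.

    # Continues from the sketch above: print each LDA topic's top words.
    for topic_id in range(lda.num_topics):
        top_words = lda.show_topic(topic_id, topn=5)  # list of (word, probability) pairs
        formatted = ", ".join(f"{word} ({prob:.3f})" for word, prob in top_words)
        print(f"Topic {topic_id}: {formatted}")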


Introduction

Massive amounts of data are created on the Internet every day, and processing these data with computers is clearly the best choice. Transforming texts written in natural language into forms that computers can understand has long been one of the core goals of NLP. Labeling data sets is time consuming and labor intensive, generally requiring a great deal of manual work, so unsupervised text representation is becoming more and more practical. Many tasks need it, such as text classification and sentiment analysis [1,2,3].
