Dimensionality Reduction of Distributed Vector Word Representations and Emoticon Stemming for Sentiment Analysis

Brian Dickinson, Michael Ganger, Wei Hu

doi:10.4236/jdaip.2015.34015

Open Access

https://doi.org/10.4236/jdaip.2015.34015

Abstract

Social media platforms such as Twitter and the Internet Movie Database (IMDb) contain a vast amount of data which have applications in predictive sentiment analysis for movie sales, stock market fluctuations, brand opinion, or current events. Using a dataset taken from IMDb by Stanford, we identify some of the most significant phrases for identifying sentiment in a wide variety of movie reviews. Data from Twitter are especially attractive due to Twitter’s real-time nature through its streaming API. Effectively analyzing these data in a streaming fashion requires efficient models, which may be improved by reducing the dimensionality of input vectors. One way this has been done in the past is by using emoticons; we propose a method for further reducing these features by identifying common structure in emoticons with similar sentiment. We also examine the gender distribution of emoticon usage, finding that the use of certain emoticons differs disproportionately between males and females. Despite the roughly equal gender distribution on Twitter, emoticon usage is predominantly female. Furthermore, we find that distributed vector representations, such as those produced by Word2Vec, may be reduced through feature selection. This analysis was done on a manually labeled sample of 1000 tweets from a new dataset, the Large Emoticon Corpus, which consisted of about 8.5 million tweets containing emoticons and was collected over a five-day period in May 2015. Additionally, using the common structure of similar emoticons, we are able to characterize positive and negative emoticons using two regular expressions which account for over 90% of emoticon usage in the Large Emoticon Corpus.

Highlights

In the past few years, the growth of social media platforms has brought with it a wealth of unprocessed data, containing interesting information that may be inferred from the interactions of users.
This study introduced the Large Emoticon Corpus, which provided insights into emoticon usage on Twitter, both in the frequency of individual emoticon use and in usage by user gender.
We showed that positive and negative emoticons may be deconstructed into more basic forms and potentially used to classify positive and negative sentiment by simple regular expressions.
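
As a rough illustration of the last highlight, the sketch below collapses Western-style emoticons into two tokens using two regular expressions. The patterns, the token names EMO_POS and EMO_NEG, and the function stem_emoticons are assumptions made for this example; they are not the expressions reported in the paper, which are not reproduced on this page.

```python
import re

# Hypothetical building blocks: eyes, an optional nose, then a mouth.
EYES = r"[:;=]"
NOSE = r"[-^']?"
POS_MOUTH = r"[)\]}DPp3]+"   # smiling mouths, possibly repeated, e.g. :))) 
NEG_MOUTH = r"[(\[{/\\|]+"   # frowning or skeptical mouths, e.g. :-( or :/

# The look-arounds keep the patterns from firing inside words, times or URLs.
POSITIVE = re.compile(r"(?<!\w)" + EYES + NOSE + POS_MOUTH + r"(?!\w)")
NEGATIVE = re.compile(r"(?<!\w)" + EYES + NOSE + NEG_MOUTH + r"(?!\w)")


def stem_emoticons(text):
    """Replace each matched emoticon with a single sentiment token."""
    text = POSITIVE.sub(" EMO_POS ", text)
    text = NEGATIVE.sub(" EMO_NEG ", text)
    # Collapse the extra whitespace introduced by the substitutions.
    return " ".join(text.split())


if __name__ == "__main__":
    print(stem_emoticons("great movie :-) loved it :D"))
    print(stem_emoticons("so sad :( see http://t.co/x"))
```

Mapping many surface forms onto two tokens is what shrinks the feature space: a classifier sees one positive and one negative emoticon feature instead of one feature per variant. Reversed forms such as (: and non-Western emoticons are deliberately left out of this sketch.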

Summary

Introduction

In the past few years, the growth of social media platforms has brought with it a wealth of unprocessed data, containing interesting information that may be inferred from the interactions of users. The main difficulty with using tweets to extract sentiment is that they lack direct labels which convey the overall feeling or leaning of the text. This has led to many different ways of classifying tweets, involving both supervised and unsupervised machine learning. Analyzing Twitter using natural language processing introduces interesting problems: each tweet is limited to a maximum of 140 characters, so normal grammatical and syntactical conventions are mostly ignored or circumvented. As a consequence, models of vocabulary and sentence structure must be rebuilt according to the unique style which has developed on Twitter. A section detailing the results of these operations is followed by concluding remarks and possible avenues of future research.

Related Works

Sentiment Analysis

Large Emoticon Corpus

Stanford Large Movie Review Dataset

Textual Representation

Distributed Vector

Bag of Words

Feature Selection

Gender

Stanford IMDB

Bag-of-Words CDSSM DSSM Doc2Vec

Dimensionality Reduction of Distributed Vectors
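
The abstract states that distributed vectors, such as those produced by Word2Vec, may be reduced through feature selection, but this page does not say which selection criterion was used. The sketch below is therefore only one plausible reading: it scores each vector dimension by how far apart the two class means are relative to their spread, and keeps the top k dimensions. The function names select_dimensions and reduce_vectors and the scoring rule are assumptions for illustration.

```python
from statistics import mean, pstdev


def select_dimensions(vectors, labels, k):
    """Return the sorted indices of the k dimensions that best separate
    label 1 from label 0 (assumed scoring rule, not the paper's)."""
    pos = [v for v, y in zip(vectors, labels) if y == 1]
    neg = [v for v, y in zip(vectors, labels) if y == 0]
    scores = []
    for d in range(len(vectors[0])):
        p = [v[d] for v in pos]
        n = [v[d] for v in neg]
        # Small constant avoids division by zero for constant dimensions.
        spread = pstdev(p) + pstdev(n) + 1e-9
        scores.append(abs(mean(p) - mean(n)) / spread)
    ranked = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
    return sorted(ranked[:k])


def reduce_vectors(vectors, keep):
    """Project every vector onto the selected dimensions."""
    return [[v[d] for d in keep] for v in vectors]


if __name__ == "__main__":
    # Toy 3-dimensional "document vectors"; only dimension 0 tracks the label.
    vectors = [[0.9, 0.1, 0.5], [0.8, 0.2, 0.4], [0.1, 0.15, 0.5], [0.2, 0.1, 0.45]]
    labels = [1, 1, 0, 0]
    keep = select_dimensions(vectors, labels, k=1)
    print(keep, reduce_vectors(vectors, keep))
```

In practice the input would be real embedding vectors with hundreds of dimensions, and k would be chosen by checking classifier accuracy on held-out data.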

Conclusions

Journal: Journal of Data Analysis and Information Processing
Publication Date: Jan 1, 2015
Citations: 17
License type: CC BY 4.0