Abstract

There is often a need to perform sentiment classification in a particular domain where no labeled document is available. Although we could make use of a general-purpose off-the-shelf sentiment classifier or a pre-built one for a different domain, the effectiveness would be inferior. In this paper, we explore the possibility of building domain-specific sentiment classifiers with unlabeled documents only. Our investigation indicates that in the word embeddings learned from the unlabeled corpus of a given domain, the distributed word representations (vectors) for opposite sentiments form distinct clusters, though those clusters are not transferable across domains. Exploiting such a clustering structure, we are able to utilize machine learning algorithms to induce a quality domain-specific sentiment lexicon from just a few typical sentiment words (“seeds”). An important finding is that simple supervised learning algorithms based on linear models (such as linear SVM) can actually work better than more sophisticated semi-supervised/transductive learning algorithms, which represent the state-of-the-art technique for sentiment lexicon induction. The induced lexicon could be applied directly in a lexicon-based method for sentiment classification, but higher performance could be achieved through a two-phase bootstrapping method, which first uses the induced lexicon to assign positive/negative sentiment scores to unlabeled documents, and then uses those documents found to have clear sentiment signals as pseudo-labeled examples to train a document sentiment classifier via supervised learning algorithms (such as LSTM). On several benchmark datasets for document sentiment classification, our end-to-end pipelined approach, which is overall unsupervised (except for a tiny set of seed words), outperforms existing unsupervised approaches and achieves an accuracy comparable to that of fully supervised approaches.
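The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the 2-d embedding values, the seed list, the sample documents, the function names, and the `min_margin` threshold are all hypothetical, and a plain perceptron stands in for the linear SVM only to keep the example dependency-free. The toy vocabulary contains sentiment words only; a real lexicon would also have to deal with neutral words. The final step, training a supervised classifier (such as an LSTM) on the pseudo-labeled documents, is not shown.

```python
# Toy 2-d "embeddings" (hypothetical values). In practice these would be
# learned from the unlabeled corpus of the target domain.
EMBEDDINGS = {
    "good": (0.9, 0.2), "great": (0.8, 0.3),
    "excellent": (1.0, 0.1), "tasty": (0.7, 0.4),
    "bad": (-0.9, 0.2), "awful": (-0.8, 0.3),
    "terrible": (-1.0, 0.1), "bland": (-0.7, 0.4),
}

# A handful of seed words with known polarity (+1 positive, -1 negative).
SEEDS = {"good": 1, "great": 1, "bad": -1, "awful": -1}


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(examples, epochs=20):
    """Fit a linear separator w.x + b on (vector, label) pairs.

    A perceptron is used here as a simple stand-in for a linear SVM.
    """
    w = [0.0] * len(examples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            if y * (dot(w, x) + b) <= 0:  # misclassified: nudge the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b


def induce_lexicon(embeddings, seeds):
    """Assign each embedded word the sign of the linear model's score."""
    examples = [(embeddings[word], label) for word, label in seeds.items()]
    w, b = train_perceptron(examples)
    return {word: (1 if dot(w, vec) + b > 0 else -1)
            for word, vec in embeddings.items()}


def pseudo_label(docs, lexicon, min_margin=2):
    """Keep only documents whose summed lexicon score is clearly polar."""
    labeled = []
    for doc in docs:
        score = sum(lexicon.get(tok, 0) for tok in doc.lower().split())
        if abs(score) >= min_margin:
            labeled.append((doc, 1 if score > 0 else -1))
    return labeled


DOCS = [
    "the pizza was excellent and tasty",
    "terrible service and bland food",
    "good crust but awful sauce",   # mixed signal: left unlabeled
    "we ordered pizza",             # no sentiment words: left unlabeled
]

lexicon = induce_lexicon(EMBEDDINGS, SEEDS)
pseudo_labeled = pseudo_label(DOCS, lexicon)
for doc, label in pseudo_labeled:
    print(label, doc)
```

Only the documents with a clear sentiment signal are kept as pseudo-labeled examples; the mixed and neutral ones are set aside rather than forced into a class, which is the point of the second bootstrapping phase.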

Highlights

  • Sentiment analysis (Liu, 2015) is a popular research topic which has a wide range of applications, such as summarizing customer reviews, monitoring social media, and predicting stock market trends (Bollen et al., 2011).

  • How far can we go in sentiment classification for a new domain, given only unlabeled data? This paper presents our exploration towards answering this research question.

  • We have formulated the cluster hypothesis for sentiment analysis and verified that, in general, it holds for word embeddings within a specific domain but not across domains.


Summary

Introduction

Sentiment analysis (Liu, 2015) is a popular research topic which has a wide range of applications, such as summarizing customer reviews, monitoring social media, and predicting stock market trends (Bollen et al., 2011). There have been some studies on domain adaptation or transfer learning for sentiment classification (Blitzer et al., 2007; Tan et al., 2009; Pan et al., 2010; Glorot et al., 2011; Yoshida et al., 2011; Bollegala et al., 2013; Xia et al., 2013; Yang and Eisenstein, 2015), but they still require a large amount of labeled training data from a fairly similar source domain, which is not always feasible. Those algorithms tend to be computationally expensive and time-consuming (Mohammad and Turney, 2010; Fast et al., 2016).

Related Work
Domain-Specific Sentiment Word Embedding
Domain-Specific Sentiment Lexicon Induction
Domain-Specific Sentiment Classification of Documents
Sentiment Classification of Long Texts
Sentiment Classification of Short Texts
Detecting Neutral Sentiment
Findings
Conclusions
