Abstract

Sentiment analysis is the process of identifying people's attitudes and emotional states from the language they use on social websites and other sources. The main aim is to identify a set of potential features in a review and extract the opinion expressions for those features by making full use of their associations. Twitter has now become a routine channel for people around the world to post thousands of reactions and opinions on every topic, every second of every day. It is like one big psychological database that is constantly being updated and can be used to analyze people's sentiments. Hadoop is one of the best options available for Twitter data sentiment analysis, and it also handles distributed big data, streaming data, text data, and more. This paper provides an efficient mechanism to perform sentiment analysis/opinion mining on Twitter data over the Hortonworks Data Platform, which provides Hadoop on Windows, with the assistance of Apache Flume, Apache HDFS, and Apache Hive.

Highlights

  • We live in an era where the textual data on the Internet is growing at a very rapid pace, and many companies are trying to use this huge amount of data to extract people's views towards their products

  • Why sentiment analysis? Every day, an enormous amount of data is created on social networks, blogs, and other media and diffused into the World Wide Web (WWW)

  • Though Hadoop primarily supports the Java programming language, other programming languages can be used with Hadoop to benefit from its MapReduce technique


Summary

Introduction

We live in an era where the textual data on the Internet is growing at a very rapid pace, and many companies are trying to use this huge amount of data to extract people's views towards their products. Hadoop: Hadoop is an Apache open-source framework used for the distributed storage and processing of large Big Data datasets using the MapReduce programming model. It consists of computer clusters built from commodity hardware. MapReduce: MapReduce is a parallel programming model used by Hadoop for writing distributed applications that efficiently process large amounts of data (multi-terabyte datasets) on large clusters (thousands of nodes) of commodity hardware in a reliable and fault-tolerant manner. For every node (commodity hardware/system) in a cluster there is a DataNode; these DataNodes manage the data storage on their node.
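The MapReduce model described above can be illustrated in miniature. The following is a minimal, single-process Python sketch of a map and reduce step applied to tweet sentiment counting; the toy sentiment lexicon and sample tweets are illustrative assumptions, not the paper's actual dictionary or data, and a real Hadoop job would distribute these phases across cluster nodes:

```python
from collections import defaultdict

# Toy sentiment lexicon (illustrative assumption, not the paper's dictionary)
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "terrible", "hate"}

def mapper(tweet):
    """Map phase: emit a (sentiment, 1) pair for each opinion word found."""
    for word in tweet.lower().split():
        if word in POSITIVE:
            yield ("positive", 1)
        elif word in NEGATIVE:
            yield ("negative", 1)

def reducer(pairs):
    """Reduce phase: sum the counts grouped by sentiment key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Illustrative input; in a real pipeline tweets would arrive via Flume into HDFS
tweets = ["I love this phone it is great", "terrible battery and bad screen"]
pairs = [pair for tweet in tweets for pair in mapper(tweet)]
print(reducer(pairs))  # -> {'positive': 2, 'negative': 2}
```

The same mapper/reducer split is what lets Hadoop parallelize the work: mappers run independently on data blocks held by each DataNode, and reducers aggregate the shuffled intermediate pairs.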

Problem Statement
Literature Review
Proposed Approach
Findings
Conclusion
