Normalization of Unstructured and Informal Text in Sentiment Analysis

Muhammad Javed, Shahid Kamal

doi:10.14569/ijacsa.2018.091011

Abstract

Sentiment analysis is a natural language processing problem that deals with extracting and analyzing public sentiment shared about target entities on microblogging websites. The field has gained great attention because of the large volume of textual content available to support decision making. Sentiment analysis has many application areas, such as market analysis, service analysis, show business, movies, and sports; even the popularity and acceptance rate of political policies can be predicted with sentiment analysis systems. Although a tremendous volume of opinionated text is available, it is unstructured and noisy, which prevents sentiment classifiers from achieving good results. Normalization is the process of cleaning noise from unstructured text for sentiment analysis. In this study we propose a mechanism for the normalization of informal and unstructured text. The proposed mechanism comprises four essential phases: Noise Reduction, Part-of-Speech Tagging, Stop Word Removal, and Stemming and Lemmatization. Numerous experiments are performed on a Twitter data set with unsupervised lexicons and dictionaries. Python and the Natural Language Toolkit are used to perform all four steps. The study demonstrates that utilizing and normalizing informal tokens in tweets improved the overall classification accuracy from 75.42 to 82.357.

Highlights

Text mining is a computer-assisted process introduced to help business organizations by providing effective decision-making answers and future trends
This study presents a novel mechanism of text normalization for the classification of informal, opinion-bearing text
This research proposes a novel mechanism for normalizing publicly available opinionated data for the sake of sentiment analysis

Summary

Introduction

Text mining is a computer-assisted process introduced to help business organizations by providing effective decision-making answers and future trends. Sites that allow short texts for chatting, communicating, and exchanging views about users' interests are considered microblogs. Twitter is the most popular microblogging site; it allows its users to publish short messages (tweets) for communication. The rapid growth of social communication devices and channels has produced new challenges for observers and analysts. Online users publish their views and opinions in a distinctive and informal way that is not directly interpretable by machine learning systems. They adopt acronyms, emoticons, and other microblogging features for communication. Sentiment analysis cannot be performed directly on these published reviews; instead, it requires a massive effort of input text preparation. The rest of the article is organized as follows: Section 2 presents related work, Section 3 the method, Section 4 the results and discussion, and Section 5 the conclusion and future work.
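To make the idea of input text preparation concrete, the following is a minimal sketch of noise reduction and stop word removal for a single tweet. It is not the authors' implementation: the function name normalize_tweet, the regular expressions, and the tiny acronym, emoticon, and stop word lexicons are all assumptions chosen for illustration, and it uses only the Python standard library rather than the Natural Language Toolkit. Part-of-speech tagging, stemming, and lemmatization are left out because they depend on external language resources.

```python
import re

# Hypothetical toy lexicons for illustration only; the dictionaries
# actually used in the study are not given in this section.
ACRONYMS = {"gr8": "great", "u": "you", "thx": "thanks"}
EMOTICONS = {":)": "happy", ":-)": "happy", ":(": "sad", ":-(": "sad"}
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "to", "of", "and", "this"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # any character repeated 3+ times


def normalize_tweet(text):
    """Return a list of cleaned, lowercased tokens for one tweet."""
    text = URL_RE.sub(" ", text)       # drop links
    text = MENTION_RE.sub(" ", text)   # drop @user mentions
    text = text.replace("#", " ")      # keep the hashtag word, drop the symbol
    tokens = []
    for raw in text.split():
        if raw in EMOTICONS:           # map emoticons before stripping punctuation
            tokens.append(EMOTICONS[raw])
            continue
        word = REPEAT_RE.sub(r"\1\1", raw.lower())  # "sooooo" -> "soo"
        word = re.sub(r"[^a-z0-9']", "", word)      # strip leftover punctuation
        word = ACRONYMS.get(word, word)             # expand known acronyms
        if word and word not in STOP_WORDS:         # stop word removal
            tokens.append(word)
    return tokens


tweet = "@bob this movie is gr8 :) sooooo goood!!! http://t.co/xyz #MustWatch"
print(normalize_tweet(tweet))
# ['movie', 'great', 'happy', 'soo', 'good', 'mustwatch']
```

Collapsing repeated characters to two is a deliberate simplification: it turns "goood" into "good" but leaves "soo", so a dictionary lookup would still be needed to resolve such tokens fully.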

Methods

Results

Conclusion