Abstract

Sentiment analysis and spam detection of social media text messages are two challenging data analysis tasks because their feature vectors are sparse and high-dimensional. Learning classifier systems (LCS) are rule-based evolutionary computing systems with limited capability to handle real-valued, sparse, high-dimensional big data sets. LCS techniques use interval-based representations to handle real-valued feature vectors. In the work presented here, the interval-based representation is replaced by genetic-programming-based tree-like structures to classify high-dimensional real-valued text feature vectors. Multiple experiments are conducted on different social media text data sets, i.e., tweets, movie reviews, Amazon and Yelp reviews, and SMS and email spam messages, to evaluate the proposed scheme. Real-valued feature vectors are generated from these data sets using term frequency-inverse document frequency (TF-IDF) and/or sentiment-lexicon-based features. Results show that the new encoding scheme outperforms interval-based representations on both small and large social media text data sets.
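
As a rough illustration of the contrast drawn in the abstract, the sketch below compares an interval-based rule condition with a tree-structured one. It is not the authors' implementation: the class names, the choice of arithmetic operators, and the "tree output greater than zero" matching rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from operator import add, sub
from typing import Callable, Sequence, Tuple, Union


@dataclass
class IntervalCondition:
    """Interval-based condition: one (lower, upper) pair per feature."""
    bounds: Sequence[Tuple[float, float]]

    def matches(self, x: Sequence[float]) -> bool:
        # Every feature must fall inside its own interval.
        return all(lo <= v <= hi for (lo, hi), v in zip(self.bounds, x))


@dataclass
class Feature:
    """Leaf node that reads one feature value by index."""
    index: int

    def evaluate(self, x: Sequence[float]) -> float:
        return x[self.index]


@dataclass
class Const:
    """Leaf node holding a constant."""
    value: float

    def evaluate(self, x: Sequence[float]) -> float:
        return self.value


@dataclass
class Op:
    """Internal node applying a binary function to two subtrees."""
    fn: Callable[[float, float], float]
    left: "Node"
    right: "Node"

    def evaluate(self, x: Sequence[float]) -> float:
        return self.fn(self.left.evaluate(x), self.right.evaluate(x))


Node = Union[Feature, Const, Op]


@dataclass
class TreeCondition:
    """Tree-based condition; matching when the output is > 0 is an assumption."""
    root: Node

    def matches(self, x: Sequence[float]) -> bool:
        return self.root.evaluate(x) > 0.0


# Hypothetical 4-dimensional TF-IDF-like vector (mostly zeros, i.e. sparse).
x = [0.0, 0.42, 0.0, 0.13]

# The interval condition needs a bound for all four features.
interval_rule = IntervalCondition([(0.0, 1.0), (0.4, 0.5), (0.0, 1.0), (0.0, 1.0)])

# The tree condition references only the two features it uses: x[1] + x[3] - 0.5.
tree_rule = TreeCondition(Op(sub, Op(add, Feature(1), Feature(3)), Const(0.5)))

print(interval_rule.matches(x), tree_rule.matches(x))
```

The point of the sketch is the difference in size: the interval condition stores a pair of bounds for every dimension, whereas the tree only mentions the features it actually uses, which is one plausible reason a tree encoding may suit sparse, high-dimensional text vectors.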

