Sentiment Analysis in Tamil Texts using k-means and k-nearest neighbour

Sajeetha Thavareesan, Sinnathamby Mahesan

doi:10.1109/iciafs52090.2021.9605839

Abstract

Sentiment analysis is an active research area in Natural Language Processing that aims to classify the sentiment expressed in written text as positive or negative. In this paper, we propose a method that uses k-means clustering and a k-nearest neighbour classifier to predict the sentiment expressed in Tamil texts. In the proposed method, the corpus is clustered in two different ways: clustering with class-wise information and clustering without class-wise information. These two clustering-based techniques are also applied to m-folds of the training samples, giving four distinct approaches. Each of the four approaches is run with BoW and with fastText word embeddings, for eight experiments in total. Each approach is tested with varying numbers of centroids (k_c: 1..10), nearest neighbours (k_n: 1..k_c) and folds (m_f: 1..10) to study their influence on accuracy. Accuracy is found to increase with k_c, and to be higher for fastText than for BoW. The method using fastText and class-wise clustering with m-folds of the training set achieved 89.87% accuracy on UJ_MovieReviews, a corpus that we created from various sources.
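The class-wise variant described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: each text is already represented as a numeric vector (for example BoW counts or averaged fastText embeddings), distances are Euclidean, and a test vector takes the majority label of its k_n nearest centroids. The function names (kmeans, fit_classwise_centroids, predict) and the toy vectors are hypothetical, and the m-fold variant is not shown.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns min(k, len(points)) centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, min(k, len(points)))
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the previous centroid when a cluster receives no points.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


def fit_classwise_centroids(vectors, labels, k_c):
    """Cluster each class separately; tag every centroid with its class."""
    labelled = []
    for cls in sorted(set(labels)):
        members = [v for v, y in zip(vectors, labels) if y == cls]
        labelled += [(c, cls) for c in kmeans(members, k_c)]
    return labelled


def predict(labelled_centroids, vector, k_n):
    """Majority vote over the k_n centroids nearest to the vector."""
    nearest = sorted(labelled_centroids,
                     key=lambda item: sq_dist(vector, item[0]))[:k_n]
    votes = Counter(cls for _, cls in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-dimensional vectors standing in for real text embeddings (assumed).
train = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0],
         [0.1, 0.9], [0.2, 0.8], [0.0, 1.0]]
labels = ["positive"] * 3 + ["negative"] * 3

centroids = fit_classwise_centroids(train, labels, k_c=2)
prediction = predict(centroids, [0.85, 0.15], k_n=1)
```

Because the classifier searches only the centroids rather than every training sample, the number of distance computations per test text is bounded by k_c times the number of classes. That is one plausible reason for combining the two techniques, though the abstract does not state the motivation.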

Full Text