Abstract
Nowadays, more and more people are sharing and expressing their feelings through social media platforms such as Twitter, Facebook and YouTube. Sentiment analysis is a process that explores, identifies and categorizes content. People that belong to multilingual communities tend to communicate through multiple regional languages. This type of text is represented using different languages and is known as code-mixed data. The proposed system utilizes a code-mixed data set of Tamil–English languages from FIRE 2021. To handle the class imbalance problem, re-sampling is performed and the impact is analyzed. Pre-processing of input text data can play a vital role in code-mixed data classification by removing unnecessary content. This research work aims to explore the impact of pre-processing on Tamil code-mixed data by employing various pre-processing steps such as emojis removal; repeated characters removal; and punctuations, symbols and number removal. The pre-processed text is applied to traditional machine learning, deep learning, transfer learning and hybrid deep learning models, and the accuracy of all these models before and after pre-processing is compared. Traditional machine learning models depend on various weighting schemes for the feature selection process. The main objective of this research work is to build hybrid deep learning models combining Convolutional Neural Network (CNN) with Long–Short Term Memory (LSTM) and Convolutional Neural Network (CNN) with Bi-Long–Short Term Memory (LSTM) in order to capture the local and global features implicitly from the code-mixed data for conducting sentiment analysis, and then classify the Tamil code-mixed data into positive, negative, mixed_feelings and unknown_state. The performance of hybrid deep learning models were evaluated by comparing them with state-of-art methods that include various traditional machine learning techniques such as random forest, multinomial Naive Bayes, logistic regression and linear Support Vector Classification (SVC); deep learning techniques such as LSTM, BiLSTM, BiGRU (Bidirectional Gated Recurrent Unit) and CNN; and a transfer learning method, IndicBERT. This research work also summarizes the precision, recall, F1-score, accuracy, macro-average, weighted-average and confusion matrix for all mentioned models. The result indicates that among all the different models employed, the hybrid deep learning model, especially the CNN+BiLSTM model performs better, with an accuracy of 0.66 with preprocessed Tamil code-mixed data.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.