Advances in deep learning methods and the availability of large corpora and data sets have led to a substantial increase in the performance of Natural Language Processing (NLP) methods, resulting in successful NLP applications for day-to-day tasks such as language translation, voice-to-text, grammar checking, and sentiment analysis. These advances have enabled well-resourced languages to adapt to the digital era, while the gap for low-resource languages has widened. This research work explores the suitability of recent advances in NLP for Tamil, a low-resource language spoken mainly in South India, Sri Lanka, and Malaysia. A literature survey found no comprehensive study of the effect of recent NLP advances on Tamil text. To fill this gap, this research work analysed the performance of deep learning based text representation and classification approaches, namely word embeddings, Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN), for Tamil text classification tasks. Pre-trained Word2Vec and FastText word vectors of different dimensions were built for Tamil, and their effectiveness for text classification was evaluated. The study found that the pre-trained 300-dimensional FastText word vectors performed better than the other pre-trained word vectors for Tamil text classification. Further, four simple hybrid CNN and Bi-GRU models were proposed for Tamil text classification, and their performance was evaluated. The study found that the hybrid CNN and Bi-GRU models perform better than classical machine learning models, individual CNN and RNN models, and the Multilingual BERT model.
These results confirm that embeddings learned jointly with deep learning architectures such as CNN and RNN can achieve remarkable results for Tamil text classification, showing that deep learning approaches can be successful for NLP on Tamil text.