Abstract
As data generation expands, an increasing number of natural language processing (NLP) tasks need to be solved, and word representation plays a vital role in solving them. Computational word embeddings are well established for high-resource languages, but low-resource languages such as Bangla have so far had very limited models, toolkits, and datasets available. Considering this, in this paper an enhanced BanglaFastText word embedding model is developed in Python, comprising two large pre-trained Bangla FastText models (Skip-gram and CBOW). These models were trained on a large collected Bangla corpus of around 20 million data points, where each paragraph of text is treated as one data point. BanglaFastText outperformed Facebook's FastText by a significant margin. To evaluate and analyze the pre-trained models, the proposed work performed text classification on three popular Bangla text datasets, building models with several classical machine learning approaches as well as a deep neural network. The evaluations showed superior performance over existing word embedding techniques and Facebook's pre-trained Bangla FastText model for Bangla NLP; the models also achieve excellent results compared with the original work on these datasets. A Python toolkit is proposed that provides convenient access to the models for word embedding; for obtaining semantic relationships word-by-word or sentence-by-sentence; for sentence embedding usable with classical machine learning approaches; and for unsupervised fine-tuning on any Bangla linguistic dataset.
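The toolkit's own API is not reproduced here. As a minimal sketch, assuming the released Skip-gram model is distributed as a standard FastText binary (the filename bangla_fasttext_skipgram.bin is a placeholder, not the paper's), the official fasttext Python bindings can load it and expose the capabilities the abstract lists: word vectors, sentence vectors, and word-level semantic relationships.

import fasttext

# Load a pre-trained Bangla FastText binary (placeholder filename; any
# standard FastText .bin model loads the same way).
model = fasttext.load_model("bangla_fasttext_skipgram.bin")

# Word embedding: a dense vector for a single Bangla word.
vec = model.get_word_vector("বাংলা")

# Sentence embedding: an averaged, normalized vector for a whole sentence,
# directly usable as a feature for classical machine learning models.
sent_vec = model.get_sentence_vector("আমি বাংলায় কথা বলি")

# Semantic relationships: the k nearest words by cosine similarity.
print(model.get_nearest_neighbors("বাংলা", k=5))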
Highlights
This paper proposes two BanglaFastText word embedding models (Skip-gram [6] and CBOW), trained on the developed BanglaLM corpus; they outperform the existing pre-trained Facebook FastText [7] model and traditional vectorizer approaches such as Word2Vec (see the training sketch after this list)
Classical machine learning models as well as deep neural networks based on long short-term memory (LSTM) and convolutional neural networks (CNN) were used to perform classification, with the vector representations as features (see the classification sketch after this list)
Two pre-trained Bangla FastText word embedding models are presented, along with a toolbox; the models were trained on a huge Bangla corpus that includes both organized and non-organized datasets
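The paper does not publish its training script. As a rough sketch of the two training objectives named above, assuming a tokenized corpus (the sentences and all hyperparameters below are illustrative, not the paper's), gensim's FastText implementation can train both variants:

from gensim.models import FastText

# Illustrative corpus: one tokenized sentence per item. In practice this
# would be the BanglaLM corpus, with one paragraph per data point.
corpus = [
    ["আমি", "বাংলায়", "কথা", "বলি"],
    ["বাংলা", "একটি", "সমৃদ্ধ", "ভাষা"],
]

# sg=1 selects the Skip-gram objective; sg=0 selects CBOW.
# vector_size, window, min_count, and epochs are illustrative values only.
skipgram = FastText(sentences=corpus, vector_size=300, window=5,
                    min_count=1, sg=1, epochs=5)
cbow = FastText(sentences=corpus, vector_size=300, window=5,
                min_count=1, sg=0, epochs=5)

# Subword information lets FastText embed out-of-vocabulary words as well.
print(skipgram.wv.most_similar("বাংলা", topn=3))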
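As a sketch of the classification setup, where sentence vectors serve as features (the model path, sentences, and labels below are placeholders), any scikit-learn classifier can stand in for the classical approaches the paper evaluates:

import numpy as np
import fasttext
from sklearn.linear_model import LogisticRegression

model = fasttext.load_model("bangla_fasttext_skipgram.bin")  # placeholder path

# Placeholder labelled data: Bangla sentences with binary sentiment labels.
texts = ["চলচ্চিত্রটি চমৎকার ছিল", "বইটি একেবারেই ভালো লাগেনি"]
labels = [1, 0]

# Each sentence becomes a fixed-length dense feature vector.
X = np.vstack([model.get_sentence_vector(t) for t in texts])

# Logistic regression here is one example of a classical classifier that
# can consume these features; LSTM/CNN models would consume the word
# vectors sequence-wise instead.
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))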
Summary
Word representation, the vector depiction of words, has been demonstrated to achieve major results in language modeling and natural language processing (NLP) activities. Word embeddings capture both the semantic and syntactic information of words and can be used to measure word similarity in information retrieval (IR) [2] and NLP applications [3]. Because major public resources and benchmarks exist mainly for English and other resource-rich languages, most existing research is restricted to them. Bangla, by contrast, is the sixth most widely spoken language in the world.
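As a small illustration of measuring word similarity with embeddings (the model path and word pair are arbitrary examples, not from the paper), cosine similarity between two word vectors can be computed directly:

import numpy as np
import fasttext

model = fasttext.load_model("bangla_fasttext_skipgram.bin")  # placeholder path

def cosine(u, v):
    # Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(model.get_word_vector("ভাষা"), model.get_word_vector("বাংলা")))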