Abstract

Many natural language processing (NLP) tasks, including machine translation, document classification, information retrieval, document clustering, and question answering, rely on distributional word representations: low-dimensional dense vectors known as word embeddings, which may be contextual or non-contextual. This work examines several non-contextual embedding techniques (Word2Vec, GloVe, and FastText) under multiple hyper-parameter settings, together with contextual approaches such as BERT, IndicBERT, and SahojBERT (a Bengali counterpart of BERT). Embeddings pre-trained on general corpora, such as Wikipedia dumps and Common Crawl, are widely used, including for Indian languages, but they yield limited accuracy on common NLP tasks such as sentiment analysis, news category classification, and news category clustering, especially for Bengali. The central finding of this research is that embeddings trained on a domain-related corpus are of higher quality and substantially outperform general pre-trained embeddings on tasks such as news category classification, word similarity, and document clustering with algorithms like KMeans and KMedoids. A single deep neural network model was used across all experiments and findings reported here. The results indicate that embeddings derived from a corpus related to the domain of the target task consistently outperform embeddings obtained from a general corpus or from publicly available pre-trained embeddings for Bengali; for example, embeddings built from a news corpus benefit news category classification, and embeddings built from a sentiment-related corpus benefit sentiment analysis. Because labeled datasets are scarce, this work also demonstrates that unsupervised approaches such as clustering can be applied effectively, producing strong results on news category clustering. We report the performance of our embeddings in several scenarios, including news category classification and document clustering, where embeddings created from a domain-related corpus show clearly better results than those from a general corpus such as a Wikipedia dump.
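The abstract names Word2Vec and FastText among the non-contextual techniques trained with multiple hyper-parameters, but the exact training configuration is not given here. The following is therefore only a minimal sketch of how domain-specific embeddings could be trained with gensim; the corpus file `bengali_news.txt`, the hyper-parameter values, and the query word are illustrative assumptions, not the authors' settings.

```python
# Minimal sketch (assumptions flagged in comments): training domain-specific
# word embeddings with gensim on a Bengali news corpus.
from gensim.models import Word2Vec, FastText

class LineSentences:
    """Stream whitespace-tokenized sentences from a text file, one per line."""
    def __init__(self, path):
        self.path = path
    def __iter__(self):  # restartable, so gensim can make multiple passes
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                tokens = line.split()
                if tokens:
                    yield tokens

corpus = LineSentences("bengali_news.txt")  # hypothetical domain corpus file

# Skip-gram Word2Vec; the hyper-parameter values are illustrative only.
w2v = Word2Vec(corpus, vector_size=300, window=5, min_count=5, sg=1, epochs=10)

# FastText additionally models character n-grams, which can help with the
# rich morphology of Bengali.
ft = FastText(corpus, vector_size=300, window=5, min_count=5, sg=1, epochs=10)

# Word-similarity query against the trained vectors (guarded for vocabulary).
query = "খবর"  # "news"; an assumed example token
if query in w2v.wv:
    print(w2v.wv.most_similar(query, topn=5))
```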
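The abstract states that a single deep neural network model was used in all experiments but does not describe its architecture. The sketch below is therefore a hypothetical feed-forward classifier over fixed-size document vectors, written with Keras; the layer sizes, dropout rate, and number of news categories are assumptions for illustration, not the paper's actual model.

```python
# Hypothetical feed-forward classifier over document vectors (e.g., averaged
# word embeddings); not the paper's actual architecture.
import numpy as np
import tensorflow as tf

EMBED_DIM = 300    # must match the embedding dimensionality used above
NUM_CLASSES = 5    # assumed number of news categories

model = tf.keras.Sequential([
    tf.keras.Input(shape=(EMBED_DIM,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Synthetic data just to demonstrate the training call; real inputs would be
# document vectors built from the domain embeddings.
X_demo = np.random.rand(32, EMBED_DIM).astype("float32")
y_demo = np.random.randint(0, NUM_CLASSES, size=32)
model.fit(X_demo, y_demo, epochs=1, verbose=0)
```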
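For the unsupervised setting, the abstract mentions KMeans and KMedoids for news category clustering. A minimal sketch using scikit-learn's KMeans over averaged document vectors follows (it reuses the `w2v` model and corpus file from the embedding sketch above); the cluster count is an assumption, and KMedoids could be substituted analogously, for example via the scikit-learn-extra package.

```python
# Minimal sketch: news-category clustering over averaged word vectors.
import numpy as np
from sklearn.cluster import KMeans

def doc_vector(tokens, model):
    """Average the vectors of in-vocabulary tokens (zero vector if none)."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

# Reuse the streamed corpus and the w2v model from the embedding sketch.
docs = [line.split() for line in open("bengali_news.txt", encoding="utf-8")]
X = np.vstack([doc_vector(d, w2v) for d in docs if d])

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)  # k assumed
print(kmeans.labels_[:10])
```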
