Abstract
As data generation expands, an increasing number of natural language processing (NLP) tasks need to be solved, and word representation plays a vital role in solving them. Computational word embeddings are well established for high-resource languages, but low-resource languages such as Bangla have so far had very limited models, toolkits, and datasets available. Considering this, in this paper an enhanced BanglaFastText word embedding model is developed in Python, comprising two large pre-trained Bangla FastText models (Skip-gram and CBOW). These models were trained on a large collected Bangla corpus of around 20 million data points, where each paragraph of text is treated as one data point. BanglaFastText outperformed Facebook's FastText by a significant margin. To evaluate and analyze the pre-trained models, the proposed work performed text classification on three popular Bangla text datasets, building models with several classical machine learning approaches as well as a deep neural network. The evaluations showed superior performance over existing word embedding techniques and Facebook's pre-trained Bangla FastText model for Bangla NLP; the models also achieve excellent results compared with the original work on these datasets. A Python toolkit is proposed that provides convenient access to the models for word embedding; for obtaining semantic relationships word-by-word or sentence-by-sentence; for sentence embedding usable with classical machine learning approaches; and for unsupervised fine-tuning on any Bangla linguistic dataset.
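The toolkit's own API is not reproduced here. As a minimal sketch, assuming the released Skip-gram model is distributed as a standard FastText binary (the filename bangla_fasttext_skipgram.bin is a placeholder, not the paper's), the official fasttext Python bindings can load it and expose the capabilities the abstract lists: word vectors, sentence vectors, and word-level semantic relationships.

import fasttext

# Load a pre-trained Bangla FastText binary (placeholder filename; any
# standard FastText .bin model loads the same way).
model = fasttext.load_model("bangla_fasttext_skipgram.bin")

# Word embedding: a dense vector for a single Bangla word.
vec = model.get_word_vector("বাংলা")

# Sentence embedding: an averaged, normalized vector for a whole sentence,
# directly usable as a feature for classical machine learning models.
sent_vec = model.get_sentence_vector("আমি বাংলায় কথা বলি")

# Semantic relationships: the k nearest words by cosine similarity.
print(model.get_nearest_neighbors("বাংলা", k=5))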
Highlights
This paper proposes two BanglaFastText word embedding models (Skip-gram [6] and CBOW), trained on the developed BanglaLM corpus; they outperform the existing pre-trained Facebook FastText [7] model and traditional vectorizer approaches such as Word2Vec (see the training sketch after this list)
Classical machine learning models as well as deep neural networks based on long short-term memory (LSTM) and convolutional neural networks (CNN) were used to perform classification, with the vector representations as features (see the classification sketch after this list)
Two pre-trained Bangla FastText word embedding models are presented, along with a toolbox; the models were trained on a huge Bangla corpus that includes both organized and non-organized datasets
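The paper does not publish its training script. As a rough sketch of the two training objectives named above, assuming a tokenized corpus (the sentences and all hyperparameters below are illustrative, not the paper's), gensim's FastText implementation can train both variants:

from gensim.models import FastText

# Illustrative corpus: one tokenized sentence per item. In practice this
# would be the BanglaLM corpus, with one paragraph per data point.
corpus = [
    ["আমি", "বাংলায়", "কথা", "বলি"],
    ["বাংলা", "একটি", "সমৃদ্ধ", "ভাষা"],
]

# sg=1 selects the Skip-gram objective; sg=0 selects CBOW.
# vector_size, window, min_count, and epochs are illustrative values only.
skipgram = FastText(sentences=corpus, vector_size=300, window=5,
                    min_count=1, sg=1, epochs=5)
cbow = FastText(sentences=corpus, vector_size=300, window=5,
                min_count=1, sg=0, epochs=5)

# Subword information lets FastText embed out-of-vocabulary words as well.
print(skipgram.wv.most_similar("বাংলা", topn=3))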
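As a sketch of the classification setup, where sentence vectors serve as features (the model path, sentences, and labels below are placeholders), any scikit-learn classifier can stand in for the classical approaches the paper evaluates:

import numpy as np
import fasttext
from sklearn.linear_model import LogisticRegression

model = fasttext.load_model("bangla_fasttext_skipgram.bin")  # placeholder path

# Placeholder labelled data: Bangla sentences with binary sentiment labels.
texts = ["চলচ্চিত্রটি চমৎকার ছিল", "বইটি একেবারেই ভালো লাগেনি"]
labels = [1, 0]

# Each sentence becomes a fixed-length dense feature vector.
X = np.vstack([model.get_sentence_vector(t) for t in texts])

# Logistic regression here is one example of a classical classifier that
# can consume these features; LSTM/CNN models would consume the word
# vectors sequence-wise instead.
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))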
Summary
Word representation, the vector depiction of words, has been demonstrated to achieve major results in language modeling and natural language processing (NLP) activities. Word embeddings capture both the semantic and syntactic information of words and can be used to measure word similarity in information retrieval (IR) [2] and NLP applications [3]. Because major public resources and benchmarks exist mainly for English and other resource-rich languages, most existing research is restricted to them. Bangla, by contrast, is the sixth most widely spoken language in the world.
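As a small illustration of measuring word similarity with embeddings (the model path and word pair are arbitrary examples, not from the paper), cosine similarity between two word vectors can be computed directly:

import numpy as np
import fasttext

model = fasttext.load_model("bangla_fasttext_skipgram.bin")  # placeholder path

def cosine(u, v):
    # Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(model.get_word_vector("ভাষা"), model.get_word_vector("বাংলা")))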