Abstract

In this AI-driven digital era, the pervasiveness of digital data is made possible by widespread and inexpensive access to the Internet. The Internet is continuously growing with data in many forms. Among them, textual data are a rich source of information in which people share their thoughts in written form. Social media, blogs, online newspapers, and government documents are some notable sources of textual data. Extracting information from this enormous amount of data by manual inspection is time-consuming, cumbersome, and sometimes impossible. Natural Language Processing (NLP) is the computational field that addresses these limitations by solving problems related to human language. Text summarization, named entity recognition, and question answering are examples of such problems in which a common task for a machine is to generate coherent text. In such scenarios, the input is a sequence of text and the output is also a sequence, but the two may differ in length. Sequence-to-Sequence (Seq2Seq) models address this scenario by utilizing layers of recurrent units. However, a simple Seq2Seq model fails to capture long-term dependencies in the input sequence. Research shows that an attention mechanism guides the model to concentrate on specific parts of the input. The existing literature shows a lack of quality research on this text generation problem for the Bangla language, whereas many other languages show excellent results. This work aims to develop such a system by incorporating attention into the Seq2Seq model and justifies its applicability by comparing it with baseline models. The model perplexity shows that the system can generate human-readable text using a preprocessed dataset.
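
The following is a minimal sketch of the kind of attention-based Seq2Seq architecture described above, assuming PyTorch, GRU recurrent units, and dot-product (Luong-style) attention; the hyperparameters, vocabulary size, and start-token id are illustrative assumptions and do not reflect the paper's exact configuration.

```python
# Minimal sketch: GRU encoder + decoder with dot-product attention (assumed setup).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) token ids
        embedded = self.embedding(src)
        outputs, hidden = self.gru(embedded)  # outputs: (batch, src_len, hid_dim)
        return outputs, hidden

class AttnDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim * 2, vocab_size)

    def forward(self, prev_token, hidden, enc_outputs):
        # prev_token: (batch, 1) previous output token; hidden: (1, batch, hid_dim)
        embedded = self.embedding(prev_token)
        dec_out, hidden = self.gru(embedded, hidden)               # (batch, 1, hid_dim)
        # Dot-product attention: score each encoder state against the decoder state
        scores = torch.bmm(dec_out, enc_outputs.transpose(1, 2))   # (batch, 1, src_len)
        weights = torch.softmax(scores, dim=-1)
        context = torch.bmm(weights, enc_outputs)                  # (batch, 1, hid_dim)
        logits = self.out(torch.cat([dec_out, context], dim=-1))   # (batch, 1, vocab)
        return logits, hidden, weights

# Toy usage: encode a batch of source ids and run one decoding step.
if __name__ == "__main__":
    vocab = 1000
    enc, dec = Encoder(vocab), AttnDecoder(vocab)
    src = torch.randint(0, vocab, (4, 12))          # 4 sentences, 12 tokens each
    enc_out, hidden = enc(src)
    prev = torch.zeros(4, 1, dtype=torch.long)      # assumed <sos> token id = 0
    logits, hidden, attn = dec(prev, hidden, enc_out)
    print(logits.shape)                             # torch.Size([4, 1, 1000])
```

Under this setup, the reported perplexity would correspond to the exponential of the average cross-entropy of the decoder's per-token logits over a held-out set; lower perplexity indicates more fluent generated text.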
