Algorithm for Bengali Keyword Extraction

Md Ruhul Amin, Madhusodan Chakraborty

doi:10.1109/icbslp.2018.8554574

Abstract

We present an algorithm for keyword extraction from a Bengali document. In natural language processing (NLP), keyword extraction is the automated process of identifying a set of terms that represent the information discussed in a document. Much research has been done on keyword extraction for resource-rich languages. Some of that work followed a supervised approach using a specific corpus, whereas the latest techniques use an unsupervised approach. Keyword extraction has already achieved state-of-the-art performance for resource-rich languages. Only a few works address keyword extraction for documents in Bengali, and none of them achieved more than 70% accuracy. In this article, we discuss methods for extracting Bengali keywords from a specific document collection following an unsupervised learning approach. Bengali keyword extraction is generally difficult because it depends on word parsing, stemming, and stop-word removal, and the accuracy of those modules also affects the performance of the keyword extraction procedure. Nevertheless, we obtained 87% accuracy in identifying the correct Bengali keywords from a document. The procedure we discuss for keyword extraction can also be applied to any language, but here we report all of our experimental results specifically for the Bengali language.
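To make the three modules named in the abstract concrete (word parsing, stemming, stop-word removal), the following is a minimal sketch of how they might feed an unsupervised keyword ranker. It is not the authors' algorithm: the stop-word list, the suffix list, the `min_stem_len` threshold, and ranking by raw frequency are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny resources (assumptions, not from the paper).
# A real system would need far larger, curated lists.
STOP_WORDS = {"এবং", "ও", "এই", "যে", "করে", "থেকে", "একটি"}
SUFFIXES = ("গুলো", "গুলি", "দের", "টি", "টা", "কে", "র")


def tokenize(text):
    """Word parsing: keep maximal runs of characters in the Bengali
    Unicode block (U+0980 to U+09FF); everything else is a separator."""
    return re.findall(r"[\u0980-\u09FF]+", text)


def light_stem(word, min_stem_len=3):
    """Crude suffix stripping: remove the first listed suffix that matches,
    but only if at least min_stem_len code points would remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def extract_keywords(text, top_k=5):
    """Rank stems by frequency after stop-word removal (assumed scoring)."""
    stems = [light_stem(t) for t in tokenize(text) if t not in STOP_WORDS]
    return [stem for stem, _ in Counter(stems).most_common(top_k)]
```

Calling `extract_keywords(document_text, top_k=10)` on a Bengali string returns up to ten candidate stems. An error in any one stage (a missed stop word, an over-stripped suffix) propagates directly into the ranking, which illustrates why the accuracy of these modules affects the overall result.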

Full Text