Abstract

A spell and grammar checker is essential for publications of all kinds, particularly in Bangla, which is spoken by millions of native speakers around the world. Given the lack of research effort in this area, we present the development of a comprehensive Bangla spell and grammar checker together with the necessary resources. First, we built a full-fledged, general-purpose Bangla monolingual corpus of over 100 million words by scraping reputable and diverse online sources, and then extracted from it an extensive Bangla lexicon of over 1 million unique words. Based on this corpus and lexicon, we developed a combined spell and grammar checker application that simultaneously detects spelling and grammatical mistakes and provides appropriate suggestions for both. The spell checker uses the Double Metaphone algorithm and edit distance, backed by the distributed lexicons and a numerical suffix dataset, to detect all types of Bangla spelling mistakes with an individual accuracy of 97.21%. The grammar checker detects errors using language model probability, i.e., a combination of bigrams and trigrams, and generates suggestions using the cosine similarity measure, with an individual accuracy of 94.29%. The datasets and code used in this work are freely available at https://git.io/JzJ4w.
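To make the edit-distance part of the spell checker concrete, the following is a minimal sketch of how suggestions might be ranked against a lexicon. It is an illustration under stated assumptions, not the authors' implementation: the function names `edit_distance` and `suggest`, the thresholds `max_distance` and `limit`, and the tiny lexicon are all hypothetical, and the phonetic Double Metaphone step and the numerical suffix dataset are not modelled here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard row-by-row recurrence.

    Note: this compares Unicode code points, so a Bangla vowel sign counts
    as its own symbol. That is an assumption of this sketch.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(word, lexicon, max_distance=2, limit=5):
    """Return up to `limit` lexicon words within `max_distance` edits.

    A word already in the lexicon is treated as correct and gets no
    suggestions. Both thresholds are hypothetical defaults.
    """
    if word in lexicon:
        return []
    scored = ((edit_distance(word, w), w) for w in lexicon)
    ranked = sorted((d, w) for d, w in scored if d <= max_distance)
    return [w for _, w in ranked[:limit]]


if __name__ == "__main__":
    toy_lexicon = {"hello", "help", "world"}
    print(suggest("helo", toy_lexicon))
```

A linear scan like this would be slow against a lexicon of over 1 million words; a real system would likely narrow the candidate set first, for example with phonetic keys, before computing distances.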

Highlights

  • This section reviews related work on three interconnected but distinct segments of our proposed system: corpus and lexicon, spell checker, and grammar checker.

  • By studying the available Bangla corpora, lexicons, and spell and grammar checkers, we have identified several limitations in current approaches, including the scarcity of balanced and extensive corpora, substantial lexicons, and efficient spell and grammar checkers.

  • We have developed a combined solution for spell and grammar checking; the details of the two components are presented in separate sections for simplicity and better understanding.


Summary

Introduction

This section reviews related work on three interconnected but distinct segments of our proposed system: corpus and lexicon, spell checker, and grammar checker.

A. CORPUS AND LEXICON

A corpus is a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject. A lexicon is a vocabulary: a collection of words, or the complete set of meaningful units in a language. The Central Institute of Indian Languages (CIIL) [5] first introduced a Bengali corpus, along with corpora of nine other Indian languages, in 2001.
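The relationship between a corpus and a lexicon can be illustrated with a short sketch that derives a word list from raw text. This is only an assumed approach for illustration: the name `build_lexicon`, the `min_count` parameter, and the tokenisation rule (runs of characters from the Unicode Bengali block, U+0980 to U+09FF, which also includes Bengali digits) are hypothetical and not taken from the paper.

```python
import re
from collections import Counter

# Runs of characters from the Unicode Bengali block (U+0980..U+09FF).
# The danda "।" (U+0964) lies outside this block, so it acts as a separator.
BANGLA_TOKEN = re.compile(r"[\u0980-\u09FF]+")


def build_lexicon(lines, min_count=1):
    """Count Bangla tokens across corpus lines and keep the frequent ones.

    Returns a dict mapping each unique token to its corpus frequency.
    `min_count` is a hypothetical filter for dropping rare noise tokens.
    """
    counts = Counter()
    for line in lines:
        counts.update(BANGLA_TOKEN.findall(line))
    return {word: n for word, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    corpus = ["আমি ভাত খাই।", "আমি বই কিনি।"]
    for word, n in sorted(build_lexicon(corpus).items()):
        print(word, n)
```

Keeping frequencies alongside the unique words is a design choice of this sketch; the counts could later feed the language model or help rank spelling suggestions.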
