Web-Based Bengali News Corpus for Lexicon Development and POS Tagging

Asif Ekbal, Sivaji Bandyopadhyay

doi:10.17562/pb-37-3

Abstract

Lexicon development and Part of Speech (POS) tagging are important for almost all Natural Language Processing (NLP) applications. Rapidly developing these resources and tools with machine learning techniques for less computerized languages requires an appropriately tagged corpus. We use a Bengali news corpus developed from the web archive of a widely read Bengali newspaper. The corpus contains approximately 34 million wordforms. This corpus is used for lexicon development without employing extensive knowledge of the language. We have developed POS taggers using a Hidden Markov Model (HMM) and a Support Vector Machine (SVM). The lexicon contains around 128 thousand entries, and a manual check yields an accuracy of 79.6%. The POS taggers were initially developed for Bengali, where they achieved accuracies of 85.56% for HMM and 91.23% for SVM. Based on the Bengali news corpus, we identify various word-level orthographic features to use in the POS taggers. The lexicon and a Named Entity Recognition (NER) system, developed using this corpus, are also used in POS tagging. The POS taggers are then evaluated with Hindi and Telugu data. The evaluation results demonstrate that SVM performs better than HMM for all three Indian languages.
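
For readers unfamiliar with HMM-based tagging, the following is a minimal sketch of Viterbi decoding for a bigram HMM tagger. It is not the authors' implementation: the two-tag tag set, the English toy words, every probability value, and the names `viterbi`, `TAGS`, `START`, `TRANS`, `EMIT`, and `unk` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy model; none of these values come from the paper.
TAGS = ["NOUN", "VERB"]
START = {"NOUN": 0.8, "VERB": 0.2}                 # P(tag at sentence start)
TRANS = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},       # P(next tag | current tag)
         "VERB": {"NOUN": 0.6, "VERB": 0.4}}
EMIT = {"NOUN": {"dogs": 0.6, "bark": 0.1},        # P(word | tag)
        "VERB": {"dogs": 0.05, "bark": 0.7}}


def viterbi(words, tags, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable tag sequence for `words` under a bigram HMM.

    `unk` is a small fallback emission probability for unseen words.
    """
    if not words:
        return []
    # Each column maps tag -> (best log-probability so far, previous tag).
    columns = [{
        t: (math.log(start_p[t]) + math.log(emit_p[t].get(words[0], unk)), None)
        for t in tags
    }]
    for word in words[1:]:
        prev = columns[-1]
        col = {}
        for t in tags:
            # Pick the predecessor tag that maximises the path score into t.
            score, best_prev = max(
                (prev[s][0] + math.log(trans_p[s][t]), s) for s in tags
            )
            col[t] = (score + math.log(emit_p[t].get(word, unk)), best_prev)
        columns.append(col)
    # Backtrace from the best final tag to recover the full sequence.
    tag = max(tags, key=lambda t: columns[-1][t][0])
    path = [tag]
    for col in reversed(columns[1:]):
        tag = col[tag][1]
        path.append(tag)
    return path[::-1]


print(viterbi(["dogs", "bark"], TAGS, START, TRANS, EMIT))
```

In a real tagger the start, transition, and emission probabilities would be estimated from a tagged corpus rather than set by hand, and unknown-word handling would typically be more elaborate than a single constant.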

Journal: Polibits
Publication Date: Jun 30, 2008
Citations: 49

Similar Papers

Hidden Markov Model based Part of Speech Tagging for Nepali language
Abhijit Paul ... Bipul Syam Purkayastha
01 Sep 2015

Part of Speech Tagging in Bengali Using Support Vector Machine
Asif Ekbal ... Sivaji Bandyopadhyay
01 Dec 2008

Combination of Genetic Algorithm and Brill Tagger Algorithm for Part of Speech Tagging Bahasa Madura
Nindian Puspa Dewi ... Ubaidi Ubaidi
Proceeding of the Electrical Engineering Computer Science and Informatics | VOL. 7
01 Oct 2020

Part-of-Speech Tagging for Arabic Text using Particle Swarm Optimization and Genetic Algorithm
Ahmad T Al-Taani ... Fadi A Alkhazaaleh
Recent Advances in Computer Science and Communications | VOL. 15
01 Jun 2022
