Abstract

Named entity recognition (NER) is the task of identifying proper nouns in natural language text and classifying them into types such as location, person, and organization. Because NER supports many natural language processing (NLP) tasks, numerous NER approaches and benchmark datasets have been proposed. However, the development of NER techniques for low-resource languages remains limited by the absence of substantial training datasets. Punjabi is a classic example of a low-resource language. Although various researchers have worked on Punjabi NER, they have focused on the Gurmukhi script. To address the challenges of developing NER for the Shahmukhi script, we present an improved technique for Punjabi NER in Shahmukhi. We first extend the existing dataset by adding new NER classes, leveraging a novel Pool of Words data augmentation strategy. The extended dataset contains 1,131,509 tokens and 125,789 labeled entities, with more named entities (NEs) than the older dataset. Next, we fine-tune a transformer model, Bidirectional Encoder Representations from Transformers (BERT), for the NER task. Experiments on both the new and the older versions of the dataset show that our method achieves competitive results.
