Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task that aims to identify and classify predefined named entities in a given span of text. For many Western and Asian languages, NER is a systematically studied and well-established task; however, little work has been done for Shahmukhi. This paper presents Shahmukhi NER with four key contributions. First, a Bidirectional Long Short-Term Memory (BiLSTM) network-based NER model has been developed, incorporating features that include character and word embeddings and Part-of-Speech (POS) tags. Second, transfer learning has been employed by training context-free Word2Vec and contextualized Embeddings from Language Models (ELMo) word representations on a Shahmukhi corpus of 14.9 million words. Third, we prepared a cleaner version of an existing Shahmukhi NER corpus by performing Unicode normalization and tokenization. The corpus has been deduplicated, and results are reported on an unseen evaluation set. Fourth, we studied the impact of two annotation schemes, Inside-Outside (IO) and Inside-Outside-Beginning (IOB), for Shahmukhi. Transfer learning proved helpful in enhancing the performance of the NER models; in particular, ELMo embeddings significantly improved the results by producing contextualized embedding vectors. This is the first study to use character embeddings, POS tagging, and transfer learning for Shahmukhi named entity recognition. The IO-scheme-based model achieved an accuracy of 98.60% with an F-score of 83.75, while the IOB-scheme-based model achieved an accuracy of 98.43% and an F-score of 75.55. These scores are quite promising for an under-resourced, morphologically rich language.
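The difference between the two annotation schemes compared in the paper can be sketched as follows. This is a generic illustration, not the paper's code; the sentence and entity spans are invented (shown in English for readability, whereas the paper works on Shahmukhi text). Note that IOB can distinguish two adjacent entities of the same type, while IO cannot.

```python
def tag_io(tokens, entities):
    """IO scheme: tag a token I-<TYPE> if inside any entity span, else O."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:  # end is exclusive
        for i in range(start, end):
            tags[i] = f"I-{etype}"
    return tags

def tag_iob(tokens, entities):
    """IOB scheme: first token of each span is B-<TYPE>, the rest I-<TYPE>."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags

# Invented example: "Punjab" and "Pakistan" annotated as two adjacent
# location entities.
tokens = ["Lahore", "is", "in", "Punjab", "Pakistan"]
entities = [(0, 1, "LOC"), (3, 4, "LOC"), (4, 5, "LOC")]

print(tag_io(tokens, entities))
# ['I-LOC', 'O', 'O', 'I-LOC', 'I-LOC']  -- adjacent entities merge
print(tag_iob(tokens, entities))
# ['B-LOC', 'O', 'O', 'B-LOC', 'B-LOC']  -- boundaries preserved
```

Under IO the two adjacent spans become indistinguishable from a single two-token entity, which is one reason IOB is often preferred despite its larger tag set.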