Abstract

Named Entity Recognition (NER) is one of the preliminary tasks in Natural Language Processing (NLP) pipelines. It marks proper nouns and other named entities such as Location, Person, Organization, and Disease. Without an NER module, such entities adversely affect the performance of a machine translation system. NER helps overcome this problem by recognizing such entities so that they can be handled separately, and it is also useful in Information Extraction systems. Bhojpuri, Maithili, and Magahi are low-resource languages, usually known as Purvanchal languages. This article focuses on the development of an NER benchmark dataset for Machine Translation systems that translate from these languages to Hindi, built by annotating parts of the available corpora with named entities. Bhojpuri, Maithili, and Magahi corpora of 228,373, 157,468, and 56,190 tokens, respectively, were annotated using 22 entity labels. The annotation uses coarse-grained labels, following the tagset used in one of the Hindi NER datasets. We also report a Deep Learning baseline that uses an LSTM-CNNs-CRF model. The lower-baseline F1-scores, obtained from an NER tool that uses Conditional Random Fields (CRF) models, are 70.56% for Bhojpuri, 73.19% for Maithili, and 84.18% for Magahi. The Deep Learning-based technique (LSTM-CNNs-CRF) achieved 61.41% for Bhojpuri, 71.38% for Maithili, and 86.39% for Magahi. As the results show, LSTM-CNNs-CRF fails to outperform the lower baseline for Bhojpuri and Maithili, which have more data in terms of the number of tokens but not in terms of the number of named entities. However, cross-lingual training of the LSTM-CNNs-CRF model for Bhojpuri and Maithili performed better than the CRF.
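
The abstract reports F1-scores without saying how they are computed. As a minimal sketch, assuming the common convention of exact-match, entity-level scoring over BIO-encoded tags (an assumption; the abstract states neither the tagging scheme nor the scoring script), precision, recall, and F1 could be computed as below. The function names and the `PER` and `LOC` labels are hypothetical and are not taken from the dataset's 22-label tagset.

```python
from typing import List, Set, Tuple

# (label, start index, end index exclusive); hypothetical representation
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence (BIO is assumed)."""
    spans: Set[Span] = set()
    label, start = None, 0
    # A trailing "O" sentinel flushes a span that ends at the sentence end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == label
        if not continues:
            if label is not None:
                spans.add((label, start, i))
                label = None
            # A stray "I-X" with no matching open span is treated as a new span.
            if tag.startswith("B-") or tag.startswith("I-"):
                label, start = tag[2:], i
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Exact-match, entity-level F1 over a list of tagged sentences."""
    gold_spans, pred_spans = set(), set()
    for idx, (g, p) in enumerate(zip(gold, pred)):
        # Prefix each span with its sentence index so spans stay distinct.
        gold_spans |= {(idx,) + s for s in extract_spans(g)}
        pred_spans |= {(idx,) + s for s in extract_spans(p)}
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: the prediction finds the person but misses the location.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
score = entity_f1(gold, pred)
```

In this toy example precision is 1.0 and recall is 0.5, so F1 is 2/3. Under exact-match scoring a prediction with a wrong boundary or a wrong label earns no credit, which is one reason a corpus with more tokens but fewer named entities can still yield a lower score.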
