Abstract

Named entity recognition (NER) is an important task in natural language processing, as it is widely featured as a key information extraction sub-task with numerous application areas. A plethora of attempts was made for NER detection in Western and Asian languages. However, little effort has been made to develop techniques for the Urdu language, which is a prominent South Asian language with hundreds of millions of speakers across the globe. NER in Urdu is considered a hard problem owing to several reasons, including the paucity of large, annotated datasets; an inaccurate tokenizer; and the absence of capitalization in the Urdu language. To this end, this study proposed a conditional-random-field-based technique with both language-dependent and language-independent features, such as part-of-speech tags and context windows of words, respectively. As a second contribution, we developed an Urdu NER dataset (UNER-I) in which a large number of NE types were manually annotated. To evaluate the effectiveness of the proposed approach, as well as the usefulness of the dataset, experiments were performed using the dataset we developed and an existing dataset. The results of the experiments showed that our proposed technique outperformed the baseline technique for both datasets by improving the F1 scores by 1.5% to 3%. Furthermore, the results demonstrated that the enhanced dataset was useful for learning and prediction in a supervised learning approach.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.