Abstract

Khasi is a Mon-Khmer language of the Austro-Asiatic language family, spoken by the indigenous people of the state of Meghalaya in North-Eastern India. The main purpose of this paper is to develop a Part-of-Speech (PoS) tagger for the Khasi language using a rule-based approach. PoS tagging requires a grammatically tagged corpus, but Khasi has no standard corpus for PoS tagging. A second aim of this paper is therefore to develop a Khasi lexicon, or PoS corpus, and to use Brill's rule-based transformation learning to tag Khasi text automatically. While anticipating the challenges in building such a corpus, this paper presents an analysis based on a Khasi corpus of around 103,998 words in its initial phase. We also show how the Khasi corpus was created. Applying Brill's transformation-based rule learning to the created corpus in this preliminary study yielded accuracies of 97.73% and 95.52% on the validation and test data, respectively. This work is the first attempt to investigate PoS tagging with a rule-based model on the designed Khasi PoS corpus.
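For readers unfamiliar with Brill-style transformation-based learning, the following is a minimal sketch of the general idea, not the implementation used in this paper. It assumes a single rule template ("change tag A to B when the previous tag is C"), a most-frequent-tag baseline, and a toy English corpus standing in for Khasi data. All function names and the sample sentences are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sents):
    """Build a lexicon mapping each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def baseline_tag(words, lexicon, default="NN"):
    """Initial-state tagger: lexicon lookup, with an assumed default for unknown words."""
    return [lexicon.get(w, default) for w in words]

def apply_rule(tags, rule):
    """Apply one rule (from_tag, to_tag, prev_tag) across a tag sequence."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(tags)):
        # Conditions are checked against the input tags (delayed-effect variant).
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def count_errors(tag_seqs, gold_seqs):
    """Number of positions where predicted and gold tags differ."""
    return sum(t != g for ts, gs in zip(tag_seqs, gold_seqs) for t, g in zip(ts, gs))

def learn_rules(tagged_sents, lexicon, max_rules=10):
    """Greedily pick the rule that most reduces training errors, then repeat."""
    words = [[w for w, _ in s] for s in tagged_sents]
    gold = [[t for _, t in s] for s in tagged_sents]
    current = [baseline_tag(ws, lexicon) for ws in words]
    tagset = sorted({t for g in gold for t in g})
    rules = []
    for _ in range(max_rules):
        best_rule, best_err = None, count_errors(current, gold)
        for a in tagset:
            for b in tagset:
                if a == b:
                    continue
                for c in tagset:
                    rule = (a, b, c)
                    err = count_errors([apply_rule(ts, rule) for ts in current], gold)
                    if err < best_err:
                        best_rule, best_err = rule, err
        if best_rule is None:  # no rule improves the tagging any further
            break
        rules.append(best_rule)
        current = [apply_rule(ts, best_rule) for ts in current]
    return rules

def tag(words, lexicon, rules):
    """Tag a sentence: baseline first, then learned rules in the order learned."""
    tags = baseline_tag(words, lexicon)
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags

# Toy English data (placeholder only; not taken from the Khasi corpus).
toy_corpus = [
    [("the", "DT"), ("race", "NN")],
    [("the", "DT"), ("race", "NN")],
    [("to", "TO"), ("race", "VB")],
    [("to", "TO"), ("run", "VB")],
]
lexicon = train_baseline(toy_corpus)
rules = learn_rules(toy_corpus, lexicon)
print(rules)
print(tag(["to", "race"], lexicon, rules))
```

In this toy setting the baseline mis-tags "race" after "to" as a noun, and the learner recovers a single corrective rule for that context. A practical tagger would use several rule templates (neighbouring words, tags two positions away, affixes) and a much larger annotated corpus.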
