Rule-based Part of Speech Tagger for Indonesian Language

K K Purnamasari, I S Suwardi

doi:10.1088/1757-899x/407/1/012151

Abstract

Lexical processing, such as detecting root words (stemming) and word classes (part-of-speech tagging), is an important determinant for language computing systems that require the detection of sentence structure or patterns. For Indonesian, a problem encountered in lexical processing is the lack of an annotated corpus. The POS tagger built in this research therefore does not use an annotated corpus, but instead utilizes KBBI (the large Indonesian dictionary) and a set of morphological rules that govern changes in word form. The method begins by converting the input text into tokens and stemming them. Each token is checked against the base-word dictionary. If the token is not found there, it goes through stemming and affix detection. This step produces a list of POS tags for the base words and their affixes. By collating this output according to the rules of grammar, the type of each affixed word can be determined. Testing is done by comparing, for every input token, the detection result with the word type listed in KBBI. The accuracy score is calculated as the number of correct results divided by the total number of tokens examined. In the tests performed, the tagger achieved an average accuracy of 87.4% across four parts of the Indonesian PAN Localization corpus. Incorrect results were caused by some mistaken tags in the existing KBBI and by the presence of ambiguous words (words with more than one POS tag). Improvement should therefore be possible by using a more complete Indonesian dictionary and adding word-sense disambiguation.
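The pipeline described in the abstract (tokenize, look up the base-word dictionary, fall back to affix detection, then score accuracy) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny `BASE_WORDS` table stands in for KBBI, the tag names are arbitrary, and `PREFIX_RULES` and `SUFFIX_RULES` are a drastically simplified stand-in for the paper's morphological rules, in which the affix alone decides the tag of the derived word. It is not the authors' implementation.

```python
import re

# Hypothetical toy dictionary standing in for KBBI (assumption, not real KBBI data).
BASE_WORDS = {
    "makan": "VERB",  # to eat
    "main": "VERB",   # to play
    "buku": "NOUN",   # book
    "baik": "ADJ",    # good
    "saya": "PRON",   # I
}

# Hypothetical, simplified affix rules: the affix decides the tag of the derived word.
PREFIX_RULES = {"ber": "VERB", "pe": "NOUN"}
SUFFIX_RULES = {"an": "NOUN"}


def tokenize(text):
    """Split the input text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tag_token(token):
    """Tag one token: dictionary lookup first, then affix detection."""
    if token in BASE_WORDS:
        return BASE_WORDS[token]
    # Strip a known prefix and check whether the remainder is a base word.
    for prefix, tag in PREFIX_RULES.items():
        if token.startswith(prefix) and token[len(prefix):] in BASE_WORDS:
            return tag
    # Strip a known suffix and check whether the remainder is a base word.
    for suffix, tag in SUFFIX_RULES.items():
        if token.endswith(suffix) and token[:-len(suffix)] in BASE_WORDS:
            return tag
    return "UNK"  # no dictionary entry and no matching rule


def tag_text(text):
    """Return (token, tag) pairs for every token in the text."""
    return [(token, tag_token(token)) for token in tokenize(text)]


def accuracy(predicted, reference):
    """Number of correct tags divided by the total number of tokens examined."""
    if not reference:
        raise ValueError("reference must not be empty")
    correct = sum(1 for p, r in zip(predicted, reference) if p == r)
    return correct / len(reference)
```

For example, `tag_text("Saya bermain buku makanan")` tags `bermain` as a verb through the `ber` prefix rule and `makanan` as a noun through the `an` suffix rule, while `makan` itself is resolved directly from the dictionary. A real system would also need to handle stacked affixes, spelling changes at morpheme boundaries, and ambiguous words, none of which this sketch attempts.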

Full Text