Indonesian Part of Speech Tagging Using Hidden Markov Model – Ngram & Viterbi

Denis Eka Cahyani, Michael Juan Vindiyanto

doi:10.1109/icitisee48480.2019.9003989

Abstract

Part of Speech (POS) tagging is the process of labelling each word in a sentence with its word class. One problem in POS tagging is ambiguity: some words are spelt the same but take a different POS tag depending on the context of the sentence. The approach taken here to solve this problem combines the Hidden Markov Model (HMM) N-gram algorithm with the Viterbi algorithm. This study discusses the development of a system for Indonesian POS tagging using the HMM N-gram algorithm (Bigram and Trigram) and the Viterbi algorithm, and compares the results of the HMM Bigram and the HMM Trigram. A manually labelled Indonesian language corpus, the Indonesian Manually Tagged Corpus, is used as the knowledge source for the system. The corpus is first processed with the HMM N-gram algorithm to obtain the rules. The data is then processed with the Viterbi algorithm, using the previously formed rules, to determine the POS tag with the highest probability. The highest accuracy obtained is 77.56%, using the HMM Bigram-Viterbi algorithm, while the highest accuracy of the HMM Trigram-Viterbi algorithm is 61.67%. The results show that the system can solve the problem of tag ambiguity with the HMM N-gram and Viterbi algorithms, and that the accuracy of the HMM Bigram is better than that of the HMM Trigram.
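
The pipeline described in the abstract (estimate counts from a tagged corpus, then decode with Viterbi) can be illustrated with a minimal bigram sketch. This is not the authors' implementation: the toy corpus, the tag names (`PRP`, `MD`, `VB`, `NN`, `JJ`, `DT`), the function names, and the add-one smoothing are all assumptions made for illustration. The example uses the Indonesian word "bisa", which can act as a modal ("can") or a noun ("venom"), to show how context resolves the ambiguity.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start marker


def train_bigram_hmm(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of emitted word
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word.lower()] += 1
            prev = tag
    return transitions, emissions


def log_prob(counter, key, n_outcomes, alpha=1.0):
    """Add-alpha smoothed log probability (smoothing is an assumption)."""
    total = sum(counter.values())
    return math.log((counter[key] + alpha) / (total + alpha * n_outcomes))


def viterbi(words, transitions, emissions):
    """Return the most probable tag sequence under the bigram HMM."""
    if not words:
        return []
    tags = sorted(emissions)
    n_tags = len(tags)
    vocab = {w for counts in emissions.values() for w in counts}
    n_words = len(vocab) + 1  # one extra slot for unseen words

    # best[i][tag] = (best log score ending in tag at position i, previous tag)
    first = words[0].lower()
    best = [{
        t: (log_prob(transitions[START], t, n_tags)
            + log_prob(emissions[t], first, n_words), None)
        for t in tags
    }]
    for i in range(1, len(words)):
        word = words[i].lower()
        column = {}
        for t in tags:
            emit = log_prob(emissions[t], word, n_words)
            column[t] = max(
                (best[i - 1][p][0] + log_prob(transitions[p], t, n_tags) + emit, p)
                for p in tags
            )
        best.append(column)

    # Follow back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Toy corpus (invented for this sketch, not taken from the paper's corpus).
corpus = [
    [("saya", "PRP"), ("bisa", "MD"), ("makan", "VB")],
    [("dia", "PRP"), ("bisa", "MD"), ("tidur", "VB")],
    [("bisa", "NN"), ("ular", "NN"), ("berbahaya", "JJ")],
    [("ular", "NN"), ("itu", "DT"), ("berbahaya", "JJ")],
]
transitions, emissions = train_bigram_hmm(corpus)

print(viterbi(["dia", "bisa", "makan"], transitions, emissions))
# ['PRP', 'MD', 'VB']
print(viterbi(["bisa", "ular", "berbahaya"], transitions, emissions))
# ['NN', 'NN', 'JJ']
```

In this sketch "bisa" is tagged `MD` after a pronoun and `NN` at the start of a noun phrase, because the transition scores differ between the two contexts. A trigram variant would condition each transition on the two preceding tags instead of one, which needs more training data to estimate reliably.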

Full Text