POS Tagger Improvisation with the Addition of Foreign Word Labels on Telkom University News

Donni Richasdy, Mahendra Dwifebri Purbolaksono, Winkie Setyono

doi:10.47065/bits.v4i2.1983


Open Access


Abstract

News is a daily source of information for the public. A news article carries a large amount of information, expressed through sentences that follow the structure of their language, and each language, whether Indonesian or foreign, has its own sentence structure. Many media outlets now mix Indonesian with foreign languages, which produces sentence structures that differ from those of Bahasa Indonesia. Part-of-Speech (POS) tagging is needed to determine the word class of each word in a sentence, and a POS tagger learns these classes from a corpus of the language. Because mixed-language text introduces new sentence structures, the POS tagger requires a larger corpus to learn from. Sentence structure influences the tagging results, and words that do not appear in the corpus can reduce the accuracy of the POS tagger. We aim to improve on existing research results by adding to the Indonesian-language corpus sentences from online media whose structure differs from that of the corpus. About 242 sentences with 7,043 tokens were added to the corpus, focused on the Foreign Word (FW) tag, which totals 3,819 tags. After testing several scenarios, the POS tagger achieves an accuracy of 94.7% using the Hidden Markov Model method, with an F1-score of 78% for the FW tag.
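To make the Hidden Markov Model approach concrete, here is a minimal sketch of an HMM tagger with Viterbi decoding. It is not the authors' implementation. The toy sentences, the reduced tag set (`NN`, `VB`, `FW`), the add-one smoothing, and the function names `train_hmm`, `log_prob`, and `viterbi` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math

# Hypothetical toy corpus (not taken from the paper's data).
# FW marks a foreign word inside an Indonesian sentence.
TRAIN = [
    [("mahasiswa", "NN"), ("mengikuti", "VB"), ("workshop", "FW")],
    [("dosen", "NN"), ("membuka", "VB"), ("webinar", "FW")],
    [("mahasiswa", "NN"), ("membuka", "VB"), ("startup", "FW")],
]

def train_hmm(sentences):
    """Count tag transitions and word emissions from tagged sentences."""
    trans = defaultdict(Counter)  # trans[previous_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    vocab = set()
    for sent in sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log-probability (an assumed smoothing choice)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, trans, emit, tags, vocab):
    """Return the most probable tag sequence for a non-empty word list."""
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unseen words
    scores = {
        t: log_prob(trans["<s>"], t, n_tags) + log_prob(emit[t], words[0], n_words)
        for t in tags
    }
    backptrs = []
    for word in words[1:]:
        new_scores, ptrs = {}, {}
        for t in tags:
            # Best previous tag for reaching tag t at this position.
            prev = max(tags, key=lambda p: scores[p] + log_prob(trans[p], t, n_tags))
            new_scores[t] = (scores[prev]
                             + log_prob(trans[prev], t, n_tags)
                             + log_prob(emit[t], word, n_words))
            ptrs[t] = prev
        backptrs.append(ptrs)
        scores = new_scores
    # Follow back-pointers from the best final tag.
    tag = max(tags, key=scores.get)
    path = [tag]
    for ptrs in reversed(backptrs):
        tag = ptrs[tag]
        path.append(tag)
    return list(reversed(path))

model = train_hmm(TRAIN)
# "podcast" never appears in TRAIN; the transition VB -> FW still tags it as FW.
print(viterbi(["mahasiswa", "membuka", "podcast"], *model))  # ['NN', 'VB', 'FW']
```

In this sketch, an unseen word receives the same smoothed emission probability under every tag, so the tag is decided by the transition counts alone. This illustrates why adding sentences that contain foreign words in context can help a tagger of this kind.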
