Identifying Sentence Structure in Bahasa Indonesia by Using POS Tag and LALR Parser

Dani Gunawan, Hardiani Putri Siregar, Opim Salim Sitompul

doi:10.1109/icced46541.2019.9161125

Abstract

Sentence structure can be used to obtain the meaning of a sentence. Bahasa Indonesia has several sentence structure components: Subjek (Subject), Predikat (Predicate), Objek (Object), Keterangan (Adverb), and Pelengkap (Complement). This research aims to identify the sentence structure in Bahasa Indonesia. The pre-processing section includes text cleaning, tokenization, and POS tagging. Text cleaning removes punctuation that is not needed in the later steps. Tokenization splits each sentence into tokens, which serve as the input for the POS tagging process. Next, the processing section implements an LALR parser to determine the sentence structure based on the label (the output of POS tagging) of each token. This research conducts two evaluations. The first evaluation compares the sentence structures produced by an expert with those produced by the LALR parser. The sentences provided by the expert are highly standard and follow Bahasa Indonesia grammar. According to the result, the LALR parser successfully identifies all the sentence structures produced by the expert. The second evaluation applies the LALR parser to 150 sentences from several categories of local online newspaper articles. The application successfully identifies the sentence structure of 86.7% of these sentences and detects incorrect sentence structure in 3.3%.
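The pipeline described above (text cleaning, tokenization, POS tagging, then table-driven parsing over the POS labels) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the tiny `LEXICON`, the two-tag set (`N`, `V`), the four-rule toy grammar, and the hand-built `ACTION`/`GOTO` tables. The toy grammar covers only Subjek, Predikat, and Objek; the grammar used in the research would also need Keterangan and Pelengkap.

```python
import string

# Hypothetical toy lexicon (not from the paper): word -> POS tag.
LEXICON = {"adik": "N", "ibu": "N", "buku": "N", "nasi": "N",
           "membaca": "V", "memasak": "V", "tidur": "V"}

def preprocess(sentence):
    """Text cleaning, tokenization, and dictionary-based POS tagging."""
    cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.lower().split()
    # Unknown words get the tag "X", which the parser below rejects.
    return [(tok, LEXICON.get(tok, "X")) for tok in tokens]

# Toy grammar over POS tags (an assumption for illustration):
#   1: S  -> NP VP
#   2: VP -> V
#   3: VP -> V NP
#   4: NP -> N
RULES = {1: ("S", 2), 2: ("VP", 1), 3: ("VP", 2), 4: ("NP", 1)}

# Hand-built LALR(1) tables for the toy grammar above.
ACTION = {
    (0, "N"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "V"): ("shift", 5),
    (3, "V"): ("reduce", 4), (3, "$"): ("reduce", 4),
    (4, "$"): ("reduce", 1),
    (5, "N"): ("shift", 3), (5, "$"): ("reduce", 2),
    (6, "$"): ("reduce", 3),
}
GOTO = {(0, "S"): 1, (0, "NP"): 2, (2, "VP"): 4, (5, "NP"): 6}

def build(rule, children):
    """Semantic action: attach sentence-structure labels during reduction."""
    if rule == 4:                       # NP -> N: keep the word itself
        return children[0]
    if rule == 2:                       # VP -> V
        return [("Predikat", children[0])]
    if rule == 3:                       # VP -> V NP
        return [("Predikat", children[0]), ("Objek", children[1])]
    return [("Subjek", children[0])] + children[1]   # S -> NP VP

def parse(tagged):
    """Shift-reduce driver; returns labeled components or None on error."""
    words = [w for w, _ in tagged]
    tags = [t for _, t in tagged] + ["$"]   # "$" marks end of input
    states, values, pos = [0], [], 0
    while True:
        act = ACTION.get((states[-1], tags[pos]))
        if act is None:
            return None                     # no table entry: structure rejected
        kind, arg = act
        if kind == "shift":
            states.append(arg)
            values.append(words[pos])
            pos += 1
        elif kind == "reduce":
            lhs, n = RULES[arg]
            children = values[-n:]
            del values[-n:]
            del states[-n:]
            values.append(build(arg, children))
            states.append(GOTO[(states[-1], lhs)])
        else:                               # accept
            return values[0]

print(parse(preprocess("Adik membaca buku.")))
```

In this sketch a sentence whose tag sequence has no entry in the table, such as one that starts with a verb, returns `None`, which loosely mirrors the idea of detecting an incorrect sentence structure. A real system would generate the tables with an LALR parser generator rather than writing them by hand.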

Full Text