Abstract

Text documents are stored on a system in unstructured form, so the information they contain cannot be extracted directly. Extracting it requires text processing, which begins with a preprocessing stage that converts the documents into a more structured form by selecting the words to be used as indexes. The smaller the index, the more text documents the system can recognize and the more easily the information can be extracted. The size of the index is determined by the number of word groups formed. To avoid forming too many word groups, each word is first reduced to its base word before grouping. The process of converting an affixed word into its base word using certain rules is called stemming. This research aims to produce a new Indonesian stemming algorithm, the UG18 Stemmer, that reduces or eliminates the stemming errors (over-stemming and under-stemming) found in existing stemming algorithms, including the Enhanced Confix Stripping (ECS) Stemmer and the New Enhanced Confix Stripping (NECS) Stemmer. The method used is a morphophonemic approach, which treats affixes as bound morphemes that undergo phoneme change, phoneme addition, and phoneme removal. These three processes were mapped, and Finite State Automata were constructed to obtain new affix groups, sequences, and deletion methods, which form the basis of the UG18 Stemmer algorithm. The algorithm does not use the lists of decapitation rules used in earlier algorithms; those rules are replaced with morphophonemic-based elimination rules. Based on evaluation and testing, the UG18 Stemmer has a lower error rate than the NECS Stemmer.
This result can be seen in a randomized test of 2,500 words using relevance judgments validated by Indonesian language experts: error rates fell from 1.48% over-stemming and 16.69% under-stemming with the NECS Stemmer to 0.12% over-stemming and 0% under-stemming with the UG18 Stemmer. In addition, the UG18 Stemmer improved processing speed in an information retrieval-based document similarity measurement application by 45.47% compared with the ECS Stemmer.
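
To make the morphophonemic idea concrete, the following is a minimal sketch and not the UG18 Stemmer itself, whose rules are not given in this abstract. It assumes a tiny hypothetical root lexicon (`ROOTS`) and handles only the Indonesian prefix meN-, whose surface form changes with the first phoneme of the base word and can delete that phoneme (for example, pukul becomes memukul). The function name `strip_men_prefix` and the table `MEN_ALLOMORPHS` are illustrative names, not taken from the paper.

```python
# Toy root lexicon: an assumption for illustration only.
ROOTS = {"pukul", "tulis", "kirim", "sapu", "baca", "dengar", "ambil", "lihat"}

# Each entry pairs a surface form of the prefix meN- with the phonemes that
# may have been deleted from the start of the base word ("" means none).
# Longer surface forms come first so "meng" is tried before "men" and "me".
MEN_ALLOMORPHS = [
    ("meng", ["", "k"]),  # mengambil <- ambil, mengirim <- kirim
    ("meny", ["s"]),      # menyapu <- sapu
    ("mem", ["", "p"]),   # membaca <- baca, memukul <- pukul
    ("men", ["", "t"]),   # mendengar <- dengar, menulis <- tulis
    ("me", [""]),         # melihat <- lihat
]


def strip_men_prefix(word, roots=ROOTS):
    """Return the base word if removing meN- yields a known root, else the word."""
    for prefix, deleted_phonemes in MEN_ALLOMORPHS:
        if not word.startswith(prefix):
            continue
        remainder = word[len(prefix):]
        for phoneme in deleted_phonemes:
            # Restore a phoneme that the prefix may have removed.
            candidate = phoneme + remainder
            if candidate in roots:
                return candidate
    # No rule produced a known root: leave the word unchanged.
    return word


if __name__ == "__main__":
    for w in ["memukul", "menulis", "mengirim", "menyapu", "membaca", "makan"]:
        print(w, "->", strip_men_prefix(w))
```

Simply cutting "mem" from memukul would leave "ukul", which is not a base word; restoring the deleted phoneme is what recovers pukul. The lexicon lookup used here to choose between candidates is a simplification for the sketch and is not claimed to match how UG18 resolves such cases.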
