An arrangement of the number of K-grams in the performance of Rabin Karp algorithm in text adjustment

Yuli Astuti, Irma Wulandari

doi:10.11591/ijeecs.v22.i2.pp%p

Abstract

The Rabin Karp algorithm is often used to determine the similarity between texts. It uses a hash function to compare the string being searched for with the substrings of the text. The k value for the k-gram is often chosen arbitrarily, and because many k values can be used when cutting terms, trying them one by one is time-consuming. In this research, a word-cutting test is performed on a script using K-grams 0 to 8, and the results show the effect of each k value on the percentage of similarity produced. The research aims to determine the effect of the number of K-grams on the performance of Rabin Karp in text matching. The test used 20 sentences and was run 10 times, with the Dice Coefficient as the text similarity measure. The conclusion is that K-grams 0 to 2 should not be used, because the basic principle of the K-gram is cutting text into runs of characters: with a k of 0, 1, or 2, the resulting fragments do not yet carry meaning, so they produce a high percentage of similarity. Based on trials that sampled K-grams 0 to 8 on 10 test data sets, the researchers recommend K-gram 3 as the best among K-grams 0 to 8.
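
The pipeline described in the abstract (cut text into k-grams, hash them with a Rabin Karp rolling hash, compare the hash sets with the Dice Coefficient) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the base (256), the modulus (1,000,003), and the normalisation step (lowercasing and dropping non-alphanumeric characters) are all assumptions, since the abstract does not specify them.

```python
def kgram_hashes(text, k, base=256, mod=1_000_003):
    """Return the set of rolling-hash values for every k-gram in text.

    base, mod, and the normalisation below are illustrative assumptions.
    """
    # Assumed preprocessing: lowercase, keep only letters and digits.
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    if k <= 0 or len(s) < k:
        return set()  # no k-grams can be formed

    high = pow(base, k - 1, mod)  # weight of the leading character

    # Hash of the first k-gram, computed directly.
    h = 0
    for ch in s[:k]:
        h = (h * base + ord(ch)) % mod
    hashes = {h}

    # Roll the window: drop the leading character, append the next one.
    for i in range(k, len(s)):
        h = (h - ord(s[i - k]) * high) % mod
        h = (h * base + ord(s[i])) % mod
        hashes.add(h)
    return hashes


def dice_similarity(a, b, k):
    """Dice Coefficient 2|A ∩ B| / (|A| + |B|) over k-gram hash sets."""
    ha, hb = kgram_hashes(a, k), kgram_hashes(b, k)
    if not ha and not hb:
        return 0.0  # nothing to compare
    return 2 * len(ha & hb) / (len(ha) + len(hb))


if __name__ == "__main__":
    # Two anagrams share every single character but no 3-character run.
    for k in (1, 2, 3):
        print(k, dice_similarity("listen", "silent", k))
```

In this sketch, two anagrams score as fully similar at k = 1 because they share the same characters, which illustrates the abstract's point that very small k values inflate the similarity percentage. Comparing hash values only (without re-checking the underlying strings) means rare hash collisions could slightly overstate similarity.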
