Abstract

Purpose: Due to technological developments, more and more data are stored in text form, making text data processing more difficult. This also causes problems for text preprocessing algorithms, one of which arises when two texts are effectively identical but are treated as distinct by the algorithm. It is therefore necessary to normalize the text to obtain the standard form of words in a particular language. Spelling correction is often used to normalize text, but for Bahasa Indonesia there has been little research on spell-correction algorithms. A comparison is therefore needed to determine the most appropriate spelling-correction algorithm for effective normalization.

Methods: In this study, we compared three algorithms: Levenshtein Distance, Jaro-Winkler Distance, and Smith-Waterman. These algorithms were evaluated on questionnaire data and tweet data, both of which are in Bahasa Indonesia.

Result: Jaro-Winkler achieved the fastest normalization time, averaging 31.01 seconds for the questionnaire data and 59.27 seconds for the tweet data. Levenshtein Distance achieved the best accuracy, with 44.90% for the questionnaire data and 60.04% for the tweet data.

Novelty: The novelty of this research lies in comparing similarity-measure algorithms for Bahasa Indonesia, thereby identifying the similarity-measure algorithm most suitable for the language.
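To illustrate the kind of normalization the study evaluates, the following is a minimal sketch (not the paper's implementation) of Levenshtein Distance used to map a noisy token to its closest dictionary word. The sample dictionary and tokens are hypothetical, chosen only to mimic informal Indonesian spellings.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, keeping only one row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]

def normalize(token: str, dictionary: list[str]) -> str:
    # Replace a token with the dictionary entry at minimum edit distance.
    return min(dictionary, key=lambda w: levenshtein(token, w))

# Hypothetical mini-dictionary of standard Indonesian words:
kamus = ["tidak", "bagus", "sekali"]
print(normalize("tdk", kamus))  # "tdk" is closest to "tidak"
```

Jaro-Winkler and Smith-Waterman would plug into `normalize` the same way, but as similarity scores to be maximized rather than distances to be minimized.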

