Large Language Model-Based Detoxification for Bahasa Indonesia


Abstract This study develops a detoxification model for Indonesian text by leveraging Large Language Models (LLMs) to transform toxic content into neutral expressions while preserving original meaning. Addressing the lack of effective detoxification methods in Bahasa Indonesia – mainly due to the scarcity of parallel datasets – the research applies supervised learning by fine-tuning LLaMA3-8B and Sahabat-AI on crowdsourced parallel datasets, complemented by unsupervised techniques such as masking and paraphrasing. Human evaluation shows that the structurally enhanced Sahabat-AI model outperforms other approaches in reducing toxicity, preserving content, and ensuring fluency. While masking achieves the fastest inference time, it often fails to retain meaning; paraphrasing offers fluency but alters the intended meaning. The LLaMA3-8B model effectively retained meaning but left residual toxicity. These findings underscore the effectiveness of the enhanced Sahabat-AI model in detoxifying Indonesian text, contributing to healthier digital discourse, and preserving a more peaceful society.
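The abstract does not say how the masking baseline was implemented. As a minimal sketch, assuming a simple lexicon-based variant, the snippet below replaces any word found in a toxic-word list with a mask token. The lexicon, the `mask_toxic` name, and the `[MASK]` token are all hypothetical and chosen only for illustration. The sketch suggests why such an approach can be fast (a single pass over the tokens) yet lose meaning (the masked word is simply gone).

```python
import re

# Hypothetical toy lexicon; a real system would need a curated Indonesian word list.
TOXIC_LEXICON = {"bodoh", "goblok"}


def mask_toxic(text, lexicon=TOXIC_LEXICON, mask="[MASK]"):
    """Replace every word of `text` found in `lexicon` with `mask` (case-insensitive)."""

    def replace(match):
        word = match.group(0)
        # Keep the word unless its lowercase form is in the lexicon.
        return mask if word.lower() in lexicon else word

    # \w+ matches runs of letters, digits and underscores; punctuation is left untouched.
    return re.sub(r"\w+", replace, text)


if __name__ == "__main__":
    print(mask_toxic("Kamu bodoh sekali!"))
```

A supervised model fine-tuned on parallel data, as described in the abstract, would instead rewrite the whole sentence rather than blank out single words.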

Similar Papers
  • Research Article
  • 10.30870/jels.v2i2.2246
An Investigation of Students’ Strategies and Methods in Translating English Text into Indonesian
  • Sep 29, 2017
  • Journal of English Language Studies
  • Yudhie Indra G + 1 more

English plays an important role in every aspect of modern life, especially in education, and learners are expected to know the meaning and function of English words. Translation therefore has a main function in mastering English alongside the four skills of listening, speaking, reading, and writing. The objectives of this research were to find out whether the students translate English text, to describe the methods the students applied when translating the English text, and to observe the students' ability to translate it. The study was designed as a case study because it investigated and explored the students' strategies in one class of 23 students. The researcher administered a test in which the students were given an English text and asked to translate it as well as they could. Four criteria were used to assess the students' translations: message content accuracy, message distinctness, equivalence of language use, and mechanical appropriateness. The researcher also interviewed the translation lecturer, the key informant who knew how far the students were able to translate English text. The interview results, which were given to the teacher, showed that 4th-semester students still had many difficulties and used only one technique when translating, especially with words they seldom find in their English lesson books. The lecturer should therefore be creative in finding an appropriate translation method for 4th-semester students translating English text into Indonesian, and should keep motivating the students to understand the text by translating the source language accurately into the target language, Indonesian.
The students also needed more time to practice translating English text into Indonesian because they lacked the right method and sufficient vocabulary. Although the lecturer had already given exercises, these were not enough to build the students' interest in practicing translation. Keywords: Translation; Students Translation Strategies and Method; Indonesian Text

  • Research Article
  • 10.58788/alwijdn.v8i4.3002
Integration Of Islamic Values In Indonesian Procedural Text Material Grade VII SMP/MTS
  • Oct 26, 2023
  • AL-WIJDÁN Journal of Islamic Education Studies
  • Istiqomah Ramdhaniyah + 3 more

Indonesian language learning is often considered to contribute little or nothing to strengthening students' character. In fact, Indonesian language learning in junior and senior high schools can be oriented towards strengthening character. This study aims to analyze and integrate Islamic values in Indonesian procedural text lessons for grade VII SMP/MTs. The method is qualitative, using a literature study technique, and focuses on revealing Islamic values that can be integrated into the grade VII SMP/MTs Indonesian procedural text material as a means of strengthening students' character. The data sources were Indonesian procedural texts for grade VII SMP/MTs and literature related to this topic. The study took 5 examples from the structure of the procedure texts in the grade VII SMP/MTs Indonesian language textbook. The results show that Islamic values in these procedural texts include morals or praiseworthy traits, worship, piety to Allah SWT, challenges for the environment, and others, and that they are strengthened through literature sources that draw arguments from the Qur'an and hadith. The results of the study can serve as a reference on the integration of Islamic values in Indonesian procedural text lessons for grade VII SMP/MTs. Keywords: Indonesian Language, Integration, Islamic Values, Procedure Text, Character Education

  • Research Article
  • 10.26858/eralingua.v6i2.35099
The Shift of Adjunct Structures of Indonesian Translation of English in a News Portal: A Linguistic Study
  • Oct 17, 2022
  • Eralingua: Jurnal Pendidikan Bahasa Asing dan Sastra
  • Moh Khoirul Anam

Abstract. The purpose of the present study was to explore phrases that function as adjuncts in articles written in two versions: English and Bahasa Indonesia. The deep structure of the articles shows whether the Indonesian translations keep the same structure as the English or use another one, which reveals the strategy the translator uses in translating the articles. News articles from www.bbc.com were taken as the data of the present study. The data were gathered with a purposive sampling technique, chosen to obtain more comprehensive data. Twenty-two texts were studied, comprising eleven English texts and eleven Indonesian texts. The data were the same articles written in English and their translations in Bahasa Indonesia. The results show that, first, there were different variations of the constituents constructing the phrase when written in the two languages. Second, two different constituents can create a new phrase in which no head represents the phrase. Third, a preposition in Bahasa Indonesia can precede adjectives. Keywords: English, Indonesian, Adjunct, News Portal

  • Research Article
  • 10.29210/020244471
Digital humanities approaches to analyzing Indonesian language texts as non-western languages
  • Dec 25, 2024
  • JPPI (Jurnal Penelitian Pendidikan Indonesia)
  • Rastya Mutiarani Zahra

This research explores the integration of digital humanities methods in the analysis of Indonesian language texts to enhance linguistic and cultural understanding. The primary objective is to develop tailored digital humanities methodologies, applying computational tools such as text mining, natural language processing, and corpus linguistics to analyze linguistic and thematic patterns within Indonesian texts. By leveraging these techniques, the study aims to provide new insights into language use, cultural narratives, and historical shifts in Indonesia. A qualitative approach, including a literature review and case studies, is used to examine existing research and methodologies, and assess how digital tools can be effectively applied in this context. The study also addresses the accessibility of Indonesian textual data for researchers, educators, and students, proposing solutions to make these resources more usable and integrated into the global digital humanities framework. This research contributes to expanding the scope of digital humanities by incorporating Indonesian language texts, offering a model for future studies in non-Western linguistic traditions.

  • Research Article
  • 10.1088/1757-899x/1098/3/032041
A comparative review of extractive text summarization in Indonesian language
  • Mar 1, 2021
  • IOP Conference Series: Materials Science and Engineering
  • W Widodo + 2 more

Text summarization plays an important role in natural language processing. One type of text summarization is extractive summarization. Research on text summarization in the Indonesian language is still rare and has not been evaluated comprehensively; each study is conducted based only on the subjectivity of the researcher. This paper reviews and evaluates several works on Indonesian text summarization to identify the better method by analysing several aspects. The review also maps Indonesian text summarization evaluation techniques along with their advantages and drawbacks. The research aims to provide a comprehensive review of text summarization in the Indonesian language. The result of the study is a comparative review of several works, showing detailed aspects of each summarization method.

  • Conference Article
  • 10.1145/3416797.3416822
Generating of SIBI Animated Gestures from Indonesian Text
  • Jul 19, 2020
  • E Rakun + 1 more

Sign System for Bahasa Indonesia (SIBI) is the official sign language authorised by the Ministry of Education and Culture of Indonesia and is used as a communication medium in the learning process by Schools for Children with Special Needs (SLB) for people with hearing impairments. People who lack knowledge of SIBI gestures will have difficulty communicating with people with hearing impairments. Thus, a translation system from SIBI gestures to sentences in Bahasa Indonesia, and vice versa, is needed. This research is the initial stage of a translation system from sentences in Bahasa Indonesia to SIBI gestures. Its focus is to generate sign gestures as 3D animation from a sentence given as text input, deployed on a smartphone. The generation process starts by deconstructing the input sentence into its word components using a look-up table that consists of affixes, root words, and a “slang” dictionary. These word components are then mapped to their corresponding gesture animations. The gesture data are recorded with the “Perception Neuron v2” motion-capture sensor, using the official SIBI Dictionary as the reference. The animation is rendered in the Unity engine, and to create a smooth transition between one gesture animation and the next, a LERP (linear interpolation) based method is implemented using the Animancer API. Based on the evaluation results, the produced gestures correctly represent smooth SIBI gestures, with a highest accuracy score of 97.56% and user satisfaction ratings of Very Satisfied 84%, Satisfied 14%, and Fair 2%.

  • Conference Article
  • Cite Count: 5
  • 10.1109/icodse.2014.7062491
Correlation analysis of user influence and sentiment on Twitter data
  • Nov 1, 2014
  • Fadhli Mubarak Bin Naina Hanif + 1 more

The Twitter microblogging service is widely used because it meets the need for communication that is faster or cheaper than blogs, email, instant messaging, or the web. The number of Twitter users has grown very rapidly in recent years. This creates a need to make use of Twitter, either in the promotion of a product or in the introduction of self-governance necessary for future leaders. This research calculates popularity by calculating the values of user influence and sentiment. Research on sentiment in Indonesian text has focused only on sentiment classification; there has been no research on scoring or calculating a sentiment value. Calculating a sentiment value is needed to determine how good or bad someone's assessment of a product or a person is. Popularity analysis using Bayesian probability measures the value of influence. The sentiment measurement consists of 3 main parts: the values of verbs, adjectives, and adverbs in the Indonesian language. This research analyzes a person's influence and sentiment values using the Pearson correlation method. The negative correlation for presidential candidates is higher than the positive correlation. A low sentiment value has a greater impact on increasing the influence value, or vice versa. The accuracy of sentiment on Bahasa Indonesia text is 73%; it can be increased by improving the preprocessing for Bahasa Indonesia. This research makes two contributions: calculating the sentiment value for Bahasa Indonesia, and analyzing the relationship patterns between sentiment and influence.

  • Book Chapter
  • Cite Count: 5
  • 10.1007/978-3-662-46742-8_14
The Comparation of Distance-Based Similarity Measure to Detection of Plagiarism in Indonesian Text
  • Jan 1, 2015
  • Tari Mardiana + 2 more

Freely accessible information on the Internet has led to rapid growth in plagiarism through copy-paste-modify practices. Many methods, algorithms, and even software tools have been developed to date to prevent and detect plagiarism, and they can be used broadly without being limited to a certain subject. Research on plagiarism detection in the Indonesian language is developing day by day, although not as significantly as for English. This paper proposes several distance-based similarity measures that can be used to assess similarity in Indonesian text: Dice's similarity coefficient, cosine similarity, and the Jaccard coefficient. They are implemented together with the Rabin-Karp algorithm, which is commonly used to detect plagiarism in the Indonesian language. The analysis technique is fingerprint analysis: a document fingerprint is created according to a predetermined n-gram value, and the similarity value is then computed from the number of fingerprints shared between texts. A small text dataset about Information Systems was tested, divided into four kinds of text document with some modifications. The first document is the original text; the second is 50% original text combined with 50% of another text; the third is 50% original text modified using synonyms and paraphrase; in the fourth, the positions of some passages in the original text are changed. In the experimental results, cosine similarity shows better accuracy than the Dice coefficient and the Jaccard coefficient. This model is expected to serve as an alternative statistical algorithm that uses n-grams, especially for detecting plagiarism in Indonesian text. Keywords: Fingerprint, Indonesian, Plagiarism, Similarity, Text
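The three similarity measures named in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it compares plain character n-grams directly instead of Rabin-Karp rolling hashes, and the function names and the default `n=3` are hypothetical choices.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Return the character n-grams of `text`, lowercased with whitespace removed."""
    s = "".join(text.lower().split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def jaccard(a, b):
    """Jaccard coefficient: shared n-grams over all distinct n-grams."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def dice(a, b):
    """Dice coefficient: twice the shared n-grams over the sum of both set sizes."""
    set_a, set_b = set(a), set(b)
    total = len(set_a) + len(set_b)
    return 2 * len(set_a & set_b) / total if total else 0.0


def cosine(a, b):
    """Cosine similarity between the n-gram count vectors of the two texts."""
    count_a, count_b = Counter(a), Counter(b)
    dot = sum(count_a[g] * count_b[g] for g in count_a)
    norm_a = math.sqrt(sum(v * v for v in count_a.values()))
    norm_b = math.sqrt(sum(v * v for v in count_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    x = ngrams("sistem informasi manajemen")
    y = ngrams("sistem informasi akuntansi")
    print(jaccard(x, y), dice(x, y), cosine(x, y))
```

Jaccard and Dice work on sets and so ignore how often an n-gram repeats, while cosine uses the counts. That difference is one plausible reason the measures can rank document pairs differently.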

  • Research Article
  • 10.23917/khif.v9i2.21495
Combination of Graph-based Approach and Sequential Pattern Mining for Extractive Text Summarization with Indonesian Language
  • Oct 29, 2023
  • Khazanah Informatika : Jurnal Ilmu Komputer dan Informatika
  • Dian Sa'Adillah Maylawati + 2 more

The great challenge in Indonesian automatic text summarization research is producing readable summaries. The quality of text summary can be reached if the meaning of the text can be maintained properly. As a result, the purpose of this study is to improve the quality of extractive Indonesian automatic text summarization by taking into account the quality of structured text representation. This study employs Sequential Pattern Mining (SPM) to generate a sequence of words as a structured representation of text and a graph-based approach to generate automatic text summarization. The SPM algorithm used is PrefixSpan, and the graph-based approach uses the Bellman-Ford algorithm. The results of an experiment using the IndoSum dataset show that combining SPM and Bellman-Ford can improve the precision, recall, and f-measure of ROUGE-1, ROUGE-2, and ROUGE-L. When Bellman-Ford is combined with SPM, the F-measure of ROUGE-1 increases from 0.2299 to 0.3342. The ROUGE-2 f-measure increases from 0.1342 to 0.2191, and the ROUGE-L f-measure increases from 0.1904 to 0.2878. This result demonstrates that SPM can improve the performance of the Bellman-Ford algorithm in producing Indonesian text summaries.

  • Research Article
  • Cite Count: 18
  • 10.12928/telkomnika.v17i1.10183
Comparison of stemming algorithms on Indonesian text processing
  • Feb 1, 2019
  • TELKOMNIKA (Telecommunication Computing Electronics and Control)
  • Afian Syafaadi Rizki + 2 more

Stemming is one of the stages performed in the process of extracting information from text; it converts words into their roots. There is an indication that the most accurate stemmer algorithm is not the only way to achieve the best performance in information retrieval (IR). In this study, seven Indonesian stemmer algorithms and an English stemmer algorithm are compared: Nazief, Arifin, Fadillah, Asian, Enhanced Confix Stripping (ECS), Arifiyanti, and Porter. The data are 2,734 tweets collected from the official Twitter account of PLN. The first aim is to analyze the correlation between stemmer accuracy and information retrieval performance for Indonesian text. The second is to identify the best algorithm for Indonesian text processing. This research also proposes an improved algorithm for stemming Indonesian text. The results show that the correlation found in previous research does not hold for the Indonesian language. They also show that the proposed algorithm was the best for Indonesian text processing, with a weighted score of 0.648.

  • Research Article
  • 10.1088/1742-6596/978/1/012030
An analysis of absorbing image on the Indonesian text by using color matching
  • Mar 1, 2018
  • Journal of Physics: Conference Series
  • G A Hutagalung + 5 more

A message is inserted into an image by embedding it character by character in some of the pixels. One way to do this is to insert the ASCII decimal value of a character into the decimal value of a primary color of the image. Messages use letters, numbers, or symbols, and the number and frequency of letters differ from word to word, from message to message, and from language to language. In the Indonesian language, the letter A is the most widely used, and the use of other letters greatly affects the clarity of a message or text presented in the language. This study aims to determine the capacity of an image to absorb a message in the Indonesian language and what factors affect the differences. The data consist of several images in JPG or JPEG format, which can be obtained from image drawing software or image-making hardware, at different image sizes. Testing results were obtained on four sample color images using an image size of 1200 × 1920.

  • Conference Article
  • Cite Count: 1
  • 10.1109/ismsc.2015.7594040
A language identifier for Indonesian and Malay text document
  • May 1, 2015
  • Zul Indra + 3 more

The number of online text documents on the Internet is growing enormously. We can easily find documents written in languages from all over the world with just a single click. The increasing number of online text documents increases the availability of information on the Internet. However, no one in the world can understand all the languages of these digital documents. Hence, there is a significant need for a language identifier to help users understand the information. Up to now, language identification has focused more on European languages and is still limited for Asian languages, while language identification for similar languages among popular languages has attracted the attention of many researchers. In this research, a new language identifier for languages with similar topology, Malay and Indonesian, is proposed. The algorithm is tested on a set of Indonesian and Malay text documents to support the limited research on language identification for Asian languages. An experiment on 100 Indonesian and Malay text documents produced satisfactorily accurate results.

  • Research Article
  • 10.18510/hssr.2020.83102
THE EXISTENCE OF ARABIC LANGUAGE IN INDONESIAN SOURCE TEXT AND ENGLISH TARGET TEXT
  • Jun 19, 2020
  • Humanities & Social Sciences Reviews
  • Erlina Zulkifli Mahmud + 2 more

Purpose: To study the existence of the Arabic language in the Indonesian language, mostly limited to terms used in the Islamic religion.
 Methodology: This article discusses the existence of Arabic literature in the Indonesian source text, a novel set in a pesantren, where the author of the source text needs to translate the Arabic expressions used in the story into Indonesian. The novel is then translated from the Indonesian source text into English. The method used in this research is the descriptive comparative method, and the leading theory is the strategies of translation by Vinay and Darbelnet (1995). The objectives of this research are to identify which Arabic linguistic units are involved in the Indonesian source text and which strategy of conversion is used by the author and the translator.
 Principal Findings: The results show that the Arabic linguistic units found range from a word to a clause or sentence, and that the translation strategies used in the target text do not always involve a single procedure; sometimes they involve a combination of procedures.
 Applications of this study: Translation work may reveal similar as well as contrastive linguistic phenomena. People can learn more about the languages involved in a translation, particularly when the structures of the source and target languages are compared linguistically.
 Novelty/Originality of this study: This study covers the gap left in the previous research carried out by the same team, entitled “Translation Equivalences of Islamic Terms in the Novel (The Land of Five Towers ‘Negeri Lima Menara’)”. That research used the same data source, the Arabic expressions in the novel, but focused on the Arabic expressions relating to Islamic terms, such as the names of the five obligatory prayers, the names of optional prayers, and activities in shalat, or praying. The remaining Arabic phrases, which were not used in the previous research, had been left unstudied.

  • Research Article
  • 10.51708/apptrans.v13n2.522
Problem towards local language translation in artificial neural networks
  • Jan 1, 2019
  • Applied Translation
  • Wuyi Len

Regional languages are the languages used to communicate in certain areas. Many factors have weakened the current generation's awareness of preserving local languages. One obstacle is the lack of means to access information in the regional languages themselves. This study designs a system for translating an image containing Indonesian text into text in regional languages. The research starts with a pre-processing stage; characters in the image are segmented using Connected Component Analysis labeling, the character images are extracted, and each character image is then classified using an Artificial Neural Network. The next step combines the characters into text. The translation process then uses the Levenshtein algorithm to match the text classification results with regional languages. This research is expected to enable the translation of Indonesian text images into regional-language text, helping to preserve regional languages in Indonesia.

  • Research Article
  • 10.15575/join.v10i1.1506
Performance of Machine Learning Algorithms on Automatic Summarization of Indonesian Language Texts
  • May 13, 2025
  • Jurnal Online Informatika
  • Galih Wiratmoko + 2 more

Automatic text summarization (ATS) has become an essential task for processing huge amounts of information efficiently. ATS has been extensively studied in resource-rich languages like English, but research on summarization for under-resourced languages, such as Bahasa Indonesia, is still limited. Indonesian presents unique linguistic challenges, including its agglutinative structure, borrowed vocabulary, and limited availability of high-quality training data. This study conducts a comparative evaluation of extractive, abstractive, and hybrid models for Indonesian text summarization, utilizing the IndoSum dataset which contains 20,000 text-summary pairs. We tested several models including LSA (Latent Semantic Analysis), LexRank, T5, and BART, to assess their effectiveness in generating summaries. The results show that the LexRank+BERT hybrid model outperforms traditional extractive methods, achieving better ROUGE precision, recall, and F-measure scores. Among the abstractive methods, the T5-Large model demonstrated the best performance, producing more coherent and semantically rich summaries compared to other models. These findings suggest that hybrid and abstractive approaches are better suited for Indonesian text summarization, especially when leveraging large-scale pre-trained language models.
