Machine translation is the automatic translation of text from one language to another. This research develops a Statistical Machine Translation (SMT) system from Indonesian to Bengkulu Malay and evaluates the quality of its output. The training and testing data form a parallel corpus of 5261 sentence pairs, drawn from Bengkulu Malay dictionaries and from online resources for the Indonesian corpora. The SMT pipeline has four phases. Preprocessing prepares the parallel corpus. Training then uses the parallel corpus to build the language and translation models. Testing follows, and evaluation comes last. The results show that several factors influence the quality of SMT output, the most important being the quantity and quality of the parallel corpus on which the translation and language models are built. The output is evaluated automatically with the Bilingual Evaluation Understudy (BLEU) metric: the scores obtained when using 500, 1500, 2500, 4000, and 5261 sentences are 80.56%, 90.86%, 92.50%, 92.91%, and 94.48%, respectively.
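For readers unfamiliar with BLEU, the following is a minimal sketch of how a corpus-level score can be computed. It assumes whitespace tokenization, one reference per hypothesis, n-grams up to length 4, and no smoothing; the abstract does not state which BLEU implementation or settings were used, so the function name `corpus_bleu` and the placeholder sentences are illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU in [0, 1], one reference per hypothesis (assumption)."""
    matched = [0] * max_n  # clipped n-gram matches, per n-gram order
    total = [0] * max_n    # n-grams produced by the system, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any n-gram order with zero matches gives a score of 0.
    if hyp_len == 0 or min(matched) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # The brevity penalty discourages output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


# Placeholder tokens, not taken from the corpus described above.
score = corpus_bleu(["a b c d x"], ["a b c d e"])
print(f"BLEU = {100 * score:.2f}%")
```

Multiplying the result by 100 gives a percentage of the kind reported above. Production toolkits usually add smoothing and standardized tokenization, which this sketch omits.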