Kelantan and Sarawak Malay Dialects: Parallel Dialect Text Collection and Alignment Using Hybrid Distance-Statistical-Based Phrase Alignment Algorithm

Khaw, Jasmina Yen Min et al.

doi:10.17762/turcomat.v12i3.1160

Abstract

Parallel text corpora are essential resources, especially in translation and multilingual information retrieval. However, publicly available parallel text corpora are limited to certain types and domains. Moreover, Malay dialects are not standardized in terms of writing, and the existing alignment algorithms used to analyze such writing require a large amount of training data to obtain good results. First, this paper describes our methodology for acquiring a parallel text corpus of Standard Malay (SM) and Malay dialects, particularly Kelantan Malay and Sarawak Malay. Second, we propose a hybrid of distance-based and statistical-based alignment algorithms to align words and phrases of the parallel text. The proposed approach achieves better precision and recall than the state-of-the-art GIZA++. The alignments obtained were also compared to identify the lexical similarities and differences between SM and the two dialects.
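The abstract compares alignment quality in terms of precision and recall. The source does not say how the authors computed these figures, so the following is only a minimal sketch of the usual link-level definitions, assuming each alignment is a set of (source index, target index) pairs. The function name `precision_recall` and the sample links are hypothetical and are not taken from the paper.

```python
def precision_recall(predicted, gold):
    """Link-level precision and recall of a predicted alignment.

    Both arguments are iterables of (source_index, target_index) pairs.
    This representation is an assumption, not the paper's stated format.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # links found in both alignments
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical links for one four-word sentence pair.
predicted_links = {(0, 0), (1, 1), (2, 3)}
gold_links = {(0, 0), (1, 1), (2, 2), (3, 3)}
precision, recall = precision_recall(predicted_links, gold_links)
```

Here two of the three predicted links are correct and two of the four gold links are recovered, so precision is 2/3 and recall is 1/2.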

Highlights

“Dialect”, according to the Oxford dictionary, is “a particular form of a language which is peculiar to a specific region or social group.” Dialectology compares and describes various dialects, or sub-languages, of a common language, which are used in different areas of a region. Dialectometry, a sub-component of dialectology, is “the measurement of dialect differences, i.e. linguistic differences whose distribution is determined primarily by geography”. Many studies of dialects look at the phonological and phonetic differences between dialects.
We describe our work in collecting a parallel text corpus of Standard Malay (SM) and Malay dialects.
We propose a phrase-based alignment algorithm that uses Levenshtein distance and a statistical technique for aligning words in dialects.
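The distance-based component mentioned above relies on Levenshtein (edit) distance. As a minimal sketch of that measure only, and not the authors' hybrid algorithm, the code below computes the edit distance between two spellings, normalizes it to a similarity score, and picks the closest candidate. The helper names `similarity` and `closest`, the normalization by the longer word's length, and the sample words are illustrative assumptions that do not come from the paper's corpus.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; normalizing by the longer word is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def closest(word: str, candidates: list[str]) -> str:
    """Return the candidate spelling most similar to the given word."""
    return max(candidates, key=lambda candidate: similarity(word, candidate))


# Illustrative spellings only: a dialect-style form against standard candidates.
best = closest("make", ["makan", "minum", "tidur"])
```

In this example "make" is two edits away from "makan" (one substitution and one insertion), which gives a similarity of 0.6 and makes "makan" the closest candidate. A purely distance-based match like this cannot pair words that differ lexically rather than in spelling, which is presumably where the statistical component of the proposed hybrid comes in.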

Summary

Introduction

“Dialect”, according to the Oxford dictionary, is “a particular form of a language which is peculiar to a specific region or social group.” Dialectology compares and describes various dialects, or sub-languages, of a common language, which are used in different areas of a region. Dialectometry, a sub-component of dialectology, is “the measurement of dialect differences, i.e. linguistic differences whose distribution is determined primarily by geography”. Many studies of dialects look at the phonological and phonetic differences between dialects. A more focused study of Dutch dialect variation proposes a model based on articulography, which measures the position of the tongue and lips during speech (Wieling et al., 2016). The study of lexical differences is interesting because native speakers communicate through writing as well as speech, often on social media such as blogs and forums.

Methods

Parallel corpus acquisition

Data alignment

Building Malay Dialect Parallel Text Corpus

Transcribing and translating dialect dialogues

Aligning transcribed dialect words and phrases

Evaluation and Analysis of the Dialect Alignment Algorithm

Malay Dialect Lexical Analysis

KD lexical analysis

SD spelling analysis

Conclusions and Future Work

Journal: Turkish Journal of Computer and Mathematics Education (TURCOMAT)	Publication Date: Apr 10, 2021
Citations: 1	License type: cc-by
