Abstract

Parallel text corpora are essential resources, especially in translation and multilingual information retrieval. However, publicly available parallel text corpora are limited to certain types and domains. Besides, Malay dialects are not standardized in terms of writing, and the existing alignment algorithms used to analyze such writing require large amounts of training data to obtain good results. This paper first describes our methodology for acquiring a parallel text corpus of Standard Malay (SM) and Malay dialects, particularly Kelantan Malay and Sarawak Malay. Second, we propose a hybrid of distance-based and statistical alignment algorithms to align words and phrases of the parallel text. The proposed approach has better precision and recall than the state-of-the-art GIZA++. The alignments obtained were also compared to identify the lexical similarities and differences between SM and the two dialects.

Highlights

  • “Dialect” according to the Oxford dictionary is “a particular form of a language which is peculiar to a specific region or social group.” Dialectology compares and describes various dialects, or sub-languages, of a common language, which are used in different areas of a region. Dialectometry, a sub-component of dialectology, is “the measurement of dialect differences, i.e. linguistic differences whose distribution is determined primarily by geography”. Many studies in dialect look at the phonological and phonetic differences between dialects.

  • We describe our work in collecting a parallel text corpus of SM and Malay dialects

  • We propose a phrase-based alignment algorithm that uses Levenshtein distance and a statistical technique for aligning words in dialects
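
To make the distance-based part of such an approach concrete, the following is a minimal sketch of Levenshtein distance used to shortlist Standard Malay candidates for a dialect word. It is not the authors' algorithm: the statistical component is omitted, and the function names, the length normalization, the 0.5 threshold, and the toy word list are all assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (an assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def closest_candidates(dialect_word, standard_words, max_distance=0.5):
    """Return (distance, word) pairs within the assumed threshold, best first."""
    scored = [(normalized_distance(dialect_word, w), w) for w in standard_words]
    return sorted((d, w) for d, w in scored if d <= max_distance)


# Toy example; the spellings are illustrative and not taken from the corpus.
print(closest_candidates("make", ["makan", "minum", "tidur"]))
```

In this toy case only "makan" falls within the threshold. A purely distance-based shortlist like this cannot handle dialect words that are lexically unrelated to their Standard Malay equivalents, which is presumably where a statistical component would be needed.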

Summary

Introduction

“Dialect” according to the Oxford dictionary is “a particular form of a language which is peculiar to a specific region or social group.” Dialectology compares and describes various dialects, or sub-languages, of a common language, which are used in different areas of a region. Dialectometry, a sub-component of dialectology, is “the measurement of dialect differences, i.e. linguistic differences whose distribution is determined primarily by geography”. Many studies in dialect look at the phonological and phonetic differences between dialects. A more focused study of Dutch dialect variation proposes a model based on articulography that measures the position of the tongue and lips during speech (Wieling et al., 2016). The study of lexical differences is interesting because native speakers communicate through writing, besides speech, often in social media such as blogs and forums.

Methods
Parallel corpus acquisition
Data alignment
Building Malay Dialect Parallel Text Corpus
Transcribing and translating dialect dialogues
Aligning transcribed dialect words and phrases
Evaluation and Analysis of the Dialect Alignment Algorithm
Malay Dialect Lexical Analysis
KD lexical analysis
SD spelling analysis
Conclusions and Future Work
