Abstract
Parallel text corpora are essential resources, especially in translation and multilingual information retrieval. However, the publicly available parallel text corpora are limited to certain types and domains. In addition, Malay dialects are not standardized in terms of writing, and existing alignment algorithms require a large amount of training data to produce good results. First, this paper describes our methodology for acquiring a parallel text corpus of Standard Malay (SM) and Malay dialects, particularly Kelantan Malay and Sarawak Malay. Second, we propose a hybrid distance-based and statistical alignment algorithm to align the words and phrases of the parallel text. The proposed approach achieves better precision and recall than the state-of-the-art GIZA++. The alignments obtained were also compared to identify the lexical similarities and differences between SM and the two dialects.
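The idea of combining a distance-based score with a statistical score can be sketched as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the abstract does not specify the statistical component, so a Dice coefficient over sentence-level co-occurrence is used as a stand-in, and the equal weighting, the function names, and the toy tokens are all hypothetical and not drawn from the paper's corpus.

```python
from collections import Counter
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def string_similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cooccurrence_scores(pairs):
    """Dice coefficient from sentence-level co-occurrence counts.

    Assumption: this stands in for the unspecified statistical component.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_set, tgt_set = set(src), set(tgt)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        joint.update(product(src_set, tgt_set))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }


def align(src, tgt, dice, weight=0.5):
    """Link each source word to the target word with the best mixed score.

    `weight` (hypothetical) balances string similarity against co-occurrence.
    """
    links = []
    for s in src:
        best = max(
            tgt,
            key=lambda t: weight * string_similarity(s, t)
            + (1 - weight) * dice.get((s, t), 0.0),
        )
        links.append((s, best))
    return links


# Toy sentence pairs, invented for illustration only.
toy_corpus = [
    (["saya", "makan", "ikan"], ["kawe", "make", "ike"]),
    (["saya", "makan", "nasi"], ["kawe", "make", "nasi"]),
]
dice = cooccurrence_scores(toy_corpus)
links = align(toy_corpus[0][0], toy_corpus[0][1], dice)
print(links)
```

In this sketch the string-similarity term helps with cognate pairs that differ by a few characters, while the co-occurrence term is the only signal for pairs with unrelated spellings. With so little data, such pairs can tie and are then resolved arbitrarily by order, which is consistent with the abstract's remark that purely statistical aligners need large training sets.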
Highlights
“Dialect”, according to the Oxford dictionary, is “a particular form of a language which is peculiar to a specific region or social group.” Dialectology compares and describes various dialects, or sub-languages, of a common language, which are used in different areas of a region. Dialectometry, a sub-component of dialectology, is “the measurement of dialect differences, i.e. linguistic differences whose distribution is determined primarily by geography”. Many studies in dialect look at the phonological and phonetic differences between dialects
We describe our work in collecting a parallel text corpus of SM and Malay dialects
We propose a phrase-based alignment algorithm that combines Levenshtein distance with a statistical technique for aligning words in dialects
Summary
“Dialect”, according to the Oxford dictionary, is “a particular form of a language which is peculiar to a specific region or social group.” Dialectology compares and describes various dialects, or sub-languages, of a common language, which are used in different areas of a region. Dialectometry, a sub-component of dialectology, is “the measurement of dialect differences, i.e. linguistic differences whose distribution is determined primarily by geography”. Many studies in dialect look at the phonological and phonetic differences between dialects. A more focused study of Dutch dialect variation proposes a model based on articulography, which measures the position of the tongue and lips during speech (Wieling et al., 2016). The study of lexical differences is interesting because native speakers communicate through writing as well as speech, often on social media such as blogs and forums.
More From: Turkish Journal of Computer and Mathematics Education (TURCOMAT)