Abstract

Parallel text corpora are essential resources in linguistics and natural language processing, especially in translation and multilingual information retrieval. Publicly available parallel text corpora are limited to certain genres, types and domains. Furthermore, parallel dialect text is scarce, even though it is important for the analysis and study of a dialect. Collecting parallel dialect text is challenging because dialects typically appear in spoken form and very few dialectal texts exist. Moreover, most dialects have no standard orthography. The contributions of this paper are threefold. First, the paper describes a methodology for acquiring a parallel text corpus of Standard Malay and Malay dialects, particularly Kelantan Malay and Sarawak Malay. Second, we propose a hybrid distance-based and statistical-based alignment algorithm to align words and phrases in the parallel text. The results show that the precision and recall of the proposed alignment algorithm exceed 95% and are better than those of the state-of-the-art GIZA++. Third, the alignments obtained were compared to identify the lexical similarities and differences between Standard Malay and the two studied Malay dialects, contributing valuable insights into the linguistic variations within the Malay language family.

