Abstract

A parallel corpus is an essential resource in Natural Language Processing (NLP) research, especially for machine translation. This paper presents the construction of a bilingual parallel corpus for the Thai language and the Isarn dialect. The process comprises word segmentation, translation and word alignment, part-of-speech (POS) tagging, and the design and construction of the corpus itself. Source sentences in Thai are first segmented into sequences of words using a Conditional Random Field (CRF) approach. An example- and rule-based Thai-Isarn machine translation system is then used to generate the corresponding target sentences in the Isarn dialect. The POS of each word is tagged using a Hidden Markov Model (HMM). The source and target sentences, together with their POS tags, are validated by native Isarn speakers who are experts in both the Thai language and the Isarn dialect. Finally, the validated data are collected into the Thai-Isarn parallel corpus.
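To make the HMM tagging step concrete, the following is a minimal sketch of Viterbi decoding, the standard way to find the most probable tag sequence under an HMM. It is not the authors' implementation. The function name `viterbi`, the two-tag tag set, the English toy words, and all probabilities are hypothetical values chosen for illustration only; none of them come from the Thai-Isarn corpus.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, unk=1e-6):
    """Return the most probable tag sequence for `words` under an HMM.

    `unk` is an assumed small emission probability for unseen (tag, word) pairs.
    """
    if not words:
        return []

    # best[i][t] = (log-probability of the best path ending in tag t at
    #               position i, previous tag on that path)
    best = [{}]
    for t in tags:
        emit = emit_p[t].get(words[0], unk)
        best[0][t] = (math.log(start_p[t]) + math.log(emit), None)

    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            emit = math.log(emit_p[t].get(words[i], unk))
            score, prev = max(
                (best[i - 1][p][0] + math.log(trans_p[p][t]) + emit, p)
                for p in tags
            )
            best[i][t] = (score, prev)

    # Backtrace from the best final tag to recover the full path.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


# Hypothetical toy model (illustrative numbers, not estimated from any corpus).
TAGS = ["NOUN", "VERB"]
START_P = {"NOUN": 0.7, "VERB": 0.3}
TRANS_P = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.8, "VERB": 0.2},
}
EMIT_P = {
    "NOUN": {"dog": 0.5, "fish": 0.4, "eat": 0.1},
    "VERB": {"eat": 0.8, "fish": 0.2},
}

if __name__ == "__main__":
    print(viterbi(["dog", "eat", "fish"], TAGS, START_P, TRANS_P, EMIT_P))
```

In a real pipeline the start, transition, and emission probabilities would be estimated from a POS-annotated corpus, and the input would be the word sequence produced by the segmentation step.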
