Abstract

Multi-part words in English are hyphenated: a hyphen separates the parts. The Persian language also contains multi-part words. Persian morphology requires a half-space character to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common misuse of the space character causes serious problems in Persian text processing and harms readability. To address these problems, this work proposes a new model for correcting the spacing of multi-part words. The proposed method is based on the statistical machine translation paradigm, in which text in a source language is translated into a target language using statistical models whose parameters are derived from the analysis of bilingual text corpora. The proposed method applies statistical machine translation techniques by treating unedited multi-part words as the source language and space-edited multi-part words as the target language. The results show that the proposed method corrects the spacing of Persian multi-part words with a statistically significant accuracy rate.
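The translation framing above can be illustrated with a toy parallel pair, where the source sentence contains the spacing error and its target counterpart uses the half-space (the zero-width non-joiner, U+200C). This is a minimal sketch with hypothetical example sentences, not data or code from the paper:

```python
# Toy illustration of the SMT framing: "source language" = unedited text
# (ordinary space between word parts), "target language" = edited text
# (half-space between word parts). The sentences are hypothetical examples.

ZWNJ = "\u200c"  # zero-width non-joiner, the Persian "half-space"

def extract_corrections(src: str, tgt: str):
    """Mine the word-part pairs where a space in src maps to ZWNJ in tgt.

    Assumes src and tgt differ only in space-vs-half-space characters,
    so they have equal length and align character by character.
    """
    assert len(src) == len(tgt)
    pairs = []
    for i, (a, b) in enumerate(zip(src, tgt)):
        if a == " " and b == ZWNJ:
            left = src[:i].split(" ")[-1]      # word part before the space
            right = src[i + 1:].split(" ")[0]  # word part after the space
            pairs.append((left, right))
    return pairs

src = "او بی شمار کتاب دارد"              # unedited: "He has countless books"
tgt = "او بی" + ZWNJ + "شمار کتاب دارد"   # edited: space replaced by half-space
print(extract_corrections(src, tgt))      # [('بی', 'شمار')]
```

Mining such pairs from a sentence-aligned corpus is one simple way to obtain the kind of training material the statistical model learns from.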

Highlights

  • Persian text contains words made of multiple parts; these are called multi-part words

  • One of the most common problems in Persian text is the incorrect use of spaces between the parts of multi-part words, which breaks the integrity of those words and leads to incorrect word-boundary detection; the problem can be solved by replacing the spaces with half-spaces

  • We propose a different statistical approach that uses a fertility-based IBM Model [6] for word alignment, employing a parallel corpus created specifically for editing Persian multi-part words
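
As a simplified illustration of statistical word alignment, the sketch below trains IBM Model 1 by expectation-maximization on a hypothetical toy corpus. The paper's approach uses a fertility-based IBM model, which extends this basic scheme; this is not the paper's implementation:

```python
# Simplified word-alignment illustration: IBM Model 1 trained by EM on a toy
# corpus. The paper uses a fertility-based IBM model, which extends this
# basic scheme; the corpus below is a hypothetical example.
from collections import defaultdict

ZWNJ = "\u200c"  # half-space (zero-width non-joiner)

# Each pair: (source tokens = unedited, target tokens = edited).
corpus = [
    (["بی", "شمار"], ["بی" + ZWNJ + "شمار"]),
    (["هیچ", "گاه"], ["هیچ" + ZWNJ + "گاه"]),
    (["حاصل", "ضرب"], ["حاصل" + ZWNJ + "ضرب"]),
]

# t[(f, e)] approximates P(target word f | source word e); uniform start.
t = defaultdict(lambda: 1.0)

def em_iteration(corpus, t):
    count = defaultdict(float)   # expected co-occurrence counts
    total = defaultdict(float)   # normalizer per source word
    for src, tgt in corpus:
        for f in tgt:
            z = sum(t[(f, e)] for e in src)  # normalize over source words
            for e in src:
                c = t[(f, e)] / z
                count[(f, e)] += c
                total[e] += c
    return defaultdict(float, {(f, e): count[(f, e)] / total[e]
                               for (f, e) in count})

for _ in range(5):
    t = em_iteration(corpus, t)

# Each edited word ends up aligned to its source parts with certainty.
print(t[("بی" + ZWNJ + "شمار", "شمار")])  # 1.0
```

On this tiny corpus each edited word co-occurs only with its own parts, so the alignment probabilities converge immediately; real corpora need the fuller models (distortion, fertility) the highlight refers to.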

Introduction

Persian text contains words made of multiple parts, called multi-part words. One of the most common problems in Persian text is the incorrect use of spaces between the parts of such words, which breaks their integrity and leads to incorrect word-boundary detection; the problem can be solved by replacing the spaces with half-spaces. Persian spacing rules, which specify where a space or a half-space is needed, require half-spaces between the parts of multi-part words. If a space character is used between the parts, the word does not obey the standard word form and each part is incorrectly treated as a separate word, as in "بی شمار" ("countless"), "هیچ گاه" ("never") and "حاصل ضرب" ("product"). The approach finds the stems and affixes of words with a Finite State Automaton (FSA) and tags them.
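The correction step itself can be sketched minimally, assuming a small hand-made lexicon of multi-part words built from the examples above (the paper instead learns the correction statistically from a parallel corpus):

```python
# Minimal sketch of space-to-half-space correction using a toy lexicon.
# This is an illustrative rule-based stand-in, not the paper's SMT method.

ZWNJ = "\u200c"  # zero-width non-joiner, the Persian "half-space"

# Toy lexicon of word-part pairs that must be joined with a half-space,
# taken from the examples in the text.
MULTI_PART = {
    ("بی", "شمار"),    # "countless"
    ("هیچ", "گاه"),    # "never"
    ("حاصل", "ضرب"),   # "product"
}

def fix_spacing(text: str) -> str:
    """Replace the ordinary space with ZWNJ between known word parts."""
    tokens = text.split(" ")
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in MULTI_PART:
            out.append(tokens[i] + ZWNJ + tokens[i + 1])  # join with half-space
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

print(fix_spacing("او بی شمار کتاب دارد"))  # the space in "بی شمار" becomes ZWNJ
```

A lexicon lookup like this cannot disambiguate pairs that are valid both as separate words and as one multi-part word, which is one motivation for a statistical, context-aware model.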
