Sentence Alignment Using Hybrid Model

M.A Fattah,Fuji Ren Fuji Ren,S Kuroiwa

doi:10.1109/nlpke.2005.1598768

Abstract

Parallel corpora have become an essential resource for work in multilingual natural language processing. However, sentence aligned parallel corpora are more efficient than non-aligned parallel corpora for cross language information retrieval and machine translation applications. In this paper, we present a new approach to aligning sentences in bilingual parallel corpora based on the text character length between successive punctuates. A probabilistic score is assigned to each proposed correspondence of texts, based on the scaled difference of lengths of the two texts (in characters) and the variance of this difference. Using this score, the time required for punctuates matching decreased and the sentence alignment precision increased. Using this new approach, we could achieve 21.8% improvement over length based approach when applied on English-Arabic parallel documents.

Full Text