Abstract

Segmentation is central to statistical machine translation (SMT): it splits a source sentence into a sequence of translatable segments. We propose a maximum-entropy segmentation model that captures desirable phrasal and hierarchical segmentations for SMT. We present an approach that automatically learns the beginning and ending boundaries of cohesive segments from word-aligned bilingual data, without using any additional resources. The learned boundaries are then used to define cohesive segments in both phrasal and hierarchical segmentations. We integrate the segmentation model into phrasal SMT and conduct experiments in the newswire and broadcast news domains to evaluate the effectiveness of the proposed model with large-scale training data. Our experimental results show that the maximum-entropy segmentation model significantly improves translation quality in terms of BLEU. We further show that 1) the proposed segmentation model significantly outperforms the syntactic constraints used in previous work to constrain segmentations; and 2) it is necessary to capture hierarchical segmentations in addition to phrasal segmentations.
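As a rough illustration of how boundary labels might be derived from word-aligned data, the sketch below marks each source word as a possible segment beginning or ending. It rests on assumptions that the abstract does not state: a segment is treated as cohesive when its source span is consistent with the word alignment (the standard phrase-extraction criterion), only multi-word spans contribute labels, and the names `cohesive_spans`, `boundary_labels`, and `max_len` are hypothetical. Labels of this kind could serve as training examples for a maximum-entropy classifier; the classifier itself is not shown.

```python
from typing import List, Set, Tuple

Alignment = Set[Tuple[int, int]]  # (source index, target index) links


def cohesive_spans(src_len: int, alignment: Alignment,
                   max_len: int = 7) -> List[Tuple[int, int]]:
    """Return inclusive source spans (i, j) consistent with the alignment.

    Assumption: "cohesive" means alignment-consistent, i.e. no target word
    inside the span's target range links to a source word outside the span.
    """
    spans = []
    for i in range(src_len):
        for j in range(i, min(src_len, i + max_len)):
            targets = {t for s, t in alignment if i <= s <= j}
            if not targets:
                continue  # a fully unaligned span gives no evidence
            lo, hi = min(targets), max(targets)
            if all(i <= s <= j for s, t in alignment if lo <= t <= hi):
                spans.append((i, j))
    return spans


def boundary_labels(src_len: int, alignment: Alignment,
                    max_len: int = 7) -> Tuple[List[bool], List[bool]]:
    """Label each source word as a beginning and/or ending boundary.

    Assumption: only multi-word cohesive spans produce boundary labels.
    """
    begins = [False] * src_len
    ends = [False] * src_len
    for i, j in cohesive_spans(src_len, alignment, max_len):
        if j > i:
            begins[i] = True
            ends[j] = True
    return begins, ends


# Toy sentence pair: four source words, with the middle two swapped in the target.
toy_alignment = {(0, 0), (1, 2), (2, 1), (3, 3)}
toy_begins, toy_ends = boundary_labels(4, toy_alignment)
```

In the toy example, the spans (0, 1) and (2, 3) are rejected because each would split the swapped pair, so words 0 and 1 are labeled as beginnings and words 2 and 3 as endings.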
