Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora

Chenhui Chu, Sadao Kurohashi, Toshiaki Nakazawa

doi:10.1145/2833089

Abstract

Parallel corpora are crucial for statistical machine translation (SMT); however, they are quite scarce for most language pairs and domains. As comparable corpora are far more available, many studies have been conducted to extract either parallel sentences or fragments from them for SMT. In this article, we propose an integrated system to extract both parallel sentences and fragments from comparable corpora. We first apply parallel sentence extraction to identify parallel sentences from comparable sentences. We then extract parallel fragments from the comparable sentences. Parallel sentence extraction is based on a parallel sentence candidate filter and a classifier for parallel sentence identification. We improve it by proposing a novel filtering strategy and three novel feature sets for classification. Previous studies have found it difficult to accurately extract parallel fragments from comparable sentences. We propose an accurate parallel fragment extraction method that uses an alignment model to locate the parallel fragment candidates and an accurate lexicon-based filter to identify the truly parallel fragments. A case study on the Chinese-Japanese Wikipedia indicates that our proposed methods outperform previously proposed methods, and the parallel data extracted by our system significantly improves SMT performance.
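
The two-stage idea in the abstract (filter candidate sentence pairs, then keep only fragments supported by a lexicon) can be illustrated with a minimal Python sketch. This is not the authors' system: the alignment model and the classifier are replaced here by a plain bilingual-lexicon lookup, and the function names, the toy lexicon, and the thresholds (`threshold`, `min_len`) are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon: source word -> set of possible target translations.
Lexicon = Dict[str, Set[str]]
SentencePair = Tuple[List[str], List[str]]


def overlap_ratio(src: List[str], tgt: List[str], lexicon: Lexicon) -> float:
    """Fraction of source tokens with at least one lexicon translation in tgt."""
    if not src:
        return 0.0
    tgt_set = set(tgt)
    covered = sum(1 for w in src if lexicon.get(w, set()) & tgt_set)
    return covered / len(src)


def filter_candidates(pairs: List[SentencePair], lexicon: Lexicon,
                      threshold: float = 0.5) -> List[SentencePair]:
    """Keep sentence pairs whose lexicon overlap reaches the (assumed) threshold."""
    return [(s, t) for s, t in pairs if overlap_ratio(s, t, lexicon) >= threshold]


def extract_fragments(src: List[str], tgt: List[str], lexicon: Lexicon,
                      min_len: int = 2) -> List[List[str]]:
    """Return maximal runs of consecutive source tokens that the lexicon
    links to some token in tgt; runs shorter than min_len are dropped."""
    tgt_set = set(tgt)
    fragments: List[List[str]] = []
    run: List[str] = []
    for w in src:
        if lexicon.get(w, set()) & tgt_set:
            run.append(w)
        else:
            if len(run) >= min_len:
                fragments.append(run)
            run = []
    if len(run) >= min_len:
        fragments.append(run)
    return fragments


# Toy data (English source, romanized Japanese target), for illustration only.
lexicon: Lexicon = {"black": {"kuroi"}, "cat": {"neko"}, "sleeps": {"nemuru"}}
src = ["the", "black", "cat", "sleeps", "today"]
tgt = ["kuroi", "neko", "ga", "nemuru"]

candidates = filter_candidates([(src, tgt), (["hello"], ["neko"])], lexicon)
fragments = extract_fragments(src, tgt, lexicon)
```

On the toy pair, three of the five source tokens are covered by the lexicon, so the pair passes the filter and the covered run `black cat sleeps` is kept as a fragment. A real system would locate fragment candidates with a trained alignment model, as the abstract describes, and would use the lexicon only to verify them.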

Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Publication Date: Dec 11, 2015
Citations: 15
License type: other-oa

Similar Papers

Parallel Sentence Extraction Based on Unsupervised Bilingual Lexicon Extraction from Comparable Corpora
Chenhui Chu ... Sadao Kurohashi
Journal of Natural Language Processing | VOL. 22
01 Jan 2015

Parallel fragments: Measuring their impact on translation performance
Sadaf Abdul-Rauf ... Mohammad Nawaz
Computer Speech & Language | VOL. 43
21 Dec 2016

An Efficient Framework to Extract Parallel Units from Comparable Data
Lu Xiang ... Chengqing Zong
01 Jan 2013

A Systematic Literature Review on Extraction of Parallel Corpora from Comparable Corpora
Dilshad Kaur ... Satwinder Singh
Journal of Computer Science | VOL. 17
01 Oct 2021
