Building Parallel Corpora by Automatic Title Alignment

Christopher C Yang, Kar Wing Li

doi:10.1007/3-540-36227-4_38

Abstract

Cross-lingual semantic interoperability has drawn significant research attention recently, as the number of digital libraries in non-English languages has grown exponentially. Cross-lingual information retrieval (CLIR) across European languages, such as English, Spanish and French, has been widely explored, but CLIR between European and Oriental languages is still at an early stage. To cross the language boundary, a corpus-based approach shows promise of overcoming the limitations of knowledge-based and controlled-vocabulary approaches. However, collecting parallel corpora between European and Oriental languages is not an easy task. Length-based and text-based approaches are the two major approaches to aligning parallel documents. In this paper, we investigate several techniques using these approaches and compare their performance in aligning the English and Chinese titles of parallel documents available on the Web.

Keywords: Machine Translation, Chinese Character, Computational Linguistics, Longest Common Subsequence, Parallel Corpus

These keywords were added by machine and not by the authors.
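As a rough illustration of the two families of approaches named in the abstract, the sketch below scores a pair of token sequences in two ways: a longest-common-subsequence (LCS) score as a text-based cue, and a simple length ratio as a length-based cue. This is a minimal sketch under stated assumptions, not the authors' method: the function names `lcs_length`, `lcs_score` and `length_score` are hypothetical, the normalisations are chosen only for illustration, and the abstract does not say how titles in the two languages are made comparable (for example, through a bilingual dictionary), so the inputs are assumed to be already comparable sequences.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_score(a, b):
    """Text-based cue (assumed normalisation): LCS length over the longer length."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def length_score(a, b):
    """Length-based cue (crude stand-in): shorter length over longer length."""
    longest = max(len(a), len(b))
    return min(len(a), len(b)) / longest if longest else 0.0
```

In such a sketch, a candidate title pair would be ranked by one of the scores, or by a combination of both; how the scores are best combined or thresholded is left open here.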

Similar Papers

Building parallel corpora by automatic title alignment using length-based and text-based approaches
Christopher C Yang ... Kar Wing Li
Information Processing and Management | VOL. 40
29 Jan 2004

Automatic Construction of Cross-Lingual Networks of Concepts from the Hong Kong SAR Police Department
Kar Wing Li ... Christopher C Yang
01 Jan 2003

Automatic crosslingual thesaurus generated from the Hong Kong SAR Police Department Web corpus for crime analysis
Kar Wing Li ... Christopher C Yang
Journal of the American Society for Information Science and Technology | VOL. 56
12 Jan 2005

EUROGENE: Multilingual Retrieval and Machine Translation Applied to Human Genetics
Petr Knoth ... Zdenek Zdrahal
01 Jan 2009
