Abstract

Cross-lingual semantic interoperability has drawn significant research attention recently, as the number of digital libraries in non-English languages has grown exponentially. Cross-lingual information retrieval (CLIR) across European languages, such as English, Spanish and French, has been widely explored, but CLIR across European and Oriental languages is still at an early stage. To cross the language boundary, a corpus-based approach shows promise for overcoming the limitations of knowledge-based and controlled-vocabulary approaches. However, collecting parallel corpora between European and Oriental languages is not an easy task. Length-based and text-based methods are the two major approaches to aligning parallel documents. In this paper, we investigate several techniques using these approaches and compare their performance in aligning the English and Chinese titles of parallel documents available on the Web.

Keywords

Machine Translation, Chinese Character, Computational Linguistics, Longest Common Subsequence, Parallel Corpus
