Experiments on Cross-Language Information Retrieval Using Comparable Corpora of Chinese, Japanese, and Korean Languages

Kazuaki Kishida, Kuang-Hua Chen

doi:10.1007/978-981-15-5554-1_2

Open Access

https://doi.org/10.1007/978-981-15-5554-1_2

Publication Date: Sep 2, 2020
Citations: 1	License type: CC BY 4.0

Affiliation: Keio University, National Taiwan University

Abstract

This paper describes research activities for exploring techniques of cross-language information retrieval (CLIR) during the NACSIS Test Collection for Information Retrieval/NII Testbeds and Community for Information access Research (NTCIR)-1 to NTCIR-6 evaluation cycles, which mainly focused on Chinese, Japanese, and Korean (CJK) languages. First, general procedures and techniques of CLIR are briefly reviewed. Second, document collections that were used for the research tasks and test collection construction for retrieval experiments are explained. Specifically, CLIR tasks from NTCIR-3 to NTCIR-6 utilized multilingual corpora consisting of newspaper articles that were published in Taiwan, Japan, and Korea during the same time periods. A set of articles can be considered a “pseudo” comparable corpus because many events or affairs are commonly covered across languages in the articles. Such comparable corpora are helpful for comparing the performance of CLIR between pairs of CJK and English. This comparison leads to deeper insights into CLIR techniques. NTCIR CLIR tasks have been built on the basis of test collections that incorporate such comparable corpora. We summarize the technical advances observed in these CLIR tasks at the end of the paper.

Highlights

A “comparable corpus” can be defined as multiple sets of documents, each in different languages, which approximately describe the same things or events
Relevance judgments are made only for pooled documents extracted from the search results that participants submitted, not for the entire set of documents; this procedure for creating the answer set is termed the pooling method, and it is an efficient means of constructing a large-scale test collection
Research activity exploring cross-lingual ad hoc IR of newspaper articles in the NTCIR project ended with the cross-language information retrieval (CLIR) task in NTCIR-6, for which the conference was held in May 2007
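
The pooling method mentioned in the highlights can be illustrated with a minimal sketch. It assumes the common form in which the pool is the union of the top-ranked documents from each submitted run. The function name `build_pool`, the `depth` parameter, and the run and document identifiers are hypothetical, chosen only for illustration; the source does not specify the pool depth or any implementation details used in NTCIR.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int) -> Set[str]:
    """Return the union of the top-`depth` documents from each submitted run.

    `runs` maps a (hypothetical) run name to its ranked list of document IDs.
    Only documents in the returned pool would be sent to assessors for
    relevance judgment; documents outside the pool remain unjudged.
    """
    pool: Set[str] = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


# Toy data (assumed, not from the paper): three runs for a single topic.
runs = {
    "run_A": ["d3", "d1", "d7", "d2"],
    "run_B": ["d1", "d5", "d3", "d9"],
    "run_C": ["d8", "d1", "d4", "d6"],
}

pool = build_pool(runs, depth=2)
print(sorted(pool))  # ['d1', 'd3', 'd5', 'd8']
```

With a depth of 2, only four of the nine distinct documents need to be judged in this toy example, which is the source of the efficiency the highlight describes.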

Summary

Introduction

A “comparable corpus” can be defined as multiple sets of documents, each in different languages, which approximately describe the same things or events. Explicit alignments of words, sentences, paragraphs, or documents are not necessarily contained in the comparable corpus. In this sense, the pairs of scientific abstracts written in Japanese and English that were used as test documents for retrieval experiments during the first and second NACSIS Test Collection for Information Retrieval/NII Testbeds and Community for Information access Research (NTCIR) evaluation cycles (i.e., NTCIR-1 and -2) can be considered document-linked comparable corpora.

Methods

Results

Conclusion