Abstract

This paper reports on the efforts of twelve national teams in building the International Comparable Corpus (ICC; https://korpus.cz/icc), which will contain highly comparable datasets of spoken, written and electronic registers. The languages currently covered are Czech, Finnish, French, German, Irish, Italian, Norwegian, Polish, Slovak, Swedish and, more recently, Chinese, as well as English, which is considered to be the pivot language. The goal of the project is to provide much-needed data for contrastive corpus-based linguistics. The ICC is committed to the idea of re-using existing multilingual resources as much as possible, and its design is modelled, with various adjustments, on the International Corpus of English (ICE). As such, the ICC will contain approximately the same balance of 40 percent written language and 60 percent spoken language, distributed across 27 different text types and contexts. A number of issues encountered by the project teams are discussed, ranging from copyright and data sustainability to technical advances in data distribution.

Highlights

  • While corpus-based contrastive studies largely rely on translation corpora, they increasingly draw on comparable data

  • Unlike extensive comparable corpora mined from the web, which are used in natural language processing for the development of machine translation and crosslingual information retrieval systems (Sharoff et al. 2013), the ultimate goal of the International Comparable Corpus (ICC), a collaborative project of currently twelve

  • The ICC starts with the idea of linguistic data reusability, and contributes to a discussion of data sustainability, on the one hand, and the current lack of comparable datasets for contrastive studies, on the other

Summary

INTRODUCTION

While corpus-based contrastive studies largely rely on translation (parallel) corpora, they increasingly draw on comparable data (see, e.g., Mauranen 1998; Aijmer and Altenberg 2013). The ICC starts with the idea of linguistic data reusability, and contributes to a discussion of data sustainability, on the one hand, and the current lack of comparable datasets for contrastive studies, on the other. A substantial proportion of current work in contrastive studies is based on comparisons of pairs of languages, very often with English as one of the pair. There is no doubt that one of the contributing factors to this two-language, English-centered research is a lack of suitable linguistic resources. The aim of the ICC is to provide a highly comparable, multilingual dataset of both spoken and written language to support contrastive and cross-linguistic research. Possibilities and problems concerning the ICC data release, as well as the dissemination of the corpus to the wider research community

DESIGNING THE ICC
COMPILING THE ICC
The ICC written component
The ICC spoken component
MAKING THE ICC AVAILABLE
Findings
CONCLUSIONS AND FUTURE WORK