Challenges of combining structured and unstructured data in corpus development

Tanja Säily, Jukka Tyrkkö

doi:10.32714/ricl.09.01.01

Abstract

Recent advances in the availability of ever larger and more varied electronic datasets, both historical and modern, provide unprecedented opportunities for corpus linguistics and the digital humanities. However, combining unstructured text with images, video, and audio, as well as with structured metadata, poses a variety of challenges to corpus compilers. This paper presents an overview of the topic to contextualise this special issue of Research in Corpus Linguistics. The aim of the special issue is to highlight some of the challenges faced and solutions developed in several recent and ongoing corpus projects. Rather than providing overall descriptions of corpora, each contributor discusses specific challenges they faced in the corpus development process; these contributions are summarised in this paper. We hope that the special issue will benefit future corpus projects by providing solutions to common problems and by paving the way for new best practices for the compilation and development of rich-data corpora. We also hope that this collection of articles will help sustain the conversation on the theoretical and methodological challenges of corpus compilation.

Highlights

Recent advances in the availability of ever larger and more varied electronic datasets, both historical and modern, provide unprecedented opportunities for corpus linguistics and the digital humanities.
Small- and medium-sized corpora that match the original definitions of linguistic corpora more closely continue to be used and developed.
As a consequence of technological developments, linguistic corpora comprising ‘rich’ data, that is, text combined with images, video, audio, and structured metadata, have become increasingly realistic to compile, but that does not mean that all the related challenges have been solved.

Summary

Introduction

Recent advances in the availability of ever larger and more varied electronic datasets, both historical and modern, provide unprecedented opportunities for corpus linguistics and the digital humanities. Metadata describing the texts or authors included in a corpus can be broken down into systematic variables, such as year of publication, genre, or level of education, which facilitate focused queries or the comparison of search results between subsections of the dataset.
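
To make the role of systematic metadata variables concrete, the following is a minimal sketch of how they can restrict a query to a subsection of a corpus and allow results to be compared between subsections. It is an illustration only, not the method of any corpus discussed in this special issue: the `Text` record, its field names (`year`, `genre`, `education`), the helper functions, and the four toy texts are all assumptions invented for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Text:
    """One corpus text with its structured metadata (hypothetical fields)."""
    year: int
    genre: str
    education: str
    body: str


# Toy corpus invented for this sketch; not drawn from any real corpus.
CORPUS = [
    Text(1750, "letter", "university",
         "I am much obliged to you for the kindness of your letter."),
    Text(1755, "letter", "none",
         "i hope you are well and the children are well"),
    Text(1802, "sermon", "university",
         "Let us consider the kindness and goodness of Providence."),
    Text(1810, "letter", "university",
         "Your kindness to my sister will not be forgotten."),
]


def select(corpus, **criteria):
    """Return the texts whose metadata equal every given criterion."""
    return [
        text for text in corpus
        if all(getattr(text, key) == value for key, value in criteria.items())
    ]


def frequency_per_1000(texts, pattern):
    """Regex hits per 1,000 whitespace-separated tokens in the given texts."""
    tokens = sum(len(text.body.split()) for text in texts)
    hits = sum(
        len(re.findall(pattern, text.body, flags=re.IGNORECASE))
        for text in texts
    )
    return 1000 * hits / tokens if tokens else 0.0


# Focused query: one genre, then one genre and one education level.
letters = select(CORPUS, genre="letter")
educated_letters = select(CORPUS, genre="letter", education="university")

# Comparison between subsections, normalised by subsection size.
for label, subset in [("all letters", letters),
                      ("letters, university", educated_letters)]:
    rate = frequency_per_1000(subset, r"\bkindness\b")
    print(f"{label}: {len(subset)} texts, {rate:.1f} hits per 1,000 tokens")
```

Normalising by subsection size is the design choice that matters here: raw hit counts are not comparable between subsections of different sizes, so the sketch reports a rate rather than a count. Tokenisation by whitespace is a deliberate simplification.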
