Abstract
This article describes the development of the digital infrastructure at a research data centre for audio-visual linguistic research data, the Hamburg Centre for Language Corpora (HZSK) at the University of Hamburg in Germany, over the past ten years. The typical resource hosted in the HZSK Repository, the core component of the infrastructure, is a collection of recordings with time-aligned transcripts and additional contextual data, a spoken language corpus. Since the centre has a thematic focus on multilingualism and linguistic diversity and provides its service to researchers within linguistics and other disciplines, the development of the infrastructure was driven by diverse usage scenarios and user needs on the one hand, and by the common technical requirements for certified service centres of the CLARIN infrastructure on the other. Beyond the technical details, the article also aims to be a contribution to the discussion on responsibilities and services within emerging digital research data infrastructures and the fundamental issues in sustainability of research software engineering, concluding that in order to truly cater to user needs across the research data lifecycle, we still need to bridge the gap between discipline-specific research methods in the process of digitalisation and generic digital research data management approaches.
Highlights
Over the last few decades, the development of digital practices in the humanities and social sciences has accelerated and become more widespread, partly along with digitalisation of society in general and partly as a result of targeted funding
Centre on Multilingualism at the University of Hamburg, The Hamburg Centre for Language Corpora (HZSK)5 was founded with the aim to cater for the legacy of curated research data and software from the Special Research Centre and for the newly founded centre to become a part of emerging digital research infrastructures
Further along the research data lifecycle, we find the users who reuse the data provided by the depositors via the HZSK Repository, including both students writing term papers and experienced researchers performing secondary or complementary analyses with existing data
Summary
Over the last few decades, the development of digital practices in the humanities and social sciences has accelerated and become more widespread, partly along with digitalisation of society in general and partly as a result of targeted funding. At the Hamburg Centre for Language Corpora at the University of Hamburg, digital infrastructure and related services have been developed focussing on spoken data in research on multilingualism, linguistic diversity and language documentation, i.e., often the kind of audio-visual annotated linguistic research data referred to as spoken or oral (language) corpora. These complex collection resources comprise various data types, entities and relations which pose a challenge for data modelling and handling. Contributing to the current discussion on reuse and citation of research data and the replicability of research in general, this contribution describes evolving methods for curation, publication and dissemination of complex resource types considering these aspects
Published Version (Free)
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have