Abstract


This paper describes the development of a systematic approach to the creation, management and curation of linguistic resources, particularly spoken language corpora. It also presents first steps towards a framework for continuous quality control to be used within external research projects by non-technical users, and discusses various domain- and discipline-specific problems and individual solutions. The creation of spoken language corpora is not only a time-consuming and costly process; the resulting resources also often represent intangible cultural heritage, containing recordings of, for example, extinct languages or historical events. Since high-quality resources are needed to enable re-use in as many future contexts as possible, researchers need to be provided with the necessary means for quality control. We believe that this includes methods and tools adapted to Humanities researchers as non-technical users, and that these methods and tools need to be developed to support existing tasks and goals of research projects.

Highlights

  • This paper presents the development of a systematic approach to research data management and data curation for linguistic resources, in particular spoken language corpora, with the specific aim of enhancing quality control and quality assurance

  • The thorough curation process carried out at the Hamburg Centre for Language Corpora (HZSK), a research data centre specializing in language corpora with a thematic focus on linguistic diversity, is based on a software system for quality control, which is one aspect of the quality assurance work described within this paper

  • By working towards continuous quality control and continuous integration, we aim to prevent the high curation costs often involved in making spoken language corpora from research projects re-usable in a wider context


Summary

Introduction

This paper presents the development of a systematic approach to research data management and data curation for linguistic resources, in particular spoken language corpora, with the specific aim of enhancing quality control and quality assurance. While the resource type considered in this paper is highly specific, the general approach and the challenges of cooperative settings apply to most contexts in which research data is created or enriched manually for analysis and in which questions of quality management have yet to be answered from the various participants’ perspectives. By working towards continuous quality control and continuous integration, we aim to prevent the high curation costs often involved in making spoken language corpora from research projects re-usable in a wider context. During the first, rough implementation phase, the amount of time required was reduced by 30% compared with the previous way of working.

Related Work
Conclusions from Working with the Framework
