Abstract

Corpus linguistics is increasingly employed to explore large, publicly-available datasets such as newspaper texts, government speeches and online fora. However, comparatively few corpora exist where the subject matter concerns sensitive topics about living individuals since, due to their highly personal and confidential nature, these texts are hard to access and raise difficult ethical questions around secondary data analysis. One exception is the Writing in professional social work practice (WiSP) corpus, comprising texts written by UK-based professional social workers in the course of their daily work and now available to other researchers through the ReShare archive. This paper focuses on the challenges involved in building the WiSP corpus and the epistemological and ethical issues raised. Two key aspects of research practice are discussed: data anonymisation and dataset archiving. Specifically, the paper explores decision-making around anonymisation and an ethically-informed rationale for treating some texts as ‘not for sharing’, leading to the decision to create two corpora: one for the research team and a further anonymised and slightly reduced version for archiving. The paper explores what the WiSP corpora (Corpus 1 and Corpus 2) contribute to understandings about social work writing, the extent to which the two corpora enable different analyses and whether the existence of two corpora is problematic from a corpus linguistic perspective. The paper concludes by considering how the ethical decisions around corpus creation of sensitive texts raise questions about key principles in corpus linguistics.

Highlights

  • This paper draws on our experience of working with the UK-based WiSP corpus dataset (Lillis, Leedham and Twiner, 2019), exploring how the creation of a hard-to-access corpus of sensitive texts raises challenging methodological issues in relation to corpus compilation and the additional preparation required to meet the funders’ requirements for secondary archiving

  • Research questions addressed in this paper are: 1) What are the challenges in data preparation for archiving hard-to-access, sensitive textual data, around anonymisation coding?

  • A number of difficulties arose: 1) In order to secure sufficient participation and text collection, we involved additional local authorities (LAs); 2) Whilst all social work participants (n = 71) were happy to be interviewed, permission was not given by all LAs to access their written texts; 3) A total of 29 social workers felt they had time available to keep a log of their writing; 4) Writing logs were not always kept over 20 consecutive working days due to other time commitments, sick leave and holidays

Summary

Introduction

This paper draws on our experience of working with the UK-based WiSP corpus dataset (Lillis, Leedham and Twiner, 2019), exploring how the creation of a hard-to-access corpus of sensitive texts raises challenging methodological issues in relation to corpus compilation and the additional preparation required to meet the funders’ requirements for secondary archiving. We use the compilation of the WiSP corpus as our framing for examining these principles and explore the ethical considerations and the solutions we came to in order to make the corpus more widely available. The paper discusses the methodological, ethical and epistemological considerations around anonymising a relatively small corpus (1 million words) of sensitive texts, considering institutional access issues, our values and commitments as researchers to participants

Corpus compilation of sensitive and hard-to-access texts
Conventional principles of corpus-building
Lack of corpora containing sensitive and hard-to-access texts
On the shift towards archiving
Introducing the WiSP corpus
Building the corpus 1: anonymisation coding
Building the corpus 2: archiving
Comparison of corpus 1 and corpus 2
Exploring the WiSP corpora
Worked example 2: use of quotations in assessment reports
Worked example 3: exploring a lexical item
Summary
Findings
Conclusion