Abstract

With the rise of social media, a vast amount of new primary research material has become available to social scientists, but the sheer volume and variety of this material make it difficult to access through traditional approaches: close reading and the nuanced interpretation of manual qualitative coding and analysis. This paper sets out to bridge the gap by developing semi-automated replacements for manual coding through a mixture of crowdsourcing and machine learning, seeded by a careful manual coding scheme developed from a small sample of data. To show the promise of this approach, we attempt to create a nuanced categorisation of responses on Twitter to several recent high-profile deaths by suicide. Through these, we show that it is possible to code a large dataset automatically to a high degree of accuracy (71%), and we discuss the broader possibilities and pitfalls of using Big Data methods for Social Science.

Highlights

  • Social science has always had to find ways of moving between the small-scale, interpretative concerns of qualitative research and the large-scale, often predictive concerns of the quantitative

  • As a case study in applying semi-automated coding, this paper looks at public empathy – the expression of empathy that, even if it is imagined to be directed at one other person [2], can potentially be read by many – in the context of high-profile deaths by suicide

  • Whereas previous studies have looked at communal grief and individual mourning in untimely deaths such as that of Michael Jackson [18,21], this paper aims to interrogate discourses and practices around suicide in mediated mourning, an area in which there has been much less of a focus to date

Summary

Introduction

Social science has always had to find ways of moving between the small-scale, interpretative concerns of qualitative research and the large-scale, often predictive concerns of the quantitative. The application of traditional methods from qualitative social science, such as the close analysis of a small-scale sample of tweets relating to a public death, or the manual application of a coding frame to a larger volume of responses, is likely to miss crucial insights relating to volume, patterning or dynamics. We therefore develop a coding frame manually on a small sample and then scale it by asking crowdworkers to label a much larger sample. The quality of the crowd-generated labels is ensured by checking agreement among crowdworkers, and between the crowdworkers' labels and a golden set coded by trained researchers. This larger labelled dataset is then used to train a supervised machine learning model that automatically labels the entire dataset. Our tests show that the final machine-generated labels agree with the crowd labels with an accuracy of 71%, which permits nuanced interpretation. Although this is over 5.6 times the accuracy of a random baseline, we still need to reconcile the social side of research interpretation with the potentially faulty automatic classification. We allow for this by explicitly quantifying the errors in each of the labels, and by drawing only interpretations that still stand after allowing a margin of safety corresponding to these errors.
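
To make the pipeline concrete, the following is a minimal sketch, in Python with scikit-learn, of the three steps just described: validating crowd labels against a researcher-coded golden set, training a supervised classifier on the crowd-labelled sample, and comparing machine accuracy with a random baseline. The tweets, label names, and the choice of TF-IDF features with logistic regression are illustrative assumptions only; the actual coding frame, features and classifier used in the study are described in the sections listed below.

```python
# Illustrative sketch of the semi-automated coding pipeline (hypothetical data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.pipeline import make_pipeline

# 1) Check crowd labels against the researcher-coded "golden set".
gold_texts = [                                          # hypothetical tweets
    "so sad to hear this, thoughts with the family",
    "news report: singer found dead at home",
    "sending love to everyone affected by this",
]
gold_labels = ["empathy", "news", "empathy"]            # hypothetical coding frame
crowd_labels_on_gold = ["empathy", "news", "news"]      # hypothetical crowd answers
print("crowd vs gold accuracy:", accuracy_score(gold_labels, crowd_labels_on_gold))
print("crowd vs gold kappa:", cohen_kappa_score(gold_labels, crowd_labels_on_gold))

# 2) Train a supervised classifier on the (larger) crowd-labelled sample.
crowd_texts = gold_texts * 10          # stand-in for the bigger crowd-labelled set
crowd_labels = ["empathy", "news", "empathy"] * 10
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(crowd_texts, crowd_labels)

# 3) Apply the model to the full dataset and compare with a random baseline.
machine_labels = model.predict(gold_texts)
accuracy = accuracy_score(gold_labels, machine_labels)
n_categories = len(set(crowd_labels))
random_baseline = 1.0 / n_categories   # e.g. ~12.5% for an 8-category frame,
                                       # so 71% would be roughly 5.6x that baseline
print(f"machine accuracy {accuracy:.2f}, random baseline {random_baseline:.2f}")
```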

Related literature
Background and approach
Datasets
Analysis approach: semi-automated coding
Bootstrapping coding using manual effort
Coding typology using trained researchers
Scaling the coding using crowd-sourcing
Fine-tuning execution parameters
Validation of crowdsourced labels
Algorithm
Cross-validation
Manual validation
Analysing dynamics of public empathy
On interpreting semi-automated coding
Findings
Qualitative reading through semi-automated coding
