Abstract

Standardized manual content analysis is an important methodology for capturing the messages in journalistic and social media. For supervised machine learning approaches in particular, human-generated training data is needed. The coding process, as well as the selection of suitable coders, is crucial for obtaining good data quality. However, little research has examined how the coding process should be designed and how the personal characteristics of coders might influence data quality. This blind spot becomes even more critical because coding is nowadays increasingly performed with the help of crowdworkers. When working with such anonymous coders, the coding process is less controlled by the researchers, which can lead to a loss of quality. In our comparative mixed-methods study, we compare data from a content analysis on the topic of legalizing abortion (n = 300 tweets). We conducted the analysis in two ways: first, with a team of four student coders who also received training, and second, with 150 crowdworkers. All coders completed a short survey on their socio-demographics and personality traits. The results show that both validity and reliability are higher for the student coders, especially for difficult coding tasks. Furthermore, multivariate (logistic) regression analysis reveals that personal characteristics such as formal education and emotional sensitivity also affect coding quality. Hence, with a reflective selection of coders, as well as a thoughtful design of the coding process and the codebook, the quality of data collection can be increased, even when relying on crowdworkers.
