Abstract

Digital traces have become an essential source of data in the social sciences because they provide new insights into human behavior and allow studies to be conducted on a larger scale. One particular area of interest is the estimation of users’ characteristics from their texts on social media. Although it has been established that basic categorical attributes can be predicted effectively from social media posts, the extent to which this applies to more complex continuous characteristics is less well understood. In this research, we used data from a nationally representative panel of students to predict their educational outcomes, as measured by standardized tests, from short texts on VK, a popular Russian social networking site. We combined unsupervised learning of word embeddings on a large corpus of VK posts with a simple supervised model trained on individual posts. The resulting model was able to distinguish between posts written by high- and low-performing students with an accuracy of 94%. We then applied the model to reproduce the ranking of 914 high schools from 3 cities and of the 100 largest universities in Russia. We also showed that the same model could predict academic performance from tweets as well as from VK posts. Finally, we explored predictors of high and low academic performance to obtain insights into the factors associated with different educational outcomes.

Highlights

  • In the past decade, digital trace data has become an integral part of social science research [1, 2]

  • The continuous-vocabulary strategy is simpler than state-of-the-art ANN methods [37] and, as a result, allows straightforward interpretation of the predictions and exploration of the differential language use by users, as we demonstrate in the Results section

  • We first explored the predictive power of common text features with respect to academic performance


Summary

Introduction

Digital trace data has become an integral part of social science research [1, 2]. Socio-demographic characteristics such as gender, ethnicity, age, and income were predicted from profile images [6], tweets [7, 8], and Facebook posts [9]. The work in this domain is typically focused on basic, often categorical, demographic variables. Examples of such work include predicting personality (see [11] for review) and mental health status from social media activity (see [12] for review). These complex individual-level characteristics were predicted from Facebook likes [13], posts on Facebook [14] or Twitter [15], and Instagram images [16].

