Abstract

The goal of this research was to understand what the data science community on Reddit discussed before, during, and after COVID-19. We used a publicly available Reddit API to harvest first-level post data from the r/datascience subreddit. We then manually annotated the posts to explore the taxonomy of trends and themes discussed by practitioners in the Reddit data science community, and augmented the manually annotated data using a BERT model with topic modeling. The key discussion themes, in order of frequency, were: Education, Jobs, Methods (of data science), Hardware and data collection, Data visualization, and Quality. The Quality theme includes discussions on bias, transparency, and fairness. A key finding was that there were very few discussions of data science project quality, and in particular of how to minimize the risk of machine learning bias. Because discussions of bias are not yet common, data science teams should proactively identify and address questions and concerns that might arise in their projects, especially by increasing the team's focus on potential bias and fairness.
