What kind of questions do developers ask on Stack Overflow? A comparison of automated approaches to classify posts into question categories

Stefanie Beyer, Christian Macho, Martin Pinzger, Massimiliano Di Penta

doi:10.1007/s10664-019-09758-x

Open Access

https://doi.org/10.1007/s10664-019-09758-x

Abstract

On question and answer sites, such as Stack Overflow (SO), developers use tags to label the content of a post and to support other developers in searching and browsing questions. However, these tags mainly refer to technological aspects rather than the purpose of the question. Tagging questions with their purpose can add a new dimension to the identification of discussed topics in posts on SO. In this paper, we aim to automate the classification of SO question posts into seven question categories. As a first step, we harmonized existing taxonomies of question categories and then manually classified 1,000 SO questions according to our new taxonomy. In addition to the question category, we marked the phrases that indicate a question category for each of the posts. We then use this data set to automate the classification of posts using two approaches. For the first approach, we manually analyzed the marked phrases to find patterns and derived regular expressions from them. Based on these regular expressions, we implemented a classifier for each category that determines whether a post belongs to that category. In the second approach, we use the curated data set to train classification models with supervised machine learning algorithms (Random Forest and Support Vector Machines). For the machine learning algorithms, we experimented with 1,312 different configurations regarding the preprocessing of the text and the representation of the input data. Then, we compared the performance of the regex approach with the performance of the best machine learning configuration on a validation set of 110 posts. The results show that using the regular expression approach, we can classify posts into the correct question category with an average precision and recall of 0.90, and an MCC of 0.68.
Additionally, we applied the regex approach to all questions on SO that deal with Android app development and investigated the co-occurrence of question categories in posts. We found that the categories API usage, Conceptual, and Discrepancy are the most frequently assigned question categories and that they also frequently occur together. Our approach can be used to support developers in browsing SO discussions or researchers in building recommender systems based on SO.
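
The regex approach described in the abstract (one binary classifier per category, so a post can receive several categories) can be sketched roughly as follows. This is a minimal illustration only: the patterns below are hypothetical and are not the regular expressions derived by the authors, and only the three categories named in the abstract are included.

```python
import re
from typing import Dict, Pattern, Set

# Hypothetical patterns for illustration only; these are NOT the
# regular expressions from the paper.
PATTERNS: Dict[str, Pattern[str]] = {
    "API usage": re.compile(r"\bhow (?:do|can|to)\b", re.IGNORECASE),
    "Conceptual": re.compile(
        r"\b(?:what is|difference between|why (?:is|does))\b", re.IGNORECASE
    ),
    "Discrepancy": re.compile(
        r"\b(?:does(?:n't| not) work|not working|unexpected)\b", re.IGNORECASE
    ),
}


def classify(post: str) -> Set[str]:
    """Return every category whose pattern matches the post text.

    Each category is decided independently, so the result may be empty
    or contain several categories (co-occurrence).
    """
    return {name for name, pattern in PATTERNS.items() if pattern.search(post)}


if __name__ == "__main__":
    print(classify("How do I parse JSON? My code doesn't work."))
    print(classify("What is the difference between a Service and a Thread?"))
```

Because each category has its own independent check, counting how often two categories are returned together for the same post gives the kind of co-occurrence analysis mentioned for the Android questions.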

Highlights

Many developers use question and answer forums, such as Stack Overflow (SO), to discuss and solve their development issues
We can answer our research question RQ-2.1, "What is the performance of our regex approach for classifying Stack Overflow posts into the 7 question categories?", as follows: with the regex approach, we can classify a post into the correct question category with an average precision, recall, and Matthews correlation coefficient (MCC) of 0.91, 0.91, and 0.68, respectively
To determine the best configuration for classifying posts into the seven question categories, we compare the best performing models obtained with Random Forest (RF) and Support Vector Machines (SVM) based on their performance metrics
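
For readers unfamiliar with the metrics named in the highlights, the following sketch computes precision, recall, and MCC for a single binary category from the standard confusion-matrix definitions. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for this example; how the paper averages the per-category values is not stated on this page.

```python
import math
from typing import Sequence, Tuple


def binary_metrics(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (precision, recall, mcc) for 0/1 labels of one category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Assumption: undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc


if __name__ == "__main__":
    # One of two positive posts is found, with no false alarms.
    print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 0]))
```

Unlike precision and recall, MCC also takes true negatives into account, which is one reason it can be noticeably lower than the other two values when categories are imbalanced.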

Summary

Introduction

Many developers use question and answer forums, such as Stack Overflow (SO), to discuss and solve their development issues. There are more than 16,000,000 diverse questions on SO that deal with developers’ problems. For these questions, there exist more than 27,000,000 answer posts. On the one hand, this is good, since it enables developers to find solutions to their problems; on the other hand, it is challenging to find the right solution among such a large number of posts. Developers ask for a better data organization of Q&A forums to increase search efficiency and limit the time needed to find adequate solutions (Wu et al. 2018).

Journal: Empirical Software Engineering
Publication Date: Aug 28, 2019
Citations: 45
License type: open-access