A survey on mining stack overflow: question and answering (Q&A) community
PurposeSoftware developers extensively use stack overflow (SO) for knowledge sharing on software development. Thus, software engineering researchers have started mining the structured/unstructured data present in certain software repositories including the Q&A software developer community SO, with the aim to improve software development. The purpose of this paper is show that how academics/practitioners can get benefit from the valuable user-generated content shared on various online social networks, specifically from Q&A community SO for software development.Design/methodology/approachA comprehensive literature review was conducted and 166 research papers on SO were categorized about software development from the inception of SO till June 2016.FindingsMost of the studies revolve around a limited number of software development tasks; approximately 70 percent of the papers used millions of posts data, applied basic machine learning methods, and conducted investigations semi-automatically and quantitative studies. Thus, future research should focus on the overcoming existing identified challenges and gaps.Practical implicationsThe work on SO is classified into two main categories; “SO design and usage” and “SO content applications.” These categories not only give insights to Q&A forum providers about the shortcomings in design and usage of such forums but also provide ways to overcome them in future. It also enables software developers to exploit such forums for the identified under-utilized tasks of software development.Originality/valueThe study is the first of its kind to explore the work on SO about software development and makes an original contribution by presenting a comprehensive review, design/usage shortcomings of Q&A sites, and future research challenges.
- Research Article
1
- 10.1142/s0218194023500274
- Jun 26, 2023
- International Journal of Software Engineering and Knowledge Engineering
Stack Overflow is a Q&A website that is popular among developers and extensively used in software engineering (SE) research. A significant body of research has examined how Stack Overflow can assist with software development tasks, such as recommending APIs. However, while researchers have recognized the importance of Stack Overflow in SE research related to software development tasks, the specific ways in which it is utilized and the reasons for its widespread usage in research have not been thoroughly explored. To address these knowledge gaps, we conducted the first study to understand the role of Stack Overflow in assisting with SE research regarding software development tasks by systematically examining relevant and high-quality research works. Meanwhile, we carried out a qualitative survey to gain insight into why researchers choose to utilize Stack Overflow in SE research and to solicit suggestions for the better use of Stack Overflow in research. The study identifies trends in the research area, prominent researchers and organizations, and the types of tasks that utilize Stack Overflow in research, with coding and debugging being the most common. Moreover, it examines how Stack Overflow data is utilized in SE research regarding software development tasks, including searching, training models, and mining associations. Our qualitative survey of researchers indicates that the popularity of Stack Overflow stems from its comprehensive explanations of technical topics that are often not found in documentation or manuals. The findings provide a comprehensive understanding of the role of Stack Overflow in SE research regarding software development tasks, and offer actionable implications for both researchers and stakeholders of Stack Overflow to facilitate future research and improvements.
- Research Article
22
- 10.1016/j.infsof.2019.03.009
- Mar 22, 2019
- Information and Software Technology
Bootstrapping cookbooks for APIs from crowd knowledge on Stack Overflow
- Research Article
24
- 10.1016/j.jss.2022.111427
- Jun 28, 2022
- Journal of Systems and Software
Impact of individualism and collectivism cultural profiles on the behaviour of software developers: A study of stack overflow
- Dissertation
- 10.14393/ufu.te.2019.924
- Apr 12, 2019
Modern-day software development is inseparable from the use of the Application Programming Interfaces (APIs). However, several studies have shown that learning and remembering how to use APIs is difficult for software developers due to inadequate documentation of some APIs. Recent years have witnessed the emergence and growing popularity of social media sites related to software development, such as Stack Overflow, DaniWeb and Quora. The information available on these sites is one important trend in supporting activities related to software development and debugging. In order to address the problem of introducing errors related to incorrect use of the API by the developer, an approach has been proposed which recommends posts from Stack Overflow that may contain the correction of these errors. However, this approach receives as input a code snippet suspected of containing an error. To evaluate this approach, a benchmark was constructed containing 30 code excerpts with potential API-usage-related bugs written in the Java and JavaScript programming languages extracted from the Open Hub Code Search site. The recommendation results showed that 66.67% of Java excerpts with potential API-usage-related bugs had their fixes found in the top-10 query results. Considering JavaScript excerpts, fixes were found in the top-10 results for 40% of them. Moreover, this approach outperformed the Google and FaCoY search engines in recommending fixes for this category of software errors. We have proposed an approach called Lucene+Score+How-to to assist developers during some programming task with a given API. This approach recommends only Q&A pairs from Stack Overflow belonging to the "How-to" category based on a query (list of terms) made in natural language by the user. We conducted experiments to evaluate the recommendation strategy. The programming problems used in the experiments were extracted randomly from cookbooks for three topics widely used by the software development community: Swing (Java), Boost (C++) and LINQ (C#). The results have shown that for 27 of the 35 (77.14%) activities, at least one recommended pair proved to be useful to the target programming problem.
- Research Article
327
- 10.1007/s10664-015-9379-3
- Apr 21, 2015
- Empirical Software Engineering
The popularity of mobile devices has been steadily growing in recent years. These devices heavily depend on software from the underlying operating systems to the applications they run. Prior research showed that mobile software is different than traditional, large software systems. However, to date most of our research has been conducted on traditional software systems. Very little work has focused on the issues that mobile developers face. Therefore, in this paper, we use data from the popular online Q&A site, Stack Overflow, and analyze 13,232,821 posts to examine what mobile developers ask about. We employ Latent Dirichlet allocation-based topic models to help us summarize the mobile-related questions. Our findings show that developers are asking about app distribution, mobile APIs, data management, sensors and context, mobile tools, and user interface development. We also determine what popular mobile-related issues are the most difficult, explore platform specific issues, and investigate the types (e.g., what, how, or why) of questions mobile developers ask. Our findings help highlight the challenges facing mobile developers that require more attention from the software engineering research and development communities in the future and establish a novel approach for analyzing questions asked on Q&A forums.
- Conference Article
6
- 10.1145/3475716.3475778
- Oct 11, 2021
Background The emergence of the COVID-19 pandemic has impacted all human activity, including software development. Early reports seem to indicate that the pandemic may have had a negative effect on software developers, socially and personally, but that their software development productivity may not have been negatively impacted. Aims: Early reports about the effects of the pandemic on software development focused on software developers' well-being and on their productivity as employees. We are interested in a different aspect of software development: the developers' public contributions, as seen in GitHub and Stack Overflow activities. Did the pandemic affect the developers' public contributions and, of so, in what way? Method: Considering the data from between 2017 and till 2020, we study the trends within GitHub's push, create, pull request, and release events, and within Stack Overflow's new users, posts, votes, and comments. We performed linear regressions, correlation analyses, outlier analyses, hypothesis testing, and we also contacted individual developers in order to gather qualitative insights about their unusual public contributions. Results: Our study shows that within GitHub and Stack Overflow, the onset of the pandemic (March/April 2020) is reflected in a set of outliers in developers' contributions that point to an increase in activity. The distributions of contributions during the entire year of 2020 were, in some aspects, different, but, in other aspects, similar from the recent past. Additionally, we found one noticeably disrupted pattern of contribution in Stack Overflow, namely the ratio Questions/Answers, which was much higher in 2020 than before. Testimonials from the developers we contacted were mixed: while some developers reported that their increase in activity was due to the pandemic, others reported that it was not. Conclusion: In Github, there was a noticeable increase in public software development activity in 2020, as well as more abrupt changes in daily activities; in Stack Overflow, there was a noticeable increase in new users and new questions at the onset of the pandemic, and in the ratio of Questions/Answers during 2020. The results may be attributed to the pandemic, but other factors could have come into play.
- Research Article
104
- 10.1007/s10664-018-9650-5
- Oct 1, 2018
- Empirical Software Engineering
Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of copyable code snippets. Using those snippets raises maintenance and legal issues. SO’s license (CC BY-SA 3.0) requires attribution, i.e., referencing the original question or answer, and requires derived work to adopt a compatible license. While there is a heated debate on SO’s license model for code snippets and the required attribution, little is known about the extent to which snippets are copied from SO without proper attribution. We present results of a large-scale empirical study analyzing the usage and attribution of non-trivial Java code snippets from SO answers in public GitHub (GH) projects. We followed three different approaches to triangulate an estimate for the ratio of unattributed usages and conducted two online surveys with software developers to complement our results. For the different sets of projects that we analyzed, the ratio of projects containing files with a reference to SO varied between 3.3% and 11.9%. We found that at most 1.8% of all analyzed repositories containing code from SO used the code in a way compatible with CC BY-SA 3.0. Moreover, we estimate that at most a quarter of the copied code snippets from SO are attributed as required. Of the surveyed developers, almost one half admitted copying code from SO without attribution and about two thirds were not aware of the license of SO code snippets and its implications.
- Dissertation
5
- 10.31274/etd-20200902-68
- Sep 4, 2020
Modern software systems are increasingly including machine learning (ML) as an integral component. However, we do not yet understand the difficulties faced by software developers when learning about ML libraries and using them within their systems. To fill that gap this thesis reports on a detailed (manual) examination of 3,243 highly-rated Q&A posts related to ten ML libraries, namely Tensorflow, Keras, scikitlearn, Weka, Caffe, Theano, MLlib, Torch, Mahout, and H2O, on Stack Overflow, a popular online technical Q&A forum. Our findings reveal the urgent need for software engineering (SE) research in this area. The second part of the thesis particularly focuses on understanding the Deep Neural Network (DNN) bug characteristics. We study 2,716 high-quality posts from Stack Overflow and 500 bug fix commits from Github about five popular deep learning libraries Caffe, Keras, Tensorflow, Theano, and Torch to understand the types of bugs, their root causes and impacts, bug-prone stage of deep learning pipeline as well as whether there are some common antipatterns found in this buggy software. While exploring the bug characteristics, our findings imply that repairing software that uses DNNs is one such unmistakable SE need where automated tools could be beneficial; however, we do not fully understand challenges to repairing and patterns that are utilized when manually repairing DNNs. So, the third part of this thesis presents a comprehensive study of bug fix patterns to address these questions. We have studied 415 repairs from Stack Overflow and 555 repairs from Github for five popular deep learning libraries Caffe, Keras, Tensorflow, Theano, and Torch to understand challenges in repairs and bug repair patterns. Our key findings reveal that DNN bug fix patterns are distinctive compared to traditional bug fix patterns and the most common bug fix patterns are fixing data dimension and neural network connectivity. Finally, we propose an automatic technique to detect ML Application Programming Interface (API) misuses. We started with an empirical study to understand ML API misuses. Our study shows that ML API misuse is prevalent and distinct compared to non-ML API misuses. Inspired by these findings, we contributed Amimla (Api Misuse In Machine Learning Apis) an approach and a tool for ML API misuse detection. Amimla relies on several technical innovations. First, we proposed an abstract representation of ML pipelines to use in misuse detection. Second, we proposed an abstract representation of neural networks for deep learning related APIs. Third, we have developed a representation strategy for constraints on ML APIs. Finally, we have developed a misuse detection strategy for both single and multi-APIs. Our experimental evaluation shows that Amimla achieves a high average accuracy of ∼80% on two benchmarks of misuses from Stack Overflow and Github.
- Research Article
1
- 10.5281/zenodo.3761710
- Apr 18, 2020
- Zenodo (CERN European Organization for Nuclear Research)
The Covid-19 outbreak, beyond its tragic effects, has changed to an unprecedented extent almost every aspect of human activity throughout the world. At the same time, the pandemic has stimulated enormous amount of research by scientists across various disciplines, seeking to study the phenomenon itself, its epidemiological characteristics and ways to confront its consequences. Information Technology, and particularly Data Science, drive innovation in all related to Covid-19 biomedical fields. Acknowledging that software developers routinely resort to open question and answer communities like Stack Overflow to seek advice on solving technical issues, we have performed an empirical study to investigate the extent, evolution and characteristics of Covid-19 related posts. In particular, through the study of 464 Stack Overflow questions posted mainly in February and March 2020 and leveraging the power of text mining, we attempt to shed light into the interest of developers in Covid-19 related topics and the most popular technological problems for which the users seek information. The findings reveal that indeed this global crisis sparked off an intense and increasing activity in Stack Overflow with most post topics reflecting a strong interest on the analysis of Covid-19 data, primarily using Python technologies.
- Research Article
10
- 10.1016/j.jss.2024.111964
- Jan 8, 2024
- Journal of Systems and Software
An empirical study of code reuse between GitHub and stack overflow during software development
- Research Article
6
- 10.1142/s0218194021500467
- Oct 1, 2021
- International Journal of Software Engineering and Knowledge Engineering
GitHub and Stack Overflow are often used together for software development. GH-SO users, who use both GitHub and Stack Overflow, contribute to the development of various software projects in GitHub and share their knowledge and experience on software development in Stack Overflow. To widely understand the interests and working habits of GH-SO users on software development, it is important to investigate how GH-SO users utilize GitHub and Stack Overflow. In this paper, we present an exploratory study on GitHub commit and Stack Overflow post activities of GH-SO users. Specifically, we investigate the working habits of GH-SO users on GitHub commit and Stack Overflow post activities. We randomly selected 19,756 of GH-SO users as our target sample and collected 2,819,483 and 2,147,317 of commit activity data and post activity data of the GH-SO users. We then categorized the collected commit and post activity datasets into specific categories on programming languages and statistically analyzed the categorized commit and post activity datasets. As the results of our analysis, we found the following: (1) The overall commit and post activities of the GH-SO users share some similarity. (2) The commit activities gradually change while the post activities drastically change over time. (3) The commit activities of the GH-SO users are broadly distributed while the post activities are narrowly distributed and the commit activity can be better predictor for post activity. (4) The commit activity of the GH-SO users tends to be performed prior post activity. We believe that our findings can contribute to finding the ways to better support commit and post activities of GitHub and Stack Overflow users.
- Conference Article
261
- 10.1145/2884781.2884800
- May 14, 2016
Software developers need access to different kinds of information which is often dispersed among different documentation sources, such as API documentation or Stack Overflow. We present an approach to automatically augment API documentation with "insight sentences" from Stack Overflow- sentences that are related to a particular API type and that provide insight not contained in the API documentation of that type. Based on a development set of 1,574 sentences, we compare the performance of two state-of-the-art summarization techniques as well as a pattern-based approach for insight sentence extraction. We then present SISE, a novel machine learning based approach that uses as features the sentences themselves, their formatting, their question, their answer, and their authors as well as part-of-speech tags and the similarity of a sentence to the corresponding API documentation. With SISE, we were able to achieve a precision of 0.64 and a coverage of 0.7 on the development set. In a comparative study with eight software developers, we found that SISE resulted in the highest number of sentences that were considered to add useful information not found in the API documentation. These results indicate that taking into account the meta data available on Stack Overflow as well as part-of-speech tags can significantly improve unsupervised extraction approaches when applied to Stack Overflow data.
- Conference Article
18
- 10.1145/3468264.3473114
- Aug 18, 2021
Stack Overflow is one of the most popular technical Q&A sites used by software developers. Seeking help from Stack Overflow has become an essential part of software developers’ daily work for solving programming-related questions. Although the Stack Overflow community has provided quality assurance guidelines to help users write better questions, we observed that a significant number of questions submitted to Stack Overflow are of low quality. In this paper, we introduce a new web-based tool, Code2Que, which can help developers in writing higher quality questions for a given code snippet. Code2Que consists of two main stages: offline learning and online recommendation. In the offline learning phase, we first collect a set of good quality ⟨code snippet, question⟩ pairs as training samples. We then train our model on these training samples via a deep sequence-to-sequence approach, enhanced with an attention mechanism, a copy mechanism and a coverage mechanism. In the online recommendation phase, for a given code snippet, we use the offline trained model to generate question titles to assist less experienced developers in writing questions more effectively. To evaluate Code2Que, we first sampled 50 low quality ⟨code snippet, question⟩ pairs from the Python and Java datasets on Stack Overflow. Then we conducted a user study to evaluate the question titles generated by our approach as compared to human-written ones using three metrics: Clearness, Fitness and Willingness to Respond. Our experimental results show that for a large number of low-quality questions in Stack Overflow, Code2Que can improve the question titles in terms of Clearness, Fitness and Willingness measures.
- Conference Article
270
- 10.1109/sp.2017.31
- May 1, 2017
S.121-136
- Dissertation
1
- 10.32657/10356/75873
- Jan 1, 2018
With software penetrating into all kinds of traditional or emerging industries, there is a great demand on software development. Faced with the fact that there is a limited number of developers, one important way to meet such urgent needs is to significantly improve developers’ productivity. As the most popular Q&A site, Stack Overflow has accumulated abundant software development knowledge. Effectively leveraging such a big data can help developers reuse the experience there to further improve their working efficiency. However, the rich yet unstructured large-scale data in Stack Overflow makes it difficult to search due to two reasons. First, there are too many questions and answers within the site, and there may be lingual gap (the same meaning can be written in different languages) between the query and content in Stack Overflow. In addition, the decay of information quality such as misspelling, inconsistency, and abuse of domain-specific abbreviations aggravates the search performance. Second, some higher-order knowledge in Stack Overflow is implicit for searching and it needs certain distillation from existing raw data. In this thesis, I present methods for supporting developers’ information search over Stack Overflow. To overcome the lexical gap and information decay, I also develop an edit recommendation tool to ensure the post quality of Stack Overflow so that posts can be more easily searched by the query. But such explicit information search still requires developers to read, understand and summarize, which is time-consuming. So I propose to shift from the document (information) search to entity (knowledge) search by mining the implicit knowledge from tags in Stack Overflow to render direct answers to developers instead of asking them to read lengthy documents. I first build a basic software-specific knowledge graph including thousands of software-engineering terms and their associations by association rule mining and community detection. Then, I enrich the knowledge graph with more fine-grained relationships i.e., analogy among different third-party libraries. Finally, I combine both semantic and lexical information to infer morphological forms of software terms so that the knowledge graph is more robust for knowledge search.