StackOverflow and GitHub: Associations between Software Development and Crowdsourced Knowledge
Stack Overflow is a popular on-line programming question and answer community providing its participants with rapid access to knowledge and expertise of their peers, especially benefitting coders. Despite the popularity of Stack Overflow, its role in the work cycle of open-source developers is yet to be understood: on the one hand, participation in it has the potential to increase the knowledge of individual developers thus improving and speeding up the development process. On the other hand, participation in Stack Overflow may interrupt the regular working rhythm of the developer, hence also possibly slow down the development process. In this paper we investigate the interplay between Stack Overflow activities and the development process, reflected by code changes committed to the largest social coding repository, GitHub. Our study shows that active GitHub committers ask fewer questions and provide more answers than others. Moreover, we observe that active Stack Overflow askers distribute their work in a less uniform way than developers that do not ask questions. Finally, we show that despite the interruptions incurred, the Stack Overflow activity rate correlates with the code changing activity in GitHub.
- Research Article
63
- 10.1007/s10664-016-9430-z
- Apr 19, 2016
- Empirical Software Engineering
Programming-specific Q&A sites (e.g., Stack Overflow) are being used extensively by software developers for knowledge sharing and acquisition. Due to the cross-reference of questions and answers (note that users also reference URLs external to the Q&A site. In this paper, URL sharing refers to internal URLs within the Q&A site, unless otherwise stated), knowledge is diffused in the Q&A site, forming a large knowledge network. In Stack Overflow, why do developers share URLs? How is the community feedback to the knowledge being shared? What are the unique topological and semantic properties of the resulting knowledge network in Stack Overflow? Has this knowledge network become stable? If so, how does it reach to stability? Answering these questions can help the software engineering community better understand the knowledge diffusion process in programming-specific Q&A sites like Stack Overflow, thereby enabling more effective knowledge sharing, knowledge use, and knowledge representation and search in the community. Previous work has focused on analyzing user activities in Q&A sites or mining the textual content of these sites. In this article, we present a methodology to analyze URL sharing activities in Stack Overflow. We use open coding method to analyze why users share URLs in Stack Overflow, and develop a set of quantitative analysis methods to study the structural and dynamic properties of the emergent knowledge network in Stack Overflow. We also identify system designs, community norms, and social behavior theories that help explain our empirical findings. Through this study, we obtain an in-depth understanding of the knowledge diffusion process in Stack Overflow and expose the implications of URL sharing behavior for Q&A site design, developers who use crowdsourced knowledge in Stack Overflow, and future research on knowledge representation and search.
- Research Article
53
- 10.1109/tse.2019.2954319
- Dec 5, 2019
- IEEE Transactions on Software Engineering
Stack Overflow is one of the most active communities for developers to share their programming knowledge. Answers posted on Stack Overflow help developers solve issues during software development. In addition to posting answers, users can also post comments to further discuss their associated answers. As of Aug 2017, there are 32.3 million comments that are associated with answers, forming a large collection of crowdsourced repository of knowledge on top of the commonly-studied Stack Overflow answers. In this study, we wish to understand how the commenting activities contribute to the crowdsourced knowledge. We investigate what users discuss in comments, and analyze the characteristics of the commenting dynamics, (i.e., the timing of commenting activities and the roles of commenters). We find that: 1) the majority of comments are informative and thus can enhance their associated answers from a diverse range of perspectives. However, some comments contain content that is discouraged by Stack Overflow. 2) The majority of commenting activities occur after the acceptance of an answer. More than half of the comments are fast responses occurring within one day of the creation of an answer, while later comments tend to be more informative. Most comments are rarely integrated back into their associated answers, even though such comments are informative. 3) Insiders (i.e., users who posted questions/answers before posting a comment in a question thread) post the majority of comments within one month, and outsiders (i.e., users who never posted any question/answer before posting a comment) post the majority of comments after one month. Inexperienced users tend to raise limitations and concerns while experienced users tend to enhance the answer through commenting. Our study provides insights into the commenting activities in terms of their content, timing, and the individuals who perform the commenting. For the purpose of long-term knowledge maintenance and effective information retrieval for developers, we also provide actionable suggestions to encourage Stack Overflow users/engineers/moderators to leverage our insights for enhancing the current Stack Overflow commenting system for improving the maintenance and organization of the crowdsourced knowledge.
- Research Article
60
- 10.1108/dta-07-2017-0054
- Feb 9, 2018
- Data Technologies and Applications
PurposeSoftware developers extensively use stack overflow (SO) for knowledge sharing on software development. Thus, software engineering researchers have started mining the structured/unstructured data present in certain software repositories including the Q&A software developer community SO, with the aim to improve software development. The purpose of this paper is show that how academics/practitioners can get benefit from the valuable user-generated content shared on various online social networks, specifically from Q&A community SO for software development.Design/methodology/approachA comprehensive literature review was conducted and 166 research papers on SO were categorized about software development from the inception of SO till June 2016.FindingsMost of the studies revolve around a limited number of software development tasks; approximately 70 percent of the papers used millions of posts data, applied basic machine learning methods, and conducted investigations semi-automatically and quantitative studies. Thus, future research should focus on the overcoming existing identified challenges and gaps.Practical implicationsThe work on SO is classified into two main categories; “SO design and usage” and “SO content applications.” These categories not only give insights to Q&A forum providers about the shortcomings in design and usage of such forums but also provide ways to overcome them in future. It also enables software developers to exploit such forums for the identified under-utilized tasks of software development.Originality/valueThe study is the first of its kind to explore the work on SO about software development and makes an original contribution by presenting a comprehensive review, design/usage shortcomings of Q&A sites, and future research challenges.
- Research Article
7
- 10.1007/s10664-021-10028-y
- Nov 3, 2021
- Empirical Software Engineering
On Stack Overflow, users reuse 11,926,354 external links to share the resources hosted outside the Stack Overflow website. The external links connect to the existing programming-related knowledge and extend the crowdsourced knowledge on Stack Overflow. Some of the external links, so-called as repeated external links, can be shared for multiple times. We observe that 82.5% of the link sharing activities (i.e., sharing links in any question, answer, or comment) on Stack Overflow share external resources, and 57.0% of the occurrences of the external links are sharing the repeated external links. However, it is still unclear what types of external resources are repeatedly shared. To help users manage their knowledge, we wish to investigate the characteristics of the repeated external links in knowledge sharing on Stack Overflow. In this paper, we analyze the repeated external links on Stack Overflow. We observe that external links that point to the text resources (hosted in documentation websites, tutorial websites, etc.) are repeatedly shared the most. We observe that: 1) different users repeatedly share the same knowledge in the form of repeated external links, thus increasing the maintenance effort of knowledge (e.g., update invalid links in multiple posts), 2) the same users can repeatedly share the external links for the purpose of promotion, and 3) external links can point to webpages with an overload of information that is difficult for users to retrieve relevant information. Our findings provide insights to Stack Overflow moderators and researchers. For example, we encourage Stack Overflow to centrally manage the commonly occurring knowledge in the form of repeated external links in order to better maintain the crowdsourced knowledge on Stack Overflow.
- Conference Article
1
- 10.1109/qrs51102.2020.00066
- Dec 1, 2020
Stack Overflow (SO) is one of the most popular Q&A sites for not only providing valuable information to software developers but also encouraging the sharing of knowledge and problem solving. Unfortunately, the information provided by SO is not always sufficient for developers. In this paper, we empirically show how fast and effectively historical code changes can substitute for missing or unanswered SO articles. Developers in all around the world encounter many problems daily and their trial-and-error experiences to resolve the problems are accumulated in the code change history. The extracted source code differences are expected to provide valuable information to developers before the questions and answers are posted on SO. In our study, we focus on the usage of APIs as the topic of SO articles, because many developers are interested in API programming and suffer from the problems related to API usage. We extracted 4,780 code differences from 713 repositories of Android applications (F-Droid). As a result, we found that 64% of SO articles on Android APIs are related to code differences, whereas 44% of code differences are related to SO articles. Not a few code differences appear before the corresponding SO articles are actually posted. The median of time lag between the first appearance of code changes and the first actual SO postings is 22 months.
- Research Article
10
- 10.1007/s11280-018-0621-y
- Jul 23, 2018
- World Wide Web
Software developers need access to correlated information (e.g., API documentation, Wikipedia pages, Stack Overflow questions and answers) which are often dispersed among different Web resources. This paper is concerned with the situation where a developer is visiting a Web page, but at the same time is willing to explore correlated Web resources to extend his/her knowledge or to satisfy his/her curiosity. Specifically, we present an item-based collaborative filtering technique, named LinkLive, for automatically recommending a list of correlated Web resources for a particular Web page. The recommendation is done by exploiting hyperlink associations from the crowdsourced knowledge on Stack Overflow. We motivate our research using an exploratory study of hyperlink dissemination patterns on Stack Overflow. We then present our LinkLive technique that uses multiple features, including hyperlink co-occurrences in Q&A discussions, locations (e.g., question, answer, or comment) in which hyperlinks are referenced, and votes for posts/comments in which hyperlinks are referenced. Experiments using 7 years of Stack Overflow data show that, our technique recommends correlated Web resources with promising accuracy in an open setting. A user study of 6 participants suggests that practitioners find the recommended Web resources useful for Web discovery.
- Research Article
17
- 10.1109/access.2023.3238813
- Jan 1, 2023
- IEEE Access
Web Services (WSs) are gaining worldwide popularity due to reliable and fast intercommunication services for the development of web and mobile applications. WSs are provided to client application developers through web Application Programming Interfaces (APIs), such as YouTube API, Twitter API, Facebook API, etc. Due to the popularity of WSs, the developers frequently discuss various WSs-based application’ issues on online forums, such as Stack Overflow (SO). This study aims to highlight the problems faced by client developers in the development process of WSs-based applications using the dataset of SO. The comprehension of developers’ conversations on SO can give insight into the frequency, difficulty, and popularity of different WSs-related problems of developers. We downloaded 12,746 posts from SO relevant to WSs-related issues for this article. We used the topic modeling technique (LDA) to extract various topics from the SO dataset. The topics are labeled and organized into categories and sub-categories according to relationships among them. The difficulty and popularity of each topic have been analyzed. Our investigation yield several findings. First, developers focus on six topics related to WSs on SO: Client APIs development, Data Processing, Web services Authorization, Framework Support, Web APIs, and Mobile Applications. Secondly, the advantages and disadvantages of web applications topic (Fused_Popularity=0.39), from the Clients APIs development category have the highest prevalence, followed by Database (DB) and Data Processing in Applications topic (Fused_Popularity=0.38) from the Data Processing category. Third, most WSs-related topics in all categories are evolving promptly on SO, i.e., new questions are added daily about WSs development, deployment, and authorization. Fourth, the questions of type “how” are primarily asked in Framework support, Client APIs development, and Web APIs categories. Although, many questions in other categories are of the kind “What”. It is also observed that WSs developers not only used SO to ask How and What types of questions but they also used SO to ask information-seeking questions (i.e., in Data processing and Client APIs development categories). Fifth, the topics relevant to Web APIs (Fused_Popularity=10.8) and Client API Development ((Fused_Popularity=9.35) categories of WSs are very popular on SO. Sixth, the questions relevant to the Web APIs (Fused_Difficulty =3) & Client APIs development (Fused_Difficulty=2.25) categories are more difficult than the other four categories. The results of our research may be helpful for the following WSs stakeholders: WSs Client application developers, WSs Educators, and WSs researchers. The WSs Educators and investigators can get more knowledge of new methods and discover novel techniques to make challenging WSs topics easy to understand. WSs framework developers can utilize our extracted WSs topics and categories to know the preferences of WSs developers that may support them in upgrading existing frameworks or developing new ones.
- Research Article
104
- 10.1007/s10664-018-9650-5
- Oct 1, 2018
- Empirical Software Engineering
Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of copyable code snippets. Using those snippets raises maintenance and legal issues. SO’s license (CC BY-SA 3.0) requires attribution, i.e., referencing the original question or answer, and requires derived work to adopt a compatible license. While there is a heated debate on SO’s license model for code snippets and the required attribution, little is known about the extent to which snippets are copied from SO without proper attribution. We present results of a large-scale empirical study analyzing the usage and attribution of non-trivial Java code snippets from SO answers in public GitHub (GH) projects. We followed three different approaches to triangulate an estimate for the ratio of unattributed usages and conducted two online surveys with software developers to complement our results. For the different sets of projects that we analyzed, the ratio of projects containing files with a reference to SO varied between 3.3% and 11.9%. We found that at most 1.8% of all analyzed repositories containing code from SO used the code in a way compatible with CC BY-SA 3.0. Moreover, we estimate that at most a quarter of the copied code snippets from SO are attributed as required. Of the surveyed developers, almost one half admitted copying code from SO without attribution and about two thirds were not aware of the license of SO code snippets and its implications.
- Dissertation
1
- 10.32657/10356/75873
- Jan 1, 2018
With software penetrating into all kinds of traditional or emerging industries, there is a great demand on software development. Faced with the fact that there is a limited number of developers, one important way to meet such urgent needs is to significantly improve developers’ productivity. As the most popular Q&A site, Stack Overflow has accumulated abundant software development knowledge. Effectively leveraging such a big data can help developers reuse the experience there to further improve their working efficiency. However, the rich yet unstructured large-scale data in Stack Overflow makes it difficult to search due to two reasons. First, there are too many questions and answers within the site, and there may be lingual gap (the same meaning can be written in different languages) between the query and content in Stack Overflow. In addition, the decay of information quality such as misspelling, inconsistency, and abuse of domain-specific abbreviations aggravates the search performance. Second, some higher-order knowledge in Stack Overflow is implicit for searching and it needs certain distillation from existing raw data. In this thesis, I present methods for supporting developers’ information search over Stack Overflow. To overcome the lexical gap and information decay, I also develop an edit recommendation tool to ensure the post quality of Stack Overflow so that posts can be more easily searched by the query. But such explicit information search still requires developers to read, understand and summarize, which is time-consuming. So I propose to shift from the document (information) search to entity (knowledge) search by mining the implicit knowledge from tags in Stack Overflow to render direct answers to developers instead of asking them to read lengthy documents. I first build a basic software-specific knowledge graph including thousands of software-engineering terms and their associations by association rule mining and community detection. Then, I enrich the knowledge graph with more fine-grained relationships i.e., analogy among different third-party libraries. Finally, I combine both semantic and lexical information to infer morphological forms of software terms so that the knowledge graph is more robust for knowledge search.
- Conference Article
6
- 10.1145/3475716.3475778
- Oct 11, 2021
Background The emergence of the COVID-19 pandemic has impacted all human activity, including software development. Early reports seem to indicate that the pandemic may have had a negative effect on software developers, socially and personally, but that their software development productivity may not have been negatively impacted. Aims: Early reports about the effects of the pandemic on software development focused on software developers' well-being and on their productivity as employees. We are interested in a different aspect of software development: the developers' public contributions, as seen in GitHub and Stack Overflow activities. Did the pandemic affect the developers' public contributions and, of so, in what way? Method: Considering the data from between 2017 and till 2020, we study the trends within GitHub's push, create, pull request, and release events, and within Stack Overflow's new users, posts, votes, and comments. We performed linear regressions, correlation analyses, outlier analyses, hypothesis testing, and we also contacted individual developers in order to gather qualitative insights about their unusual public contributions. Results: Our study shows that within GitHub and Stack Overflow, the onset of the pandemic (March/April 2020) is reflected in a set of outliers in developers' contributions that point to an increase in activity. The distributions of contributions during the entire year of 2020 were, in some aspects, different, but, in other aspects, similar from the recent past. Additionally, we found one noticeably disrupted pattern of contribution in Stack Overflow, namely the ratio Questions/Answers, which was much higher in 2020 than before. Testimonials from the developers we contacted were mixed: while some developers reported that their increase in activity was due to the pandemic, others reported that it was not. Conclusion: In Github, there was a noticeable increase in public software development activity in 2020, as well as more abrupt changes in daily activities; in Stack Overflow, there was a noticeable increase in new users and new questions at the onset of the pandemic, and in the ratio of Questions/Answers during 2020. The results may be attributed to the pandemic, but other factors could have come into play.
- Research Article
10
- 10.1016/j.jss.2024.111964
- Jan 8, 2024
- Journal of Systems and Software
An empirical study of code reuse between GitHub and stack overflow during software development
- Conference Article
270
- 10.1109/sp.2017.31
- May 1, 2017
S.121-136
- Research Article
24
- 10.1016/j.jss.2022.111427
- Jun 28, 2022
- Journal of Systems and Software
Impact of individualism and collectivism cultural profiles on the behaviour of software developers: A study of stack overflow
- Conference Article
26
- 10.1109/icpc52881.2021.00015
- May 1, 2021
A fast and effective approach to obtain information regarding software development problems is to search them to find similar solved problems or post questions on community question answering (CQA) websites. Solving coding problems in a short time is important, so these CQAs have a considerable impact on the software development process. However, if developers do not get their expected answers, the websites will not be useful, and software development time will increase. Stack Overflow is the most popular CQA concerning programming problems. According to its rules, the only sign that shows a question poser has achieved the desired answer is the user's acceptance. In this paper, we investigate unresolved questions, without accepted answers, on Stack Overflow. The number of unresolved questions is increasing. As of August 2019, 47% of Stack Overflow questions were unresolved. In this study, we analyze the effectiveness of various features, including some novel features, to resolve a question. We do not use the features that contain information not present at the time of asking a question, such as answers. To evaluate our features, we deploy several predictive models trained on the features of 18 million questions to predict whether a question will get an accepted answer or not. The results of this study show a significant relationship between our proposed features and getting accepted answers. Finally, we introduce an online tool that predicts whether a question will get an accepted answer or not. Currently, Stack Overflow's users do not receive any feedback on their questions before asking them, so they could carelessly ask unclear, unreadable, or inappropriately tagged questions. By using this tool, they can modify their questions and tags to check the different results of the tool and deliberately improve their questions to get accepted answers.
- Research Article
6
- 10.1142/s0218194021500467
- Oct 1, 2021
- International Journal of Software Engineering and Knowledge Engineering
GitHub and Stack Overflow are often used together for software development. GH-SO users, who use both GitHub and Stack Overflow, contribute to the development of various software projects in GitHub and share their knowledge and experience on software development in Stack Overflow. To widely understand the interests and working habits of GH-SO users on software development, it is important to investigate how GH-SO users utilize GitHub and Stack Overflow. In this paper, we present an exploratory study on GitHub commit and Stack Overflow post activities of GH-SO users. Specifically, we investigate the working habits of GH-SO users on GitHub commit and Stack Overflow post activities. We randomly selected 19,756 of GH-SO users as our target sample and collected 2,819,483 and 2,147,317 of commit activity data and post activity data of the GH-SO users. We then categorized the collected commit and post activity datasets into specific categories on programming languages and statistically analyzed the categorized commit and post activity datasets. As the results of our analysis, we found the following: (1) The overall commit and post activities of the GH-SO users share some similarity. (2) The commit activities gradually change while the post activities drastically change over time. (3) The commit activities of the GH-SO users are broadly distributed while the post activities are narrowly distributed and the commit activity can be better predictor for post activity. (4) The commit activity of the GH-SO users tends to be performed prior post activity. We believe that our findings can contribute to finding the ways to better support commit and post activities of GitHub and Stack Overflow users.