Studying Versioning in Stack Overflow
In Stack Overflow (SO), a post consists of multiple components: title, question, answers, question tags, and comments. Developers can create any of these components and make changes, which we call 'edits'. Edits are an important aspect of QA websites to ensure the quality and correctness of the texts. We performed multiple analyses on the revision history of 23 million SO posts from 2008 to 2023, and we gain a more comprehensive understanding of developers' content maintenance behaviors which lay the foundation for further research.
- Research Article
7
- 10.1007/s10664-021-10028-y
- Nov 3, 2021
- Empirical Software Engineering
On Stack Overflow, users reuse 11,926,354 external links to share the resources hosted outside the Stack Overflow website. The external links connect to the existing programming-related knowledge and extend the crowdsourced knowledge on Stack Overflow. Some of the external links, so-called as repeated external links, can be shared for multiple times. We observe that 82.5% of the link sharing activities (i.e., sharing links in any question, answer, or comment) on Stack Overflow share external resources, and 57.0% of the occurrences of the external links are sharing the repeated external links. However, it is still unclear what types of external resources are repeatedly shared. To help users manage their knowledge, we wish to investigate the characteristics of the repeated external links in knowledge sharing on Stack Overflow. In this paper, we analyze the repeated external links on Stack Overflow. We observe that external links that point to the text resources (hosted in documentation websites, tutorial websites, etc.) are repeatedly shared the most. We observe that: 1) different users repeatedly share the same knowledge in the form of repeated external links, thus increasing the maintenance effort of knowledge (e.g., update invalid links in multiple posts), 2) the same users can repeatedly share the external links for the purpose of promotion, and 3) external links can point to webpages with an overload of information that is difficult for users to retrieve relevant information. Our findings provide insights to Stack Overflow moderators and researchers. For example, we encourage Stack Overflow to centrally manage the commonly occurring knowledge in the form of repeated external links in order to better maintain the crowdsourced knowledge on Stack Overflow.
- Conference Article
88
- 10.1109/saner.2017.7884629
- Feb 1, 2017
Developers use Question and Answer (Q&A) websites to exchange knowledge and expertise. Stack Overflow is a popular Q&A website where developers discuss coding problems and share code examples. Although all Stack Overflow posts are free to access, code examples on Stack Overflow are governed by the Creative Commons Attribute-ShareAlike 3.0 Unported license that developers should obey when reusing code from Stack Overflow or posting code to Stack Overflow. In this paper, we conduct a case study with 399 Android apps, to investigate whether developers respect license terms when reusing code from Stack Overflow posts (and the other way around). We found 232 code snippets in 62 Android apps from our dataset that were potentially reused from Stack Overflow, and 1,226 Stack Overflow posts containing code examples that are clones of code released in 68 Android apps, suggesting that developers may have copied the code of these apps to answer Stack Overflow questions. We investigated the licenses of these pieces of code and observed 1,279 cases of potential license violations (related to code posting to Stack overflow or code reuse from Stack overflow). This paper aims to raise the awareness of the software engineering community about potential unethical code reuse activities taking place on Q&A websites like Stack Overflow.
- Conference Article
24
- 10.1109/aswec.2018.00027
- Nov 1, 2018
Solutions provided in Question and Answer (Q&A) websites such as Stack Overflow are regularly used in Open Source Software (OSS). However, many developers are unaware that both Stack Overflow and OSS are governed by licenses. Hence, developers reusing code from Stack Overflow for their OSS projects may violate licensing agreements if their attributions are not correct. Additionally, if code migrates from one OSS through Stack Overflow to another OSS, then complex licensing issues are likely to exist. Such forms of software reuse also have implications for future software maintenance, particularly where developers have poor understanding of copied code. This paper investigates code reuse between these two platforms (i.e., Stack Overflow and OSS), with the aim of providing insights into this issue. This study mined 151,946 Java code snippets from Stack Overflow, 16,617 Java files from 12 of the top weekly listed projects on SourceForge and GitHub, and 39,616 Java files from the top 20 most popular Java projects on SourceForge. Our analyses were aimed at finding the number of clones (indicating reuse) (a) within Stack Overflow posts, (b) between Stack Overflow and popular Java OSS projects, and (c) between the projects. Outcomes reveal that there was up to 3.3% code reuse within Stack Overflow, while 1.0% of Stack Overflow code was reused in recent popular Java projects and 2.3% in those projects that were more established. Reuse across projects was much higher, accounting for as much as 77.2%. Our outcomes have implication for strategies aimed at introducing strict quality assurance measures to ensure the appropriateness of code reuse, and licensing requirements awareness.
- Research Article
- 10.5281/zenodo.4683732
- Oct 10, 2020
- arXiv (Cornell University)
This is the dataset, coding guides, and scripts for our paper: Broken external links on Stack Overflow.
- Conference Article
15
- 10.1145/3468264.3468582
- Aug 18, 2021
To solve programming issues, developers commonly search on Stack Overflow to seek potential solutions. However, there is a gap between the knowledge developers are interested in and the knowledge they are able to retrieve using search engines. To help developers efficiently retrieve relevant knowledge on Stack Overflow, prior studies proposed several techniques to reformulate queries and generate summarized answers. However, few studies performed a large-scale analysis using real-world search logs. In this paper, we characterize how developers search on Stack Overflow using such logs. By doing so, we identify the challenges developers face when searching on Stack Overflow and seek opportunities for the platform and researchers to help developers efficiently retrieve knowledge. To characterize search activities on Stack Overflow, we use search log data based on requests to Stack Overflow's web servers. We find that the most common search activity is reformulating the immediately preceding queries. Related work looked into query reformulations when using generic search engines and found 13 types of query reformulation strategies. Compared to their results, we observe that 71.78% of the reformulations can be fitted into those reformulation strategies. In terms of how queries are structured, 17.41% of the search sessions only search for fragments of source code artifacts (e.g., class and method names) without specifying the names of programming languages, libraries, or frameworks. Based on our findings, we provide actionable suggestions for Stack Overflow moderators and outline directions for future research. For example, we encourage Stack Overflow to set up a database that includes the relations between all computer programming terminologies shared on Stack Overflow, e.g., method name, data structure name, design pattern, and IDE name. By doing so, Stack Overflow could improve the performance of search engines by considering related programming terminologies at different levels of granularity.
- Research Article
63
- 10.1007/s10664-016-9430-z
- Apr 19, 2016
- Empirical Software Engineering
Programming-specific Q&A sites (e.g., Stack Overflow) are being used extensively by software developers for knowledge sharing and acquisition. Due to the cross-reference of questions and answers (note that users also reference URLs external to the Q&A site. In this paper, URL sharing refers to internal URLs within the Q&A site, unless otherwise stated), knowledge is diffused in the Q&A site, forming a large knowledge network. In Stack Overflow, why do developers share URLs? How is the community feedback to the knowledge being shared? What are the unique topological and semantic properties of the resulting knowledge network in Stack Overflow? Has this knowledge network become stable? If so, how does it reach to stability? Answering these questions can help the software engineering community better understand the knowledge diffusion process in programming-specific Q&A sites like Stack Overflow, thereby enabling more effective knowledge sharing, knowledge use, and knowledge representation and search in the community. Previous work has focused on analyzing user activities in Q&A sites or mining the textual content of these sites. In this article, we present a methodology to analyze URL sharing activities in Stack Overflow. We use open coding method to analyze why users share URLs in Stack Overflow, and develop a set of quantitative analysis methods to study the structural and dynamic properties of the emergent knowledge network in Stack Overflow. We also identify system designs, community norms, and social behavior theories that help explain our empirical findings. Through this study, we obtain an in-depth understanding of the knowledge diffusion process in Stack Overflow and expose the implications of URL sharing behavior for Q&A site design, developers who use crowdsourced knowledge in Stack Overflow, and future research on knowledge representation and search.
- Research Article
16
- 10.1109/access.2023.3238813
- Jan 1, 2023
- IEEE Access
Web Services (WSs) are gaining worldwide popularity due to reliable and fast intercommunication services for the development of web and mobile applications. WSs are provided to client application developers through web Application Programming Interfaces (APIs), such as YouTube API, Twitter API, Facebook API, etc. Due to the popularity of WSs, the developers frequently discuss various WSs-based application’ issues on online forums, such as Stack Overflow (SO). This study aims to highlight the problems faced by client developers in the development process of WSs-based applications using the dataset of SO. The comprehension of developers’ conversations on SO can give insight into the frequency, difficulty, and popularity of different WSs-related problems of developers. We downloaded 12,746 posts from SO relevant to WSs-related issues for this article. We used the topic modeling technique (LDA) to extract various topics from the SO dataset. The topics are labeled and organized into categories and sub-categories according to relationships among them. The difficulty and popularity of each topic have been analyzed. Our investigation yield several findings. First, developers focus on six topics related to WSs on SO: Client APIs development, Data Processing, Web services Authorization, Framework Support, Web APIs, and Mobile Applications. Secondly, the advantages and disadvantages of web applications topic (Fused_Popularity=0.39), from the Clients APIs development category have the highest prevalence, followed by Database (DB) and Data Processing in Applications topic (Fused_Popularity=0.38) from the Data Processing category. Third, most WSs-related topics in all categories are evolving promptly on SO, i.e., new questions are added daily about WSs development, deployment, and authorization. Fourth, the questions of type “how” are primarily asked in Framework support, Client APIs development, and Web APIs categories. Although, many questions in other categories are of the kind “What”. It is also observed that WSs developers not only used SO to ask How and What types of questions but they also used SO to ask information-seeking questions (i.e., in Data processing and Client APIs development categories). Fifth, the topics relevant to Web APIs (Fused_Popularity=10.8) and Client API Development ((Fused_Popularity=9.35) categories of WSs are very popular on SO. Sixth, the questions relevant to the Web APIs (Fused_Difficulty =3) & Client APIs development (Fused_Difficulty=2.25) categories are more difficult than the other four categories. The results of our research may be helpful for the following WSs stakeholders: WSs Client application developers, WSs Educators, and WSs researchers. The WSs Educators and investigators can get more knowledge of new methods and discover novel techniques to make challenging WSs topics easy to understand. WSs framework developers can utilize our extracted WSs topics and categories to know the preferences of WSs developers that may support them in upgrading existing frameworks or developing new ones.
- Research Article
101
- 10.1007/s10664-018-9650-5
- Oct 1, 2018
- Empirical Software Engineering
Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of copyable code snippets. Using those snippets raises maintenance and legal issues. SO’s license (CC BY-SA 3.0) requires attribution, i.e., referencing the original question or answer, and requires derived work to adopt a compatible license. While there is a heated debate on SO’s license model for code snippets and the required attribution, little is known about the extent to which snippets are copied from SO without proper attribution. We present results of a large-scale empirical study analyzing the usage and attribution of non-trivial Java code snippets from SO answers in public GitHub (GH) projects. We followed three different approaches to triangulate an estimate for the ratio of unattributed usages and conducted two online surveys with software developers to complement our results. For the different sets of projects that we analyzed, the ratio of projects containing files with a reference to SO varied between 3.3% and 11.9%. We found that at most 1.8% of all analyzed repositories containing code from SO used the code in a way compatible with CC BY-SA 3.0. Moreover, we estimate that at most a quarter of the copied code snippets from SO are attributed as required. Of the surveyed developers, almost one half admitted copying code from SO without attribution and about two thirds were not aware of the license of SO code snippets and its implications.
- Research Article
59
- 10.1108/dta-07-2017-0054
- Feb 9, 2018
- Data Technologies and Applications
PurposeSoftware developers extensively use stack overflow (SO) for knowledge sharing on software development. Thus, software engineering researchers have started mining the structured/unstructured data present in certain software repositories including the Q&A software developer community SO, with the aim to improve software development. The purpose of this paper is show that how academics/practitioners can get benefit from the valuable user-generated content shared on various online social networks, specifically from Q&A community SO for software development.Design/methodology/approachA comprehensive literature review was conducted and 166 research papers on SO were categorized about software development from the inception of SO till June 2016.FindingsMost of the studies revolve around a limited number of software development tasks; approximately 70 percent of the papers used millions of posts data, applied basic machine learning methods, and conducted investigations semi-automatically and quantitative studies. Thus, future research should focus on the overcoming existing identified challenges and gaps.Practical implicationsThe work on SO is classified into two main categories; “SO design and usage” and “SO content applications.” These categories not only give insights to Q&A forum providers about the shortcomings in design and usage of such forums but also provide ways to overcome them in future. It also enables software developers to exploit such forums for the identified under-utilized tasks of software development.Originality/valueThe study is the first of its kind to explore the work on SO about software development and makes an original contribution by presenting a comprehensive review, design/usage shortcomings of Q&A sites, and future research challenges.
- Conference Article
134
- 10.1145/3196398.3196430
- May 28, 2018
Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and by collecting references from GitHub files to SO posts. In this paper, we describe how we built SOTorrent, and in particular how we evaluated 134 different string similarity metrics regarding their applicability for reconstructing the version history of text and code blocks. Based on a first analysis using the dataset, we present insights into the evolution of SO posts, e.g., that post edits are usually small, happen soon after the initial creation of the post, and that code is rarely changed without also updating the surrounding text. Further, our analysis revealed a close relationship between post edits and comments. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub.
- Conference Article
1
- 10.1109/qrs51102.2020.00066
- Dec 1, 2020
Stack Overflow (SO) is one of the most popular Q&A sites for not only providing valuable information to software developers but also encouraging the sharing of knowledge and problem solving. Unfortunately, the information provided by SO is not always sufficient for developers. In this paper, we empirically show how fast and effectively historical code changes can substitute for missing or unanswered SO articles. Developers in all around the world encounter many problems daily and their trial-and-error experiences to resolve the problems are accumulated in the code change history. The extracted source code differences are expected to provide valuable information to developers before the questions and answers are posted on SO. In our study, we focus on the usage of APIs as the topic of SO articles, because many developers are interested in API programming and suffer from the problems related to API usage. We extracted 4,780 code differences from 713 repositories of Android applications (F-Droid). As a result, we found that 64% of SO articles on Android APIs are related to code differences, whereas 44% of code differences are related to SO articles. Not a few code differences appear before the corresponding SO articles are actually posted. The median of time lag between the first appearance of code changes and the first actual SO postings is 22 months.
- Research Article
11
- 10.1109/tse.2021.3086494
- Sep 1, 2022
- IEEE Transactions on Software Engineering
Stack Overflow hosts valuable programming-related knowledge with 11,926,354 links that reference to the third-party websites. The links that reference to the resources hosted outside the Stack Overflow websites extend the Stack Overflow knowledge base substantially. However, with the rapid development of programming-related knowledge, many resources hosted on the Internet are not available anymore. Based on our analysis of the Stack Overflow data that was released on Jun. 2, 2019, 14.2% of the links on Stack Overflow are broken links. The broken links on Stack Overflow can obstruct viewers from obtaining desired programming-related knowledge, and potentially damage the reputation of the Stack Overflow as viewers might regard the posts with broken links as obsolete. In this paper, we characterize the broken links on Stack Overflow. 65% of the broken links in our sampled questions are used to show examples, e.g., code examples. 70% of the broken links in our sampled answers are used to provide supporting information, e.g., explaining a certain concept and describing a step to solve a problem. Only 1.67% of the posts with broken links are highlighted as such by viewers in the posts' comments. Only 5.8% of the posts with broken links removed the broken links. Viewers cannot fully rely on the vote scores to detect broken links, as broken links are common across posts with different vote scores. The websites that host resources that can be maintained by their users are referenced by broken links the most on Stack Overflow -- a prominent example of such websites is GitHub. The posts and comments related to the web technologies, i.e., JavaScript, HTML, CSS, and jQuery, are associated with more broken links. Based on our findings, we shed lights for future directions and provide recommendations for practitioners and researchers.
- Research Article
20
- 10.1109/tse.2020.2981898
- Feb 5, 2020
- IEEE Transactions on Software Engineering
Software engineering is knowledge-intensive and requires software developers to continually search for knowledge, often on community question answering platforms such as Stack Overflow. Such information sharing platforms do not exist in isolation, and part of the evidence that they exist in a broader software documentation ecosystem is the common presence of hyperlinks to other documentation resources found in forum posts. With the goal of helping to improve the information diffusion between Stack Overflow and other documentation resources, we conducted a study to answer the question of how and why documentation is referenced in Stack Overflow threads. We sampled and classified 759 links from two different domains, regular expressions and Android development, to qualitatively and quantitatively analyze the links’ context and purpose, including attribution, awareness, and recommendations. We found that links on Stack Overflow serve a wide range of distinct purposes, ranging from citation links attributing content copied into Stack Overflow, over links clarifying concepts using Wikipedia pages, to recommendations of software components and resources for background reading. This purpose spectrum has major corollaries, including our observation that links to documentation resources are a reflection of the information needs typical to a technology domain. We contribute a framework and method to analyze the context and purpose of Stack Overflow links, a public dataset of annotated links, and a description of five major observations about linking practices on Stack Overflow. Those observations include the above-mentioned purpose spectrum, its interplay with documentation resources and applications domains, and the fact that links on Stack Overflow often lack context in form of accompanying quotes or summaries. We further point to potential tool support to enhance the information diffusion between Stack Overflow and other documentation resources.
- Conference Article
19
- 10.1109/msr.2019.00047
- May 1, 2019
Stack Overflow (SO) is a popular Q&A forum for software developers, providing a large number of copyable code snippets. While GitHub is an independent code collaboration platform, developers often reuse SO code in their GitHub projects. In this paper, we investigate how often GitHub developers re-use code snippets from the SO forum, as well as what concepts they are more likely to reference in their code. To accomplish our goal, we mine SOTorrent dataset that provides connectivity between code snippets on the SO posts with software projects hosted on GitHub. We then study the characteristics of GitHub projects that reference SO posts and what popular SO discussions can be found in GitHub projects. Our results demonstrate that on average developers make 45 references to SO posts in their projects, with the highest number of references being made within the JavaScript code. We also found that 79% of the SO posts with code snippets that are referenced in GitHub code do change over time (at least ones) raising code maintainability and reliability concerns.
- Research Article
7
- 10.1007/s10664-022-10242-2
- Nov 24, 2022
- Empirical Software Engineering
The content quality of shared knowledge in Stack Overflow (SO) is crucial in supporting software developers with their programming problems. Thus, SO allows its users to suggest edits to improve the quality of a post (i.e., question and answer). However, existing research shows that many suggested edits in SO are rejected due to undesired contents/formats or violating edit guidelines. Such a scenario frustrates or demotivates users who would like to conduct good-quality edits. Therefore, our research focuses on assisting SO users by offering them suggestions on how to improve their editing of posts. First, we manually investigate 764 (382 questions + 382 answers) rejected edits by rollbacks and produce a catalog of 19 rejection reasons. Second, we extract 15 texts and user-based features to capture those rejection reasons. Third, we develop four machine learning models using those features. Our best-performing model can predict rejected edits with 69.1% precision, 71.2% recall, 70.1% F1-score, and 69.8% overall accuracy. Fourth, we introduce an online tool named EditEx that works with the SO edit system. EditEx can assist users while editing posts by suggesting the potential causes of rejections. We recruit 20 participants to assess the effectiveness of EditEx. Half of the participants (i.e., treatment group) use EditEx and another half (i.e., control group) use the SO standard edit system to edit posts. According to our experiment, EditEx can support SO standard edit system to prevent 49% of rejected edits, including the commonly rejected ones. However, it can prevent 12% rejections even in free-form regular edits. The treatment group finds the potential rejection reasons identified by EditEx influential. Furthermore, the median workload suggesting edits using EditEx is half compared to the SO edit system.