Contextual Documentation Referencing on Stack Overflow
Software engineering is knowledge-intensive and requires software developers to continually search for knowledge, often on community question answering platforms such as Stack Overflow. Such information sharing platforms do not exist in isolation, and part of the evidence that they exist in a broader software documentation ecosystem is the common presence of hyperlinks to other documentation resources found in forum posts. With the goal of helping to improve the information diffusion between Stack Overflow and other documentation resources, we conducted a study to answer the question of how and why documentation is referenced in Stack Overflow threads. We sampled and classified 759 links from two different domains, regular expressions and Android development, to qualitatively and quantitatively analyze the links’ context and purpose, including attribution, awareness, and recommendations. We found that links on Stack Overflow serve a wide range of distinct purposes, from citation links attributing content copied into Stack Overflow, through links clarifying concepts using Wikipedia pages, to recommendations of software components and resources for background reading. This purpose spectrum has major corollaries, including our observation that links to documentation resources are a reflection of the information needs typical to a technology domain. We contribute a framework and method to analyze the context and purpose of Stack Overflow links, a public dataset of annotated links, and a description of five major observations about linking practices on Stack Overflow. Those observations include the above-mentioned purpose spectrum, its interplay with documentation resources and application domains, and the fact that links on Stack Overflow often lack context in the form of accompanying quotes or summaries. We further point to potential tool support to enhance the information diffusion between Stack Overflow and other documentation resources.
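As a rough illustration of the first step such a link study needs, the sketch below pulls hyperlinks and their anchor text out of a post body and tallies the linked domains. It is not the authors' framework: the sample post, the `LinkCollector` class, and the helper names are hypothetical, and judging a link's purpose (attribution, awareness, recommendation) would still be a manual step.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag together with its anchor text."""

    def __init__(self):
        super().__init__()
        self.links = []      # list of (url, anchor_text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(post_html):
    """Return (url, anchor_text) pairs found in an HTML post body."""
    parser = LinkCollector()
    parser.feed(post_html)
    parser.close()
    return parser.links


def count_domains(links):
    """Count how often each host name is linked."""
    return Counter(urlparse(url).netloc.lower() for url, _ in links)


# Hypothetical post body, made up for illustration only.
SAMPLE_POST = (
    '<p>See the <a href="https://developer.android.com/guide/components/activities">'
    "Activity guide</a> and the article on "
    '<a href="https://en.wikipedia.org/wiki/Regular_expression">regular expressions</a>.</p>'
)

links = extract_links(SAMPLE_POST)
domains = count_domains(links)
```

Keeping the anchor text next to each URL matters here because the study's observation about missing context (quotes or summaries) can only be checked against the text surrounding a link, not against the URL alone.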
- Research Article
8
- 10.1109/te.2021.3123889
- Aug 1, 2022
- IEEE Transactions on Education
Contribution: Determine and analyze the gap between software practitioners' education outlined in the 2014 IEEE/ACM Software Engineering Education Knowledge (SEEK) and industrial needs indicated by Wikipedia articles referenced in Stack Overflow (SO) posts. Background: Previous work has uncovered deficiencies in the coverage of computer fundamentals, people skills, software processes, and human-computer interaction, suggesting rebalancing. Research Questions: 1) To what extent are developers' needs, in terms of Wikipedia articles referenced in SO posts, covered by the SEEK knowledge units? 2) How does the popularity of Wikipedia articles relate to their SEEK coverage? 3) What areas of computing knowledge can be better covered by the SEEK knowledge units? 4) Why are Wikipedia articles covered by the SEEK knowledge units cited on SO? Methodology: Wikipedia articles were systematically collected from SO posts. The most cited were manually mapped to the SEEK knowledge units and assessed according to their degree of coverage. Articles insufficiently covered by the SEEK were classified by hand using the 2012 ACM Computing Classification System. A sample of posts referencing sufficiently covered articles was manually analyzed. A survey of software practitioners was conducted to validate the study findings. Findings: SEEK appears to sufficiently cover computer science fundamentals, software design, and mathematical concepts, but less so areas like the World Wide Web, software engineering components, and computer graphics. Developers seek advice, best practices and explanations about software topics, and code review assistance. Future SEEK models and computing education could dive deeper into information systems, design, testing, security, and soft skills.
- Research Article
77
- 10.1007/s10664-019-09758-x
- Aug 28, 2019
- Empirical Software Engineering
On question and answer sites, such as Stack Overflow (SO), developers use tags to label the content of a post and to support developers in question searching and browsing. However, these tags mainly refer to technological aspects instead of the purpose of the question. Tagging questions with their purpose can add a new dimension to the identification of discussed topics in posts on SO. In this paper, we aim at automating the classification of SO question posts into seven question categories. As a first step, we harmonized existing taxonomies of question categories and then manually classified 1,000 SO questions according to our new taxonomy. In addition to the question category, we marked the phrases that indicate a question category for each of the posts. We then use this data set to automate the classification of posts using two approaches. For the first approach, we manually analyzed the phrases to find patterns. Based on regular expressions, we implemented a classifier, for each of the categories, that determines whether a post belongs to a category. These regular expressions are derived by analyzing patterns in the phrases. In the second approach, we use the curated data set to train classification models of supervised machine learning algorithms (Random Forest and Support Vector Machines). For the machine learning algorithms, we experimented with 1,312 different configurations regarding the preprocessing of the text and the representation of the input data. Then, we compared the performance of the regex approach with the performance of the best configuration that uses machine learning algorithms on a validation set of 110 posts. The results show that using the regular expression approach, we can classify posts into the correct question category with an average precision and recall of 0.90, and an MCC of 0.68.
Additionally, we applied the regex approach on all questions of SO that deal with Android app development and investigated the co-occurrence of question categories in posts. We found that the categories API usage, Conceptual, and Discrepancy are the most frequently assigned question categories and that they also occur together frequently. Our approach can be used to support developers in browsing SO discussions or researchers in building recommender systems based on SO.
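The regex approach described above (one binary classifier per category, so a post can receive several categories) can be sketched roughly as follows. The three patterns are hypothetical stand-ins, not the expressions derived in the paper, and only three of the seven categories are shown. The `mcc` helper is the standard Matthews correlation coefficient for a binary confusion matrix, included because the abstract reports it.

```python
import math
import re

# Hypothetical indicator patterns, NOT the ones derived in the paper.
# Each category has its own pattern, so categories can co-occur in a post.
CATEGORY_PATTERNS = {
    "API usage": re.compile(r"\bhow (?:do|can|to)\b", re.IGNORECASE),
    "Conceptual": re.compile(
        r"\bwhat is the difference\b|\bwhy (?:does|is)\b", re.IGNORECASE
    ),
    "Discrepancy": re.compile(
        r"\b(?:doesn't|does not|isn't|is not) work(?:ing)?\b|\bunexpected\b",
        re.IGNORECASE,
    ),
}


def classify(text):
    """Return the set of categories whose pattern occurs in the text."""
    return {
        name for name, pattern in CATEGORY_PATTERNS.items() if pattern.search(text)
    }


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient for one binary classifier."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Made-up question texts for illustration.
single = classify("What is the difference between Service and IntentService?")
multi = classify("How do I fix this? My adapter does not work after rotation.")
```

Because every category is tested independently, co-occurrence counts such as those reported for Android questions fall out directly from the returned sets.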
- Conference Article
274
- 10.1109/sp.2016.25
- May 1, 2016
Vulnerabilities in Android code -- including but not limited to insecure data storage, unprotected inter-component communication, broken TLS implementations, and violations of least privilege -- have enabled real-world privacy leaks and motivated research cataloguing their prevalence and impact. Researchers have speculated that appification promotes security problems, as it increasingly allows inexperienced laymen to develop complex and sensitive apps. Anecdotally, Internet resources such as Stack Overflow are blamed for promoting insecure solutions that are naively copy-pasted by inexperienced developers. In this paper, we for the first time systematically analyzed how the use of information resources impacts code security. We first surveyed 295 app developers who have published in the Google Play market concerning how they use resources to solve security-related problems. Based on the survey results, we conducted a lab study with 54 Android developers (students and professionals), in which participants wrote security- and privacy-relevant code under time constraints. The participants were assigned to one of four conditions: free choice of resources, Stack Overflow only, official Android documentation only, or books only. Those participants who were allowed to use only Stack Overflow produced significantly less secure code than those using the official Android documentation or books, while participants using the official Android documentation produced significantly less functional code than those using Stack Overflow. To assess the quality of Stack Overflow as a resource, we surveyed the 139 threads our participants accessed during the study, finding that only 25% of them were helpful in solving the assigned tasks and only 17% of them contained secure code snippets.
In order to obtain ground truth concerning the prevalence of the secure and insecure code our participants wrote in the lab study, we statically analyzed a random sample of 200,000 apps from Google Play, finding that 93.6% of the apps used at least one of the API calls our participants used during our study. We also found that many of the security errors made by our participants also appear in the wild, possibly also originating in the use of Stack Overflow to solve programming problems. Taken together, our results confirm that API documentation is secure but hard to use, while informal documentation such as Stack Overflow is more accessible but often leads to insecurity. Given time constraints and economic pressures, we can expect that Android developers will continue to choose those resources that are easiest to use; therefore, our results firmly establish the need for secure-but-usable documentation.
- Research Article
11
- 10.1016/j.jss.2023.111608
- Jan 5, 2023
- Journal of Systems and Software
Characterizing architecture related posts and their usefulness in Stack Overflow
- Research Article
4
- 10.1049/2023/6613434
- Jan 1, 2023
- IET Software
Mobile applications are continuously increasing in prevalence. One of the main challenges in mobile application development is creating cross‐platform applications. To facilitate developing cross‐platform applications, the software engineering community created several solutions, one of which is React Native (RN), which is a popular cross‐platform framework. The software engineering literature demonstrated the effectiveness of Stack Overflow (SO) in providing real‐world perspectives on a variety of technical subjects. Therefore, this study aims to gain a better understanding of the stance of RN on SO. We identified and analyzed 131,620 SO RN‐related questions. Moreover, we observed how the interest toward RN on SO evolves over time. Additionally, we utilized Latent Dirichlet Allocation (LDA) to identify RN‐related topics that are discussed within the questions. Afterward, we utilized a number of proxy measures to estimate the popularity and difficulty of these topics. The results revealed that interest toward RN on SO was generally increasing. Moreover, RN‐related questions revolve around six topics, with the topics of layout and navigation being the most popular and the topic of iOS issues being the most difficult. Software engineering researchers, practitioners, educators, and RN contributors may find the results of this study beneficial in guiding their future RN efforts.
- Research Article
141
- 10.1016/j.infsof.2017.10.009
- Nov 6, 2017
- Information and Software Technology
How to ask for technical help? Evidence-based guidelines for writing questions on Stack Overflow
- Conference Article
90
- 10.1109/saner.2017.7884629
- Feb 1, 2017
Developers use Question and Answer (Q&A) websites to exchange knowledge and expertise. Stack Overflow is a popular Q&A website where developers discuss coding problems and share code examples. Although all Stack Overflow posts are free to access, code examples on Stack Overflow are governed by the Creative Commons Attribution-ShareAlike 3.0 Unported license that developers should obey when reusing code from Stack Overflow or posting code to Stack Overflow. In this paper, we conduct a case study with 399 Android apps, to investigate whether developers respect license terms when reusing code from Stack Overflow posts (and the other way around). We found 232 code snippets in 62 Android apps from our dataset that were potentially reused from Stack Overflow, and 1,226 Stack Overflow posts containing code examples that are clones of code released in 68 Android apps, suggesting that developers may have copied the code of these apps to answer Stack Overflow questions. We investigated the licenses of these pieces of code and observed 1,279 cases of potential license violations (related to code posting to Stack Overflow or code reuse from Stack Overflow). This paper aims to raise the awareness of the software engineering community about potential unethical code reuse activities taking place on Q&A websites like Stack Overflow.
- Research Article
63
- 10.1007/s10664-016-9430-z
- Apr 19, 2016
- Empirical Software Engineering
Programming-specific Q&A sites (e.g., Stack Overflow) are being used extensively by software developers for knowledge sharing and acquisition. Due to the cross-referencing of questions and answers, knowledge is diffused in the Q&A site, forming a large knowledge network. (Users also reference URLs external to the Q&A site; in this paper, URL sharing refers to internal URLs within the Q&A site, unless otherwise stated.) In Stack Overflow, why do developers share URLs? How does the community respond to the knowledge being shared? What are the unique topological and semantic properties of the resulting knowledge network in Stack Overflow? Has this knowledge network become stable? If so, how does it reach stability? Answering these questions can help the software engineering community better understand the knowledge diffusion process in programming-specific Q&A sites like Stack Overflow, thereby enabling more effective knowledge sharing, knowledge use, and knowledge representation and search in the community. Previous work has focused on analyzing user activities in Q&A sites or mining the textual content of these sites. In this article, we present a methodology to analyze URL sharing activities in Stack Overflow. We use an open coding method to analyze why users share URLs in Stack Overflow, and develop a set of quantitative analysis methods to study the structural and dynamic properties of the emergent knowledge network in Stack Overflow. We also identify system designs, community norms, and social behavior theories that help explain our empirical findings. Through this study, we obtain an in-depth understanding of the knowledge diffusion process in Stack Overflow and expose the implications of URL sharing behavior for Q&A site design, developers who use crowdsourced knowledge in Stack Overflow, and future research on knowledge representation and search.
- Research Article
60
- 10.1108/dta-07-2017-0054
- Feb 9, 2018
- Data Technologies and Applications
Purpose: Software developers extensively use Stack Overflow (SO) for knowledge sharing on software development. Thus, software engineering researchers have started mining the structured/unstructured data present in certain software repositories, including the Q&A software developer community SO, with the aim of improving software development. The purpose of this paper is to show how academics/practitioners can benefit from the valuable user-generated content shared on various online social networks, specifically from the Q&A community SO, for software development. Design/methodology/approach: A comprehensive literature review was conducted, and 166 research papers on SO about software development, from the inception of SO until June 2016, were categorized. Findings: Most of the studies revolve around a limited number of software development tasks; approximately 70 percent of the papers used data from millions of posts, applied basic machine learning methods, and conducted semi-automatic, quantitative investigations. Thus, future research should focus on overcoming the identified challenges and gaps. Practical implications: The work on SO is classified into two main categories: “SO design and usage” and “SO content applications.” These categories not only give Q&A forum providers insights into the shortcomings in the design and usage of such forums but also provide ways to overcome them in the future. They also enable software developers to exploit such forums for the identified under-utilized tasks of software development. Originality/value: The study is the first of its kind to explore the work on SO about software development and makes an original contribution by presenting a comprehensive review, design/usage shortcomings of Q&A sites, and future research challenges.
- Research Article
1
- 10.1142/s0218194023500274
- Jun 26, 2023
- International Journal of Software Engineering and Knowledge Engineering
Stack Overflow is a Q&A website that is popular among developers and extensively used in software engineering (SE) research. A significant body of research has examined how Stack Overflow can assist with software development tasks, such as recommending APIs. However, while researchers have recognized the importance of Stack Overflow in SE research related to software development tasks, the specific ways in which it is utilized and the reasons for its widespread usage in research have not been thoroughly explored. To address these knowledge gaps, we conducted the first study to understand the role of Stack Overflow in assisting with SE research regarding software development tasks by systematically examining relevant and high-quality research works. Meanwhile, we carried out a qualitative survey to gain insight into why researchers choose to utilize Stack Overflow in SE research and to solicit suggestions for the better use of Stack Overflow in research. The study identifies trends in the research area, prominent researchers and organizations, and the types of tasks that utilize Stack Overflow in research, with coding and debugging being the most common. Moreover, it examines how Stack Overflow data is utilized in SE research regarding software development tasks, including searching, training models, and mining associations. Our qualitative survey of researchers indicates that the popularity of Stack Overflow stems from its comprehensive explanations of technical topics that are often not found in documentation or manuals. The findings provide a comprehensive understanding of the role of Stack Overflow in SE research regarding software development tasks, and offer actionable implications for both researchers and stakeholders of Stack Overflow to facilitate future research and improvements.
- Conference Article
224
- 10.1109/msr.2013.6624015
- May 1, 2013
Community-based question answering services accumulate large volumes of knowledge through the voluntary services of people across the globe. Stack Overflow is an example of such a service that targets developers and software engineers. In general, questions in Stack Overflow are answered in a very short time. However, we found that the number of unanswered questions has increased significantly in the past two years. Understanding why questions remain unanswered can help information seekers improve the quality of their questions, increase their chances of getting answers, and better decide when to use Stack Overflow services. In this paper, we mine data on unanswered questions from Stack Overflow. We then conduct a qualitative study to categorize unanswered questions, which reveals characteristics that would be difficult to find otherwise. Finally, we conduct an experiment to determine whether we can predict how long a question will remain unanswered in Stack Overflow.
- Research Article
1
- 10.1049/sfw2/1905538
- Jan 1, 2024
- IET Software
Web‐based applications are popular in demand and usage. To facilitate the development of web‐based applications, the software engineering community developed multiple web application frameworks, one of which is Flask. Flask is a popular web framework that allows developers to speed up and scale the development of web applications. A review of the software engineering literature revealed that the Stack Overflow (SO) website has proven its effectiveness in providing a better understanding of multiple subjects within the software engineering field. This study aims to analyze SO Flask‐related questions to gain a better understanding of the stance of Flask on the website. We identified a set of 70,230 Flask‐related questions that we further analyzed to estimate how the interest towards the framework evolved over time on the website. Afterward, we utilized the Latent Dirichlet Allocation (LDA) algorithm to identify Flask‐related topics that are discussed within the set of the identified questions. Moreover, we leveraged a number of proxy measures to examine the difficulty and popularity of the identified topics. The study found that the interest towards Flask has been generally increasing on the website, with a peak in 2020 and drops in the following years. Moreover, Flask‐related questions on SO revolve around 12 topics, where Application Programming Interface (API) can be considered the most popular topic and background tasks can be considered the most difficult one. Software engineering researchers, practitioners, educators, and Flask contributors may find this study useful in guiding their future Flask‐related endeavors.
- Research Article
40
- 10.1016/j.infsof.2021.106667
- Nov 1, 2021
- Information and Software Technology
On the value of encouraging gender tolerance and inclusiveness in software engineering communities
- Conference Article
- 10.1145/3756681.3757002
- Jun 17, 2025
Community-driven forums like Stack Overflow (SO) have long established themselves as the go-to platform for developers seeking online help. Recently, ChatGPT, a powerful AI tool capable of generating high-level code and providing detailed explanations, has emerged as a strong alternative. While both platforms are valuable for developers, determining the best choice for specific use cases remains an open challenge. Although previous studies have examined the comparative merits of these platforms, the datasets used in such evaluations were limited. To bridge this gap, we introduce a four-dimensional benchmark dataset, ‘SEED’, that can facilitate a comprehensive analysis of ChatGPT and Stack Overflow. Our dataset comprises: (i) Developer Sentiments mined from 4161 comments from Reddit and SO meta-discussions, indicating community perceptions of both platforms, along with a manually labeled subset of 1,000 comments capturing developers’ expressed preferences; (ii) 3500 technical questions from SO, their accepted answers, and corresponding ChatGPT-generated responses for Efficacy (accuracy) benchmarking; (iii) An additional 200 deep learning-related SO posts, their accepted answers, and the corresponding ChatGPT answers to evaluate both these platforms on Energy efficiency parameters; (iv) 4,500 ChatGPT code snippets generated using tailor-made prompts designed to mimic SO answers for Detecting AI-code plagiarism. SEED can support diverse applications, including benchmarking AI-generated answers, evaluating energy efficiency in deep learning development, detecting AI plagiarism, and analyzing developer sentiment. By making this dataset publicly available, we lay the seed for advancing the research involving human-AI interaction in software engineering. Our dataset can be accessed at https://github.com/AnonymousResearch173/SEED.
- Conference Article
157
- 10.1145/2597008.2597155
- Jun 2, 2014
The growing number of questions related to mobile development in StackOverflow highlights an increasing interest of software developers in mobile programming. For the Android platform, 213,836 questions were tagged with Android-related labels in StackOverflow between July 2008 and August 2012. This paper aims at investigating how changes occurring to Android APIs trigger questions and activity in StackOverflow, and whether this is particularly true for certain kinds of changes. Our findings suggest that Android developers usually have more questions when the behavior of APIs is modified. In addition, deleting public methods from APIs is a trigger for questions that are (i) more discussed and of major interest for the community, and (ii) posted by more experienced developers. In general, results of this paper provide important insights about the use of social media to learn about changes in software ecosystems, and establish solid foundations for building new recommenders for notifying developers/managers about important changes and recommending them relevant crowdsourced solutions.