An exploratory study on the repeatedly shared external links on Stack Overflow
On Stack Overflow, users reuse 11,926,354 external links to share the resources hosted outside the Stack Overflow website. The external links connect to the existing programming-related knowledge and extend the crowdsourced knowledge on Stack Overflow. Some of the external links, so-called as repeated external links, can be shared for multiple times. We observe that 82.5% of the link sharing activities (i.e., sharing links in any question, answer, or comment) on Stack Overflow share external resources, and 57.0% of the occurrences of the external links are sharing the repeated external links. However, it is still unclear what types of external resources are repeatedly shared. To help users manage their knowledge, we wish to investigate the characteristics of the repeated external links in knowledge sharing on Stack Overflow. In this paper, we analyze the repeated external links on Stack Overflow. We observe that external links that point to the text resources (hosted in documentation websites, tutorial websites, etc.) are repeatedly shared the most. We observe that: 1) different users repeatedly share the same knowledge in the form of repeated external links, thus increasing the maintenance effort of knowledge (e.g., update invalid links in multiple posts), 2) the same users can repeatedly share the external links for the purpose of promotion, and 3) external links can point to webpages with an overload of information that is difficult for users to retrieve relevant information. Our findings provide insights to Stack Overflow moderators and researchers. For example, we encourage Stack Overflow to centrally manage the commonly occurring knowledge in the form of repeated external links in order to better maintain the crowdsourced knowledge on Stack Overflow.
- Research Article
63
- 10.1007/s10664-016-9430-z
- Apr 19, 2016
- Empirical Software Engineering
Programming-specific Q&A sites (e.g., Stack Overflow) are being used extensively by software developers for knowledge sharing and acquisition. Due to the cross-reference of questions and answers (note that users also reference URLs external to the Q&A site. In this paper, URL sharing refers to internal URLs within the Q&A site, unless otherwise stated), knowledge is diffused in the Q&A site, forming a large knowledge network. In Stack Overflow, why do developers share URLs? How is the community feedback to the knowledge being shared? What are the unique topological and semantic properties of the resulting knowledge network in Stack Overflow? Has this knowledge network become stable? If so, how does it reach to stability? Answering these questions can help the software engineering community better understand the knowledge diffusion process in programming-specific Q&A sites like Stack Overflow, thereby enabling more effective knowledge sharing, knowledge use, and knowledge representation and search in the community. Previous work has focused on analyzing user activities in Q&A sites or mining the textual content of these sites. In this article, we present a methodology to analyze URL sharing activities in Stack Overflow. We use open coding method to analyze why users share URLs in Stack Overflow, and develop a set of quantitative analysis methods to study the structural and dynamic properties of the emergent knowledge network in Stack Overflow. We also identify system designs, community norms, and social behavior theories that help explain our empirical findings. Through this study, we obtain an in-depth understanding of the knowledge diffusion process in Stack Overflow and expose the implications of URL sharing behavior for Q&A site design, developers who use crowdsourced knowledge in Stack Overflow, and future research on knowledge representation and search.
- Conference Article
11
- 10.5555/2820518.2820573
- May 16, 2015
Stack Overflow's primary goal is to serve as a platform for users to solicit answers regarding programming questions, though its archives are often used by other users who face similar issues and thus it serves a secondary purpose of documenting common problems. The two driving mechanisms for filtering out low quality posts and highlighting the best answers are community votes and the mark of acceptance by the original question asker. But does the asker's choice always match the popular vote? If so, is the asker's choice influenced by the community vote or is the community vote biased towards the accepted answer? And if the asker and community disagree, then can we determine any particular characteristics of posts that influence the choice of the asker and community differently, such as its size, readability, presence of code snippets and external links as well as similarity to the original question? In this paper, we explore the answers to these questions by studying a data-set of all posts on Stack Overflow from its launch in September 2008 to September 2014.
- Research Article
53
- 10.1109/tse.2019.2954319
- Dec 5, 2019
- IEEE Transactions on Software Engineering
Stack Overflow is one of the most active communities for developers to share their programming knowledge. Answers posted on Stack Overflow help developers solve issues during software development. In addition to posting answers, users can also post comments to further discuss their associated answers. As of Aug 2017, there are 32.3 million comments that are associated with answers, forming a large collection of crowdsourced repository of knowledge on top of the commonly-studied Stack Overflow answers. In this study, we wish to understand how the commenting activities contribute to the crowdsourced knowledge. We investigate what users discuss in comments, and analyze the characteristics of the commenting dynamics, (i.e., the timing of commenting activities and the roles of commenters). We find that: 1) the majority of comments are informative and thus can enhance their associated answers from a diverse range of perspectives. However, some comments contain content that is discouraged by Stack Overflow. 2) The majority of commenting activities occur after the acceptance of an answer. More than half of the comments are fast responses occurring within one day of the creation of an answer, while later comments tend to be more informative. Most comments are rarely integrated back into their associated answers, even though such comments are informative. 3) Insiders (i.e., users who posted questions/answers before posting a comment in a question thread) post the majority of comments within one month, and outsiders (i.e., users who never posted any question/answer before posting a comment) post the majority of comments after one month. Inexperienced users tend to raise limitations and concerns while experienced users tend to enhance the answer through commenting. Our study provides insights into the commenting activities in terms of their content, timing, and the individuals who perform the commenting. For the purpose of long-term knowledge maintenance and effective information retrieval for developers, we also provide actionable suggestions to encourage Stack Overflow users/engineers/moderators to leverage our insights for enhancing the current Stack Overflow commenting system for improving the maintenance and organization of the crowdsourced knowledge.
- Research Article
- 10.5281/zenodo.4683732
- Oct 10, 2020
- arXiv (Cornell University)
This is the dataset, coding guides, and scripts for our paper: Broken external links on Stack Overflow.
- Conference Article
314
- 10.1109/socialcom.2013.35
- Sep 1, 2013
Stack Overflow is a popular on-line programming question and answer community providing its participants with rapid access to knowledge and expertise of their peers, especially benefitting coders. Despite the popularity of Stack Overflow, its role in the work cycle of open-source developers is yet to be understood: on the one hand, participation in it has the potential to increase the knowledge of individual developers thus improving and speeding up the development process. On the other hand, participation in Stack Overflow may interrupt the regular working rhythm of the developer, hence also possibly slow down the development process. In this paper we investigate the interplay between Stack Overflow activities and the development process, reflected by code changes committed to the largest social coding repository, GitHub. Our study shows that active GitHub committers ask fewer questions and provide more answers than others. Moreover, we observe that active Stack Overflow askers distribute their work in a less uniform way than developers that do not ask questions. Finally, we show that despite the interruptions incurred, the Stack Overflow activity rate correlates with the code changing activity in GitHub.
- Conference Article
62
- 10.1109/scam.2015.7335404
- Sep 1, 2015
Recently, automatic code comment generation is proposed to facilitate program comprehension. Existing code comment generation techniques focus on describing the functionality of the source code. However, there are other aspects such as insights about quality or issues of the code, which are overlooked by earlier approaches. In this paper, we describe a mining approach that recommends insightful comments about the quality, deficiencies or scopes for further improvement of the source code. First, we conduct an exploratory study that motivates crowdsourced knowledge from Stack Overflow discussions as a potential resource for source code comment recommendation. Second, based on the findings from the exploratory study, we propose a heuristic-based technique for mining insightful comments from Stack Overflow Q & A site for source code comment recommendation. Experiments with 292 Stack Overflow code segments and 5,039 discussion comments show that our approach has a promising recall of 85.42%. We also conducted a complementary user study which confirms the accuracy and usefulness of the recommended comments.
- Research Article
10
- 10.1007/s11280-018-0621-y
- Jul 23, 2018
- World Wide Web
Software developers need access to correlated information (e.g., API documentation, Wikipedia pages, Stack Overflow questions and answers) which are often dispersed among different Web resources. This paper is concerned with the situation where a developer is visiting a Web page, but at the same time is willing to explore correlated Web resources to extend his/her knowledge or to satisfy his/her curiosity. Specifically, we present an item-based collaborative filtering technique, named LinkLive, for automatically recommending a list of correlated Web resources for a particular Web page. The recommendation is done by exploiting hyperlink associations from the crowdsourced knowledge on Stack Overflow. We motivate our research using an exploratory study of hyperlink dissemination patterns on Stack Overflow. We then present our LinkLive technique that uses multiple features, including hyperlink co-occurrences in Q&A discussions, locations (e.g., question, answer, or comment) in which hyperlinks are referenced, and votes for posts/comments in which hyperlinks are referenced. Experiments using 7 years of Stack Overflow data show that, our technique recommends correlated Web resources with promising accuracy in an open setting. A user study of 6 participants suggests that practitioners find the recommended Web resources useful for Web discovery.
- Conference Article
57
- 10.1109/icsme.2018.00057
- Sep 1, 2018
Software developers frequently issue generic natural language queries for code search while using code search engines (e.g., GitHub native search, Krugle). Such queries often do not lead to any relevant results due to vocabulary mismatch problems. In this paper, we propose a novel technique that automatically identifies relevant and specific API classes from Stack Overflow Q & A site for a programming task written as a natural language query, and then reformulates the query for improved code search. We first collect candidate API classes from Stack Overflow using pseudo-relevance feedback and two term weighting algorithms, and then rank the candidates using Borda count and semantic proximity between query keywords and the API classes. The semantic proximity has been determined by an analysis of 1.3 million questions and answers of Stack Overflow. Experiments using 310 code search queries report that our technique suggests relevant API classes with 48% precision and 58% recall which are 32% and 48% higher respectively than those of the state-of-the-art. Comparisons with two state-of-the-art studies and three popular search engines (e.g., Google, Stack Overflow, and GitHub native search) report that our reformulated queries (1) outperform the queries of the state-of-the-art, and (2) significantly improve the code search results provided by these contemporary search engines.
- Conference Article
21
- 10.1145/2961111.2962585
- Sep 8, 2016
Context: Recently, more and more developer communities are abandoning their legacy support forums, moving onto Stack Overflow. The motivations are diverse, yet they typically include achieving faster response time and larger visibility through the access to a modern and very successful infrastructure. One downside of migration, however, is that the history and the crowdsourced knowledge hosted at previous sites remain separated or even get lost if a community decides to abandon completely the legacy developer forum.Goal: Adding to the body of evidence of existing research on best-answer prediction, here we show that, from a technical perspective, the content from existing developer forums might be automatically migrated to the Stack Overflow, although most of forums do not allow to mark a question as resolved, a distinctive feature of modern Q&A sites.Method: We trained a binary classifier with data from Stack Overflow and then tested it with data scraped from Docusign, a developer forum that has recently completed the move.Results: Our findings show that best answers can be predicted with a good accuracy, only relying on shallow linguistic (text) features, such as answer length and the number of sentences, combined with other features like answer upvotes and age, which can be easily computed in near real-time.Conclusions: Results provide an initial yet positive evidence towards the automatic migration of crowdsourced knowledge from legacy forums to modern Q&A sites.
- Research Article
11
- 10.1016/j.jss.2023.111608
- Jan 5, 2023
- Journal of Systems and Software
Characterizing architecture related posts and their usefulness in Stack Overflow
- Research Article
47
- 10.1109/tse.2021.3058985
- Feb 23, 2021
- IEEE Transactions on Software Engineering
Stack Overflow hosts millions of solutions that aim to solve developers' programming issues. In this crowdsourced question answering process, Stack Overflow becomes a code hosting website where developers actively share its code. However, code snippets on Stack Overflow may contain security vulnerabilities, and if shared carelessly, such snippets can introduce security problems in software systems. In this paper, we empirically study the prevalence of the <i>Common Weakness Enumeration</i> – CWE, in code snippets of C/C++ related answers. We explore the characteristics of <inline-formula><tex-math notation="LaTeX">$Code_w$</tex-math></inline-formula> , i.e., code snippets that have CWE instances, in terms of the types of weaknesses, the evolution of <inline-formula><tex-math notation="LaTeX">$Code_w$</tex-math></inline-formula> , and who contributed such code snippets. We find that: 1) 36 percent (i.e., 32 out of 89) CWE types are detected in <inline-formula><tex-math notation="LaTeX">$Code_w$</tex-math></inline-formula> on Stack Overflow. Particularly, CWE-119, i.e., <i>improper restriction of operations within the bounds of a memory buffer</i> , is common in both answer code snippets and real-world software systems. Furthermore, the proportion of <inline-formula><tex-math notation="LaTeX">$Code_w$</tex-math></inline-formula> doubled from 2008 to 2018 after normalizing by the total number of C/C++ snippets in each year. 2) In general, code revisions are associated with a reduction in the number of code weaknesses. However, the majority of <inline-formula><tex-math notation="LaTeX">$Code_w$</tex-math></inline-formula> had weaknesses introduced in the first version of the code, and these <inline-formula><tex-math notation="LaTeX">$Code_w$</tex-math></inline-formula> were never revised since then. Only 7.5 percent of users who contributed C/C++ code snippets posted or edited code with weaknesses. Users contributed less code with CWE weakness when they were more active (i.e., they either revised more code snippets or had a higher reputation). We also find that some users tended to have the same CWE type repeatedly in their various code snippets. Our empirical study provides insights to users who share code snippets on Stack Overflow so that they are aware of the potential security issues. To understand the community feedback about improving code weaknesses by answer revisions, we also conduct a qualitative study and find that 62.5 percent of our suggested revisions are adopted by the community. Stack Overflow can perform CWE scanning for all the code that is hosted on its platform. Further research is needed to improve the quality of the crowdsourced knowledge on Stack Overflow.
- Research Article
141
- 10.1016/j.infsof.2017.10.009
- Nov 6, 2017
- Information and Software Technology
How to ask for technical help? Evidence-based guidelines for writing questions on Stack Overflow
- Research Article
55
- 10.1007/s10664-018-9671-0
- Jan 21, 2019
- Empirical Software Engineering
Traditional code search engines (e.g., Krugle) often do not perform well with natural language queries. They mostly apply keyword matching between query and source code. Hence, they need carefully designed queries containing references to relevant APIs for the code search. Unfortunately, preparing an effective search query is not only challenging but also time-consuming for the developers according to existing studies. In this article, we propose a novel query reformulation technique–RACK–that suggests a list of relevant API classes for a natural language query intended for code search. Our technique offers such suggestions by exploiting keyword-API associations from the questions and answers of Stack Overflow (i.e., crowdsourced knowledge). We first motivate our idea using an exploratory study with 19 standard Java API packages and 344K Java related posts from Stack Overflow. Experiments using 175 code search queries randomly chosen from three Java tutorial sites show that our technique recommends correct API classes within the Top-10 results for 83% of the queries, with 46% mean average precision and 54% recall, which are 66%, 79% and 87% higher respectively than that of the state-of-the-art. Reformulations using our suggested API classes improve 64% of the natural language queries and their overall accuracy improves by 19%. Comparisons with three state-of-the-art techniques demonstrate that RACK outperforms them in the query reformulation by a statistically significant margin. Investigation using three web/code search engines shows that our technique can significantly improve their results in the context of code search.
- Conference Article
63
- 10.1109/icse.2019.00065
- May 1, 2019
Stack Overflow (SO) is the most popular online Q&A site for developers to share their expertise in solving programming issues. Given multiple answers to certain questions, developers may take the accepted answer, the answer from a person with high reputation, or the one frequently suggested. However, researchers recently observed exploitable security vulnerabilities in popular SO answers. This observation inspires us to explore the following questions: How much can we trust the security implementation suggestions on SO? If suggested answers are vulnerable, can developers rely on the community's dynamics to infer the vulnerability and identify a secure counterpart? To answer these highly important questions, we conducted a study on SO posts by contrasting secure and insecure advices with the community-given content evaluation. We investigated whether SO incentive mechanism is effective in improving security properties of distributed code examples. Moreover, we also traced duplicated answers to assess whether the community behavior facilitates propagation of secure and insecure code suggestions. We compiled 953 different groups of similar security-related code examples and labeled their security, identifying 785 secure answer posts and 644 insecure ones. Compared with secure suggestions, insecure ones had higher view counts (36,508 vs. 18,713), received a higher score (14 vs. 5), and had significantly more duplicates (3.8 vs. 3.0) on average. 34% of the posts provided by highly reputable so-called trusted users were insecure. Our findings show that there are lots of insecure snippets on SO, while the community-given feedback does not allow differentiating secure from insecure choices. Moreover, the reputation mechanism fails in indicating trustworthy users with respect to security questions, ultimately leaving other users wandering around alone in a software security minefield.
- Research Article
136
- 10.1007/s10664-018-9634-5
- Jul 4, 2018
- Empirical Software Engineering
Technical question and answer Q&A platforms, such as Stack Overflow, provide a platform for users to ask and answer questions about a wide variety of programming topics. These platforms accumulate a large amount of knowledge, including hundreds of thousands lines of source code. Developers can benefit from the source code that is attached to the questions and answers on Q&A platforms by copying or learning from (parts of) it. By understanding how developers utilize source code from Q&A platforms, we can provide insights for researchers which can be used to improve next-generation Q&A platforms to help developers reuse source code fast and easily. In this paper, we first conduct an exploratory study on 289 files from 182 open-source projects, which contain source code that has an explicit reference to a Stack Overflow post. Our goal is to understand how developers utilize code from Q&A platforms and to reveal barriers that may make code reuse more difficult. In 31.5% of the studied files, developers needed to modify source code from Stack Overflow to make it work in their own projects. The degree of required modification varied from simply renaming variables to rewriting the whole algorithm. Developers sometimes chose to implement an algorithm from scratch based on the descriptions from Stack Overflow answers, even if there was an implementation readily available in the post. In 35.5% of the studied files, developers used Stack Overflow posts as an information source for later reference. To further understand the barriers of reusing code and to obtain suggestions for improving the code reuse process on Q&A platforms, we conducted a survey with 453 open-source developers who are also on Stack Overflow. We found that the top 3 barriers that make it difficult for developers to reuse code from Stack Overflow are: (1) too much code modification required to fit in their projects, (2) incomprehensive code, and (3) low code quality. We summarized and analyzed all survey responses and we identified that developers suggest improvements for future Q&A platforms along the following dimensions: code quality, information enhancement & management, data organization, license, and the human factor. For instance, developers suggest to improve the code quality by adding an integrated validator that can test source code online, and an outdated code detection mechanism. Our findings can be used as a roadmap for researchers and developers to improve code reuse.