The structure and dynamics of knowledge network in domain-specific Q&A sites: a case study of stack overflow

  • Abstract
  • Literature Map
  • Similar Papers
Abstract
Translate article icon Translate Article Star icon

Programming-specific Q&A sites (e.g., Stack Overflow) are being used extensively by software developers for knowledge sharing and acquisition. Due to the cross-reference of questions and answers (note that users also reference URLs external to the Q&A site. In this paper, URL sharing refers to internal URLs within the Q&A site, unless otherwise stated), knowledge is diffused in the Q&A site, forming a large knowledge network. In Stack Overflow, why do developers share URLs? How is the community feedback to the knowledge being shared? What are the unique topological and semantic properties of the resulting knowledge network in Stack Overflow? Has this knowledge network become stable? If so, how does it reach to stability? Answering these questions can help the software engineering community better understand the knowledge diffusion process in programming-specific Q&A sites like Stack Overflow, thereby enabling more effective knowledge sharing, knowledge use, and knowledge representation and search in the community. Previous work has focused on analyzing user activities in Q&A sites or mining the textual content of these sites. In this article, we present a methodology to analyze URL sharing activities in Stack Overflow. We use open coding method to analyze why users share URLs in Stack Overflow, and develop a set of quantitative analysis methods to study the structural and dynamic properties of the emergent knowledge network in Stack Overflow. We also identify system designs, community norms, and social behavior theories that help explain our empirical findings. Through this study, we obtain an in-depth understanding of the knowledge diffusion process in Stack Overflow and expose the implications of URL sharing behavior for Q&A site design, developers who use crowdsourced knowledge in Stack Overflow, and future research on knowledge representation and search.

Similar Papers
  • Research Article
  • Cite Count Icon 7
  • 10.1007/s10664-021-10028-y
An exploratory study on the repeatedly shared external links on Stack Overflow
  • Nov 3, 2021
  • Empirical Software Engineering
  • Jiakun Liu + 6 more

On Stack Overflow, users reuse 11,926,354 external links to share the resources hosted outside the Stack Overflow website. The external links connect to the existing programming-related knowledge and extend the crowdsourced knowledge on Stack Overflow. Some of the external links, so-called as repeated external links, can be shared for multiple times. We observe that 82.5% of the link sharing activities (i.e., sharing links in any question, answer, or comment) on Stack Overflow share external resources, and 57.0% of the occurrences of the external links are sharing the repeated external links. However, it is still unclear what types of external resources are repeatedly shared. To help users manage their knowledge, we wish to investigate the characteristics of the repeated external links in knowledge sharing on Stack Overflow. In this paper, we analyze the repeated external links on Stack Overflow. We observe that external links that point to the text resources (hosted in documentation websites, tutorial websites, etc.) are repeatedly shared the most. We observe that: 1) different users repeatedly share the same knowledge in the form of repeated external links, thus increasing the maintenance effort of knowledge (e.g., update invalid links in multiple posts), 2) the same users can repeatedly share the external links for the purpose of promotion, and 3) external links can point to webpages with an overload of information that is difficult for users to retrieve relevant information. Our findings provide insights to Stack Overflow moderators and researchers. For example, we encourage Stack Overflow to centrally manage the commonly occurring knowledge in the form of repeated external links in order to better maintain the crowdsourced knowledge on Stack Overflow.

  • Conference Article
  • Cite Count Icon 25
  • 10.1145/3463274.3463805
Towards offensive language detection and reduction in four Software Engineering communities
  • Jun 21, 2021
  • Jithin Cheriyan + 2 more

Software Engineering (SE) communities such as Stack Overflow have become unwelcoming, particularly through members' use of offensive language. Research has shown that offensive language drives users away from active engagement within these platforms. This work aims to explore this issue more broadly by investigating the nature of offensive language in comments posted by users in four prominent SE platforms - GitHub, Gitter, Slack and Stack Overflow (SO). It proposes an approach to detect and classify offensive language in SE communities by adopting natural language processing and deep learning techniques. Further, a Conflict Reduction System (CRS), which identifies offence and then suggests what changes could be made to minimize offence has been proposed. Beyond showing the prevalence of offensive language in over 1 million comments from four different communities which ranges from 0.07% to 0.43%, our results show promise in successful detection and classification of such language. The CRS system has the potential to drastically reduce manual moderation efforts to detect and reduce offence in SE communities.

  • Research Article
  • Cite Count Icon 53
  • 10.1109/tse.2019.2954319
Reading Answers on Stack Overflow: Not Enough!
  • Dec 5, 2019
  • IEEE Transactions on Software Engineering
  • Haoxiang Zhang + 3 more

Stack Overflow is one of the most active communities for developers to share their programming knowledge. Answers posted on Stack Overflow help developers solve issues during software development. In addition to posting answers, users can also post comments to further discuss their associated answers. As of Aug 2017, there are 32.3 million comments that are associated with answers, forming a large collection of crowdsourced repository of knowledge on top of the commonly-studied Stack Overflow answers. In this study, we wish to understand how the commenting activities contribute to the crowdsourced knowledge. We investigate what users discuss in comments, and analyze the characteristics of the commenting dynamics, (i.e., the timing of commenting activities and the roles of commenters). We find that: 1) the majority of comments are informative and thus can enhance their associated answers from a diverse range of perspectives. However, some comments contain content that is discouraged by Stack Overflow. 2) The majority of commenting activities occur after the acceptance of an answer. More than half of the comments are fast responses occurring within one day of the creation of an answer, while later comments tend to be more informative. Most comments are rarely integrated back into their associated answers, even though such comments are informative. 3) Insiders (i.e., users who posted questions/answers before posting a comment in a question thread) post the majority of comments within one month, and outsiders (i.e., users who never posted any question/answer before posting a comment) post the majority of comments after one month. Inexperienced users tend to raise limitations and concerns while experienced users tend to enhance the answer through commenting. Our study provides insights into the commenting activities in terms of their content, timing, and the individuals who perform the commenting. For the purpose of long-term knowledge maintenance and effective information retrieval for developers, we also provide actionable suggestions to encourage Stack Overflow users/engineers/moderators to leverage our insights for enhancing the current Stack Overflow commenting system for improving the maintenance and organization of the crowdsourced knowledge.

  • Research Article
  • Cite Count Icon 40
  • 10.1016/j.infsof.2021.106667
On the value of encouraging gender tolerance and inclusiveness in software engineering communities
  • Nov 1, 2021
  • Information and Software Technology
  • Elijah Zolduoarrati + 1 more

On the value of encouraging gender tolerance and inclusiveness in software engineering communities

  • Conference Article
  • Cite Count Icon 90
  • 10.1109/saner.2017.7884629
Stack Overflow: A code laundering platform?
  • Feb 1, 2017
  • Le An + 3 more

Developers use Question and Answer (Q&A) websites to exchange knowledge and expertise. Stack Overflow is a popular Q&A website where developers discuss coding problems and share code examples. Although all Stack Overflow posts are free to access, code examples on Stack Overflow are governed by the Creative Commons Attribute-ShareAlike 3.0 Unported license that developers should obey when reusing code from Stack Overflow or posting code to Stack Overflow. In this paper, we conduct a case study with 399 Android apps, to investigate whether developers respect license terms when reusing code from Stack Overflow posts (and the other way around). We found 232 code snippets in 62 Android apps from our dataset that were potentially reused from Stack Overflow, and 1,226 Stack Overflow posts containing code examples that are clones of code released in 68 Android apps, suggesting that developers may have copied the code of these apps to answer Stack Overflow questions. We investigated the licenses of these pieces of code and observed 1,279 cases of potential license violations (related to code posting to Stack overflow or code reuse from Stack overflow). This paper aims to raise the awareness of the software engineering community about potential unethical code reuse activities taking place on Q&A websites like Stack Overflow.

  • Conference Article
  • Cite Count Icon 314
  • 10.1109/socialcom.2013.35
StackOverflow and GitHub: Associations between Software Development and Crowdsourced Knowledge
  • Sep 1, 2013
  • Bogdan Vasilescu + 2 more

Stack Overflow is a popular on-line programming question and answer community providing its participants with rapid access to knowledge and expertise of their peers, especially benefitting coders. Despite the popularity of Stack Overflow, its role in the work cycle of open-source developers is yet to be understood: on the one hand, participation in it has the potential to increase the knowledge of individual developers thus improving and speeding up the development process. On the other hand, participation in Stack Overflow may interrupt the regular working rhythm of the developer, hence also possibly slow down the development process. In this paper we investigate the interplay between Stack Overflow activities and the development process, reflected by code changes committed to the largest social coding repository, GitHub. Our study shows that active GitHub committers ask fewer questions and provide more answers than others. Moreover, we observe that active Stack Overflow askers distribute their work in a less uniform way than developers that do not ask questions. Finally, we show that despite the interruptions incurred, the Stack Overflow activity rate correlates with the code changing activity in GitHub.

  • Conference Article
  • Cite Count Icon 139
  • 10.1109/saner.2018.8330202
Mining stackoverflow for program repair
  • Mar 1, 2018
  • Xuliang Liu + 1 more

In recent years, automatic program repair has been a hot research topic in the software engineering community, and many approaches have been proposed. Although these approaches produce promising results, some researchers criticize that existing approaches are still limited in their repair capability, due to their limited repair templates. Indeed, it is quite difficult to design effective repair templates. An award-wining paper analyzes thousands of manual bug fixes, but summarizes only ten repair templates. Although more bugs are thus repaired, recent studies show such repair templates are still insufficient. We notice that programmers often refer to Stack Overflow, when they repair bugs. With years of accumulation, Stack Overflow has millions of posts that are potentially useful to repair many bugs. The observation motives our work towards mining repair templates from Stack Overflow. In this paper, we propose a novel approach, called SOFix, that extracts code samples from Stack Overflow, and mines repair patterns from extracted code samples. Based on our mined repair patterns, we derived 13 repair templates. We implemented these repair templates in SOFix, and conducted evaluations on the widely used benchmark, Defects4J. Our results show that SOFix repaired 23 bugs, which are more than existing approaches. After comparing repaired bugs and templates, we find that SOFix repaired more bugs, since it has more repair templates. In addition, our results also reveal the urgent need for better fault localization techniques.

  • Research Article
  • Cite Count Icon 11
  • 10.1016/j.jss.2023.111608
Characterizing architecture related posts and their usefulness in Stack Overflow
  • Jan 5, 2023
  • Journal of Systems and Software
  • Musengamana Jean De Dieu + 3 more

Characterizing architecture related posts and their usefulness in Stack Overflow

  • Conference Article
  • Cite Count Icon 15
  • 10.1145/3468264.3468582
Characterizing search activities on stack overflow
  • Aug 18, 2021
  • Jiakun Liu + 5 more

To solve programming issues, developers commonly search on Stack Overflow to seek potential solutions. However, there is a gap between the knowledge developers are interested in and the knowledge they are able to retrieve using search engines. To help developers efficiently retrieve relevant knowledge on Stack Overflow, prior studies proposed several techniques to reformulate queries and generate summarized answers. However, few studies performed a large-scale analysis using real-world search logs. In this paper, we characterize how developers search on Stack Overflow using such logs. By doing so, we identify the challenges developers face when searching on Stack Overflow and seek opportunities for the platform and researchers to help developers efficiently retrieve knowledge. To characterize search activities on Stack Overflow, we use search log data based on requests to Stack Overflow's web servers. We find that the most common search activity is reformulating the immediately preceding queries. Related work looked into query reformulations when using generic search engines and found 13 types of query reformulation strategies. Compared to their results, we observe that 71.78% of the reformulations can be fitted into those reformulation strategies. In terms of how queries are structured, 17.41% of the search sessions only search for fragments of source code artifacts (e.g., class and method names) without specifying the names of programming languages, libraries, or frameworks. Based on our findings, we provide actionable suggestions for Stack Overflow moderators and outline directions for future research. For example, we encourage Stack Overflow to set up a database that includes the relations between all computer programming terminologies shared on Stack Overflow, e.g., method name, data structure name, design pattern, and IDE name. By doing so, Stack Overflow could improve the performance of search engines by considering related programming terminologies at different levels of granularity.

  • Research Article
  • Cite Count Icon 10
  • 10.1007/s11280-018-0621-y
LinkLive: discovering Web learning resources for developers from Q&A discussions
  • Jul 23, 2018
  • World Wide Web
  • Jing Li + 2 more

Software developers need access to correlated information (e.g., API documentation, Wikipedia pages, Stack Overflow questions and answers) which are often dispersed among different Web resources. This paper is concerned with the situation where a developer is visiting a Web page, but at the same time is willing to explore correlated Web resources to extend his/her knowledge or to satisfy his/her curiosity. Specifically, we present an item-based collaborative filtering technique, named LinkLive, for automatically recommending a list of correlated Web resources for a particular Web page. The recommendation is done by exploiting hyperlink associations from the crowdsourced knowledge on Stack Overflow. We motivate our research using an exploratory study of hyperlink dissemination patterns on Stack Overflow. We then present our LinkLive technique that uses multiple features, including hyperlink co-occurrences in Q&A discussions, locations (e.g., question, answer, or comment) in which hyperlinks are referenced, and votes for posts/comments in which hyperlinks are referenced. Experiments using 7 years of Stack Overflow data show that, our technique recommends correlated Web resources with promising accuracy in an open setting. A user study of 6 participants suggests that practitioners find the recommended Web resources useful for Web discovery.

  • Conference Article
  • Cite Count Icon 57
  • 10.1109/icsme.2018.00057
Effective Reformulation of Query for Code Search Using Crowdsourced Knowledge and Extra-Large Data Analytics
  • Sep 1, 2018
  • Mohammad Masudur Rahman + 1 more

Software developers frequently issue generic natural language queries for code search while using code search engines (e.g., GitHub native search, Krugle). Such queries often do not lead to any relevant results due to vocabulary mismatch problems. In this paper, we propose a novel technique that automatically identifies relevant and specific API classes from Stack Overflow Q & A site for a programming task written as a natural language query, and then reformulates the query for improved code search. We first collect candidate API classes from Stack Overflow using pseudo-relevance feedback and two term weighting algorithms, and then rank the candidates using Borda count and semantic proximity between query keywords and the API classes. The semantic proximity has been determined by an analysis of 1.3 million questions and answers of Stack Overflow. Experiments using 310 code search queries report that our technique suggests relevant API classes with 48% precision and 58% recall which are 32% and 48% higher respectively than those of the state-of-the-art. Comparisons with two state-of-the-art studies and three popular search engines (e.g., Google, Stack Overflow, and GitHub native search) report that our reformulated queries (1) outperform the queries of the state-of-the-art, and (2) significantly improve the code search results provided by these contemporary search engines.

  • Conference Article
  • Cite Count Icon 21
  • 10.1145/2961111.2962585
Moving to Stack Overflow
  • Sep 8, 2016
  • Fabio Calefato + 2 more

Context: Recently, more and more developer communities are abandoning their legacy support forums, moving onto Stack Overflow. The motivations are diverse, yet they typically include achieving faster response time and larger visibility through the access to a modern and very successful infrastructure. One downside of migration, however, is that the history and the crowdsourced knowledge hosted at previous sites remain separated or even get lost if a community decides to abandon completely the legacy developer forum.Goal: Adding to the body of evidence of existing research on best-answer prediction, here we show that, from a technical perspective, the content from existing developer forums might be automatically migrated to the Stack Overflow, although most of forums do not allow to mark a question as resolved, a distinctive feature of modern Q&A sites.Method: We trained a binary classifier with data from Stack Overflow and then tested it with data scraped from Docusign, a developer forum that has recently completed the move.Results: Our findings show that best answers can be predicted with a good accuracy, only relying on shallow linguistic (text) features, such as answer length and the number of sentences, combined with other features like answer upvotes and age, which can be easily computed in near real-time.Conclusions: Results provide an initial yet positive evidence towards the automatic migration of crowdsourced knowledge from legacy forums to modern Q&A sites.

  • Conference Article
  • Cite Count Icon 16
  • 10.1145/2851613.2851772
Software-specific part-of-speech tagging
  • Apr 4, 2016
  • Deheng Ye + 3 more

Part-of-speech (POS) tagging performance degrades on out-of-domain data due to the lack of domain knowledge. Software engineering knowledge, embodied in textual documentations, bug reports and online forum discussions, is expressed in natural language, but is full of domain terms, software entities and software-specific informal languages. Such software texts call for software-specific POS tagging. In the software engineering community, there have been several attempts leveraging POS tagging technique to help solve software engineering tasks. However, little work is done for POS tagging on software natural language texts. In this paper, we build a software-specific POS tagger, called S-POS, for processing the textual discussions on Stack Overflow. We target at Stack Overflow because it has become an important developer-generated knowledge repository for software engineering. We define a POS tagset that is suitable for describing software engineering knowledge, select corpus, develop a custom tokenizer, annotate data, design features for supervised model training, and demonstrate that the tagging accuracy of S-POS outperforms that of the Stanford POS Tagger when tagging software texts. Our work presents a feasible roadmap to build software-specific POS tagger for the socio-professional contents on Stack Overflow, and reveals challenges and opportunities for advanced software-specific information extraction.

  • Research Article
  • Cite Count Icon 47
  • 10.1109/tse.2021.3058985
A Study of C/C++ Code Weaknesses on Stack Overflow
  • Feb 23, 2021
  • IEEE Transactions on Software Engineering
  • Haoxiang Zhang + 4 more

Stack Overflow hosts millions of solutions that aim to solve developers' programming issues. In this crowdsourced question answering process, Stack Overflow becomes a code hosting website where developers actively share its code. However, code snippets on Stack Overflow may contain security vulnerabilities, and if shared carelessly, such snippets can introduce security problems in software systems. In this paper, we empirically study the prevalence of the <i>Common Weakness Enumeration</i> – CWE, in code snippets of C/C++ related answers. We explore the characteristics of <inline-formula><tex-math notation="LaTeX">$Code_w$</tex-math></inline-formula> , i.e., code snippets that have CWE instances, in terms of the types of weaknesses, the evolution of <inline-formula><tex-math notation="LaTeX">$Code_w$</tex-math></inline-formula> , and who contributed such code snippets. We find that: 1) 36 percent (i.e., 32 out of 89) CWE types are detected in <inline-formula><tex-math notation="LaTeX">$Code_w$</tex-math></inline-formula> on Stack Overflow. Particularly, CWE-119, i.e., <i>improper restriction of operations within the bounds of a memory buffer</i> , is common in both answer code snippets and real-world software systems. Furthermore, the proportion of <inline-formula><tex-math notation="LaTeX">$Code_w$</tex-math></inline-formula> doubled from 2008 to 2018 after normalizing by the total number of C/C++ snippets in each year. 2) In general, code revisions are associated with a reduction in the number of code weaknesses. However, the majority of <inline-formula><tex-math notation="LaTeX">$Code_w$</tex-math></inline-formula> had weaknesses introduced in the first version of the code, and these <inline-formula><tex-math notation="LaTeX">$Code_w$</tex-math></inline-formula> were never revised since then. Only 7.5 percent of users who contributed C/C++ code snippets posted or edited code with weaknesses. Users contributed less code with CWE weakness when they were more active (i.e., they either revised more code snippets or had a higher reputation). We also find that some users tended to have the same CWE type repeatedly in their various code snippets. Our empirical study provides insights to users who share code snippets on Stack Overflow so that they are aware of the potential security issues. To understand the community feedback about improving code weaknesses by answer revisions, we also conduct a qualitative study and find that 62.5 percent of our suggested revisions are adopted by the community. Stack Overflow can perform CWE scanning for all the code that is hosted on its platform. Further research is needed to improve the quality of the crowdsourced knowledge on Stack Overflow.

  • Research Article
  • Cite Count Icon 141
  • 10.1016/j.infsof.2017.10.009
How to ask for technical help? Evidence-based guidelines for writing questions on Stack Overflow
  • Nov 6, 2017
  • Information and Software Technology
  • Fabio Calefato + 2 more

How to ask for technical help? Evidence-based guidelines for writing questions on Stack Overflow

Save Icon
Up Arrow
Open/Close
Notes

Save Important notes in documents

Highlight text to save as a note, or write notes directly

You can also access these Documents in Paperpal, our AI writing tool

Powered by our AI Writing Assistant