SOTorrent

Abstract

Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and by collecting references from GitHub files to SO posts. In this paper, we describe how we built SOTorrent, and in particular how we evaluated 134 different string similarity metrics regarding their applicability for reconstructing the version history of text and code blocks. Based on a first analysis using the dataset, we present insights into the evolution of SO posts, e.g., that post edits are usually small, happen soon after the initial creation of the post, and that code is rarely changed without also updating the surrounding text. Further, our analysis revealed a close relationship between post edits and comments. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub.
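The abstract describes reconstructing the version history of text and code blocks with string similarity metrics. The following is a minimal sketch of that idea, not SOTorrent's actual method: it assumes Python's `difflib.SequenceMatcher` ratio as the metric (the paper evaluated 134 metrics; which one it chose is not stated here), and the function names and the threshold of 0.6 are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two blocks (assumed metric)."""
    return SequenceMatcher(None, a, b).ratio()


def match_blocks(old_blocks, new_blocks, threshold=0.6):
    """Greedily link each block of the new post version to a predecessor.

    Returns a dict mapping the index of each new block to the index of the
    most similar, not yet matched old block, or None if no old block reaches
    the (hypothetical) similarity threshold.
    """
    matches = {}
    used = set()
    for j, new in enumerate(new_blocks):
        best_i, best_score = None, threshold
        for i, old in enumerate(old_blocks):
            if i in used:
                continue  # each old block can have at most one successor
            score = similarity(old, new)
            if score >= best_score:
                best_i, best_score = i, score
        if best_i is not None:
            used.add(best_i)
        matches[j] = best_i
    return matches


if __name__ == "__main__":
    version_1 = ["print 'hello'", "x = 1\ny = 2"]
    version_2 = ["print('hello')", "x = 1\ny = 3", "import os"]
    # The first two blocks are edits of existing blocks; the third is new.
    print(match_blocks(version_1, version_2))
```

A greedy assignment keeps the sketch short. A block with no predecessor (`None`) would be treated as newly added in that version.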

Similar Papers
  • Conference Article
  • Cite count: 76
  • 10.1109/msr.2019.00038
SOTorrent: Studying the Origin, Evolution, and Usage of Stack Overflow Code Snippets
  • May 1, 2019
  • Sebastian Baltes + 2 more

Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of copyable code snippets. Like other software artifacts, code on SO evolves over time, for example when bugs are fixed or APIs are updated to the most recent version. To be able to analyze how code and the surrounding text on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text and code blocks. It connects code snippets from SO posts to other platforms by aggregating URLs from surrounding text blocks and comments, and by collecting references from GitHub files to SO posts. Our vision is that researchers will use SOTorrent to investigate and understand the evolution and maintenance of code on SO and its relation to other platforms such as GitHub.
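Collecting references from GitHub files to SO posts amounts to finding SO links in source text. The sketch below illustrates one way to do that with a regular expression. It is an assumption for illustration, not the pattern SOTorrent uses. It recognizes the short forms `/a/<id>` and `/q/<id>` and the long form `/questions/<id>/...`, and treats every long-form link as a question reference even if it points to a specific answer. The names `SO_LINK` and `find_so_references` are hypothetical.

```python
import re

# Matches http(s) links to stackoverflow.com in short or long form.
SO_LINK = re.compile(
    r"https?://(?:www\.)?stackoverflow\.com/"
    r"(?:(?P<short>[aq])/(?P<short_id>\d+)|questions/(?P<question_id>\d+))"
)


def find_so_references(source: str):
    """Return (post_type, post_id) tuples for each SO link found in source."""
    refs = []
    for m in SO_LINK.finditer(source):
        if m.group("short") == "a":
            refs.append(("answer", int(m.group("short_id"))))
        elif m.group("short") == "q":
            refs.append(("question", int(m.group("short_id"))))
        else:
            # Long-form link; simplification: always counted as a question.
            refs.append(("question", int(m.group("question_id"))))
    return refs


if __name__ == "__main__":
    sample = (
        "// adapted from https://stackoverflow.com/a/67890\n"
        "// see http://stackoverflow.com/questions/12345/how-to\n"
    )
    print(find_so_references(sample))
```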

  • Research Article
  • Cite count: 104
  • 10.1007/s10664-018-9650-5
Usage and attribution of Stack Overflow code snippets in GitHub projects
  • Oct 1, 2018
  • Empirical Software Engineering
  • Sebastian Baltes + 1 more

Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of copyable code snippets. Using those snippets raises maintenance and legal issues. SO’s license (CC BY-SA 3.0) requires attribution, i.e., referencing the original question or answer, and requires derived work to adopt a compatible license. While there is a heated debate on SO’s license model for code snippets and the required attribution, little is known about the extent to which snippets are copied from SO without proper attribution. We present results of a large-scale empirical study analyzing the usage and attribution of non-trivial Java code snippets from SO answers in public GitHub (GH) projects. We followed three different approaches to triangulate an estimate for the ratio of unattributed usages and conducted two online surveys with software developers to complement our results. For the different sets of projects that we analyzed, the ratio of projects containing files with a reference to SO varied between 3.3% and 11.9%. We found that at most 1.8% of all analyzed repositories containing code from SO used the code in a way compatible with CC BY-SA 3.0. Moreover, we estimate that at most a quarter of the copied code snippets from SO are attributed as required. Of the surveyed developers, almost one half admitted copying code from SO without attribution and about two thirds were not aware of the license of SO code snippets and its implications.

  • Research Article
  • Cite count: 47
  • 10.1109/tse.2021.3058985
A Study of C/C++ Code Weaknesses on Stack Overflow
  • Feb 23, 2021
  • IEEE Transactions on Software Engineering
  • Haoxiang Zhang + 4 more

Stack Overflow hosts millions of solutions that aim to solve developers' programming issues. In this crowdsourced question answering process, Stack Overflow becomes a code hosting website where developers actively share their code. However, code snippets on Stack Overflow may contain security vulnerabilities, and if shared carelessly, such snippets can introduce security problems in software systems. In this paper, we empirically study the prevalence of the Common Weakness Enumeration (CWE) in code snippets of C/C++ related answers. We explore the characteristics of $Code_w$, i.e., code snippets that have CWE instances, in terms of the types of weaknesses, the evolution of $Code_w$, and who contributed such code snippets. We find that: 1) 36 percent (i.e., 32 out of 89) of CWE types are detected in $Code_w$ on Stack Overflow. Particularly, CWE-119, i.e., improper restriction of operations within the bounds of a memory buffer, is common in both answer code snippets and real-world software systems. Furthermore, the proportion of $Code_w$ doubled from 2008 to 2018 after normalizing by the total number of C/C++ snippets in each year. 2) In general, code revisions are associated with a reduction in the number of code weaknesses. However, the majority of $Code_w$ had weaknesses introduced in the first version of the code, and these $Code_w$ were never revised since then. Only 7.5 percent of users who contributed C/C++ code snippets posted or edited code with weaknesses.
Users contributed less code with CWE weakness when they were more active (i.e., they either revised more code snippets or had a higher reputation). We also find that some users tended to have the same CWE type repeatedly in their various code snippets. Our empirical study provides insights to users who share code snippets on Stack Overflow so that they are aware of the potential security issues. To understand the community feedback about improving code weaknesses by answer revisions, we also conduct a qualitative study and find that 62.5 percent of our suggested revisions are adopted by the community. Stack Overflow can perform CWE scanning for all the code that is hosted on its platform. Further research is needed to improve the quality of the crowdsourced knowledge on Stack Overflow.

  • Conference Article
  • Cite count: 270
  • 10.1109/sp.2017.31
Stack Overflow Considered Harmful? The Impact of Copy&Paste on Android Application Security
  • May 1, 2017
  • Felix Fischer + 6 more

pp. 121-136

  • Research Article
  • Cite count: 54
  • 10.1109/tse.2020.3023664
An Empirical Study of C++ Vulnerabilities in Crowd-Sourced Code Examples
  • Sep 4, 2020
  • IEEE Transactions on Software Engineering
  • Morteza Verdi + 5 more

Software developers share programming solutions in Q&A sites like Stack Overflow, Stack Exchange, Android forum, and so on. The reuse of crowd-sourced code snippets can facilitate rapid prototyping. However, recent research shows that the shared code snippets may be of low quality and can even contain vulnerabilities. This paper aims to understand the nature and the prevalence of security vulnerabilities in crowd-sourced code examples. To achieve this goal, we investigate security vulnerabilities in the C++ code snippets shared on Stack Overflow over a period of 10 years. In collaborative sessions involving multiple human coders, we manually assessed each code snippet for security vulnerabilities following CWE (Common Weakness Enumeration) guidelines. From the 72,483 reviewed code snippets used in at least one project hosted on GitHub, we found a total of 99 vulnerable code snippets categorized into 31 types. Many of the investigated code snippets are still not corrected on Stack Overflow. The 99 vulnerable code snippets found in Stack Overflow were reused in a total of 2859 GitHub projects. To help improve the quality of code snippets shared on Stack Overflow, we developed a browser extension that allows Stack Overflow users to be notified for vulnerabilities in code snippets when they see them on the platform.

  • Dissertation
  • 10.33915/etd.4069
Automatic Detection of Insecure Codes in Stack Overflow
  • Jul 29, 2019
  • Shifu Hou

As the popularity of modern social coding paradigm such as Stack Overflow grows, its potential security risks increase as well (e.g., insecure codes could be easily embedded and distributed). To address this largely overlooked issue, we bring a new insight to exploit social coding properties in addition to code content for automatic detection of insecure code snippets in Stack Overflow. To determine if the given code snippets are insecure, we not only analyze the code content, but also utilize various kinds of relations among users, badges, questions, answers, code snippets and keywords in Stack Overflow. To model the rich semantic relationships, we first introduce a structured heterogeneous information network (HIN) for representation and then use meta-path based approach to incorporate higher-level semantics to build up relatedness over code snippets. Later, we propose two different novel network embedding models named Snippet2vec and CodeHin2Vec for representation learning in HIN to automate the insecure code snippet detection in Stack Overflow. More specifically, Snippet2vec learns the low dimensional representations for the nodes (i.e., code snippets) in the HIN where both the HIN structures and semantics are maximally preserved, while CodeHin2Vec utilizes HIN to depict relatedness over code snippets to generate code-to-code sequences, based on which sequence-to-sequence (seq2seq) concept in machine translation is further leveraged to learn representations of code snippets. Accordingly, we developed systems ICSD and iTrustSO which integrate our proposed methods respectively in insecure code snippet detection in Stack Overflow. Comprehensive experiments on the data collections from Stack Overflow are conducted to validate the effectiveness of our developed systems by comparisons with the state-of-the-art baseline methods.

  • Research Article
  • Cite count: 125
  • 10.1109/tse.2019.2900307
Toxic Code Snippets on Stack Overflow
  • Jun 20, 2018
  • IEEE Transactions on Software Engineering
  • Chaiyong Ragkhitwetsagul + 4 more

Online code clones are code fragments that are copied from software projects or online sources to Stack Overflow as examples. Due to an absence of a checking mechanism after the code has been copied to Stack Overflow, they can become toxic code snippets, e.g., they suffer from being outdated or violating the original software license. We present a study of online code clones on Stack Overflow and their toxicity by incorporating two developer surveys and a large-scale code clone detection. A survey of 201 high-reputation Stack Overflow answerers (33 percent response rate) showed that 131 participants (65 percent) have ever been notified of outdated code and 26 of them (20 percent) rarely or never fix the code. 138 answerers (69 percent) never check for licensing conflicts between their copied code snippets and Stack Overflow's CC BY-SA 3.0. A survey of 87 Stack Overflow visitors shows that they experienced several issues from Stack Overflow answers: mismatched solutions, outdated solutions, incorrect solutions, and buggy code. 85 percent of them are not aware of CC BY-SA 3.0 license enforced by Stack Overflow, and 66 percent never check for license conflicts when reusing code snippets. Our clone detection found online clone pairs between 72,365 Java code snippets on Stack Overflow and 111 open source projects in the curated Qualitas corpus. We analysed 2,289 non-trivial online clone candidates. Our investigation revealed strong evidence that 153 clones have been copied from a Qualitas project to Stack Overflow. We found 100 of them (66 percent) to be outdated, of which 10 were buggy and harmful for reuse. Furthermore, we found 214 code snippets that could potentially violate the license of their original software and appear 7,112 times in 2,427 GitHub projects.

  • Conference Article
  • Cite count: 19
  • 10.1109/msr.2019.00047
How Often and What StackOverflow Posts Do Developers Reference in Their GitHub Projects?
  • May 1, 2019
  • Saraj Singh Manes + 1 more

Stack Overflow (SO) is a popular Q&A forum for software developers, providing a large number of copyable code snippets. While GitHub is an independent code collaboration platform, developers often reuse SO code in their GitHub projects. In this paper, we investigate how often GitHub developers reuse code snippets from the SO forum, as well as what concepts they are more likely to reference in their code. To accomplish our goal, we mine the SOTorrent dataset, which connects code snippets in SO posts with software projects hosted on GitHub. We then study the characteristics of GitHub projects that reference SO posts and what popular SO discussions can be found in GitHub projects. Our results demonstrate that on average developers make 45 references to SO posts in their projects, with the highest number of references being made within JavaScript code. We also found that 79% of the SO posts with code snippets that are referenced in GitHub code do change over time (at least once), raising code maintainability and reliability concerns.

  • Conference Article
  • Cite count: 18
  • 10.1145/3468264.3473114
Code2Que: a tool for improving question titles from mined code snippets in stack overflow
  • Aug 18, 2021
  • Zhipeng Gao + 4 more

Stack Overflow is one of the most popular technical Q&A sites used by software developers. Seeking help from Stack Overflow has become an essential part of software developers’ daily work for solving programming-related questions. Although the Stack Overflow community has provided quality assurance guidelines to help users write better questions, we observed that a significant number of questions submitted to Stack Overflow are of low quality. In this paper, we introduce a new web-based tool, Code2Que, which can help developers in writing higher quality questions for a given code snippet. Code2Que consists of two main stages: offline learning and online recommendation. In the offline learning phase, we first collect a set of good quality ⟨code snippet, question⟩ pairs as training samples. We then train our model on these training samples via a deep sequence-to-sequence approach, enhanced with an attention mechanism, a copy mechanism and a coverage mechanism. In the online recommendation phase, for a given code snippet, we use the offline trained model to generate question titles to assist less experienced developers in writing questions more effectively. To evaluate Code2Que, we first sampled 50 low quality ⟨code snippet, question⟩ pairs from the Python and Java datasets on Stack Overflow. Then we conducted a user study to evaluate the question titles generated by our approach as compared to human-written ones using three metrics: Clearness, Fitness and Willingness to Respond. Our experimental results show that for a large number of low-quality questions in Stack Overflow, Code2Que can improve the question titles in terms of Clearness, Fitness and Willingness measures.

  • Research Article
  • Cite count: 36
  • 10.1016/j.scico.2020.102516
Understanding stack overflow code quality: A recommendation of caution
  • Jul 10, 2020
  • Science of Computer Programming
  • Sarah Meldrum + 3 more


  • Conference Article
  • Cite count: 10
  • 10.1109/msr52588.2021.00040
Studying the Change Histories of Stack Overflow and GitHub Snippets
  • May 1, 2021
  • Saraj Singh Manes + 1 more

Stack Overflow is a popular Q&A forum for software developers, providing a large number of copyable code snippets. While GitHub is a collaborative development platform, developers often reuse Stack Overflow code in their GitHub projects. These snippets get revised or edited on each platform. In this work, we study Stack Overflow posts and the code snippets that are reused from these posts in GitHub projects. We investigate and compare the change history of SO snippets with the change history of GitHub snippets. We have applied a stratified random sampling when mining 440,000 GitHub projects to create a dataset representing the change history of the reused snippets; this dataset contains 22,900 GitHub projects, 33,765 Stack Overflow references mapped to 4,634 Stack Overflow posts, and a total of 73,322 commits. We analyze the evolution patterns of snippets on each platform, compare key trends and explore the co-change of these snippets. Our results demonstrate that 76% of snippets evolve on Stack Overflow, while only 22% of the reused code snippets evolve in GitHub. Stack Overflow snippets undergo fewer and smaller changes compared to their evolving counterparts on GitHub. The evolution of snippets on both platforms is driven by the original author of the content. Finally, we found that a small percentage of snippets is co-changing across two platforms, while snippets in GitHub and Stack Overflow evolve independently of one another.

  • Research Article
  • Cite count: 10
  • 10.1016/j.jss.2024.111964
An empirical study of code reuse between GitHub and stack overflow during software development
  • Jan 8, 2024
  • Journal of Systems and Software
  • Xiangping Chen + 4 more


  • Conference Article
  • Cite count: 32
  • 10.1109/icse-c.2017.99
Attribution Required: Stack Overflow Code Snippets in GitHub Projects
  • May 1, 2017
  • Sebastian Baltes + 2 more

Stack Overflow (SO) is the largest Q&A website for developers, providing a huge amount of copyable code snippets. Using these snippets raises various maintenance and legal issues. The SO license requires attribution, i.e., referencing the original question or answer, and requires derived work to adopt a compatible license. While there is a heated debate on SO's license model for code snippets and the required attribution, little is known about the extent to which snippets are copied from SO without proper attribution. In this paper, we present the research design and summarized results of an empirical study analyzing attributed and unattributed usages of SO code snippets in GitHub projects. On average, 3.22% of all analyzed repositories and 7.33% of the popular ones contained a reference to SO. Further, we found that developers rather refer to the whole thread on SO than to a specific answer. For Java, at least two thirds of the copied snippets were not attributed.

  • Research Article
  • Cite count: 1
  • 10.1142/s0218194022500310
An Empirical Study on Rule Violation History of JavaScript Code Blocks on Stack Overflow
  • May 1, 2022
  • International Journal of Software Engineering and Knowledge Engineering
  • Jungil Kim + 1 more

JavaScript code blocks on Stack Overflow (SO) are often used in software projects. However, little is known about the risk of rule violations in SO JavaScript code blocks. Rule violations are one of the factors that degrade the quality of JavaScript code. To keep rule violations from spreading through the reuse of SO JavaScript code blocks, it is necessary to investigate how prone these code blocks are to rule violations. To examine the issue, we performed a quantitative analysis to investigate how many rule violations there are, when the first rule violation occurs, and what the trend of rule violations is in the evolution history of Stack Overflow JavaScript code blocks. We collected SO posts related to JavaScript and extracted the code blocks contained in the posts. By using ESLint, the most popular rule violation detection tool, we identified rule violations in the evolution history of our target code blocks. We then performed quantitative analyses on the identified rule violations. As a result of the analyses, we found that: (1) 60% of the studied code blocks evolve with rule violations. (2) Among the code blocks with rule violations, 92% have their first rule violation in the early phase of their evolution. (3) 80% of the code blocks with rule violations never fix existing rule violations during their evolution. Our findings suggest that SO should provide a policy that can reduce rule violations in submitted JavaScript code blocks. The findings can also encourage SO users to pay attention to rule violations when reusing SO JavaScript code blocks.

  • Research Article
  • Cite count: 25
  • 10.1016/j.jss.2019.110505
SCC++: Predicting the programming language of questions and snippets of Stack Overflow
  • Dec 23, 2019
  • Journal of Systems and Software
  • Kamel Alrashedy + 4 more

