Stack Overflow Considered Harmful? The Impact of Copy&Paste on Android Application Security
pp. 121-136
- Research Article
104
- 10.1007/s10664-018-9650-5
- Oct 1, 2018
- Empirical Software Engineering
Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of copyable code snippets. Using those snippets raises maintenance and legal issues. SO’s license (CC BY-SA 3.0) requires attribution, i.e., referencing the original question or answer, and requires derived work to adopt a compatible license. While there is a heated debate on SO’s license model for code snippets and the required attribution, little is known about the extent to which snippets are copied from SO without proper attribution. We present results of a large-scale empirical study analyzing the usage and attribution of non-trivial Java code snippets from SO answers in public GitHub (GH) projects. We followed three different approaches to triangulate an estimate for the ratio of unattributed usages and conducted two online surveys with software developers to complement our results. For the different sets of projects that we analyzed, the ratio of projects containing files with a reference to SO varied between 3.3% and 11.9%. We found that at most 1.8% of all analyzed repositories containing code from SO used the code in a way compatible with CC BY-SA 3.0. Moreover, we estimate that at most a quarter of the copied code snippets from SO are attributed as required. Of the surveyed developers, almost one half admitted copying code from SO without attribution and about two thirds were not aware of the license of SO code snippets and its implications.
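One ingredient of such a study, checking whether a source file references SO at all, can be sketched as follows. This is only an illustration and not the authors' pipeline: the URL pattern, the function names, and the in-memory `files` mapping are all assumptions made for the example.

```python
import re

# Hypothetical pattern (an assumption, not taken from the study): matches
# links to Stack Overflow questions or answers and captures the post ID.
SO_LINK = re.compile(
    r"https?://(?:www\.)?stackoverflow\.com/(?:questions|q|a)/(\d+)",
    re.IGNORECASE,
)


def find_so_references(source_text: str) -> list[str]:
    """Return the post IDs of all Stack Overflow links found in a file's text."""
    return SO_LINK.findall(source_text)


def attribution_ratio(files: dict[str, str]) -> float:
    """Fraction of files that contain at least one Stack Overflow link.

    `files` maps a file name to its full text (a hypothetical input format).
    """
    if not files:
        return 0.0
    attributed = sum(1 for text in files.values() if find_so_references(text))
    return attributed / len(files)
```

A link in a comment only shows that a reference exists. It does not establish that the surrounding code was copied from that post, nor that the project's license is compatible with CC BY-SA 3.0, so a count like this is a rough lower-level signal rather than a measure of proper attribution.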
- Dissertation
- 10.33915/etd.4069
- Jul 29, 2019
As the popularity of modern social coding paradigms such as Stack Overflow grows, their potential security risks increase as well (e.g., insecure code could be easily embedded and distributed). To address this largely overlooked issue, we bring a new insight to exploit social coding properties in addition to code content for automatic detection of insecure code snippets in Stack Overflow. To determine if the given code snippets are insecure, we not only analyze the code content, but also utilize various kinds of relations among users, badges, questions, answers, code snippets and keywords in Stack Overflow. To model the rich semantic relationships, we first introduce a structured heterogeneous information network (HIN) for representation and then use a meta-path based approach to incorporate higher-level semantics to build up relatedness over code snippets. Later, we propose two different novel network embedding models named Snippet2vec and CodeHin2Vec for representation learning in HIN to automate the insecure code snippet detection in Stack Overflow. More specifically, Snippet2vec learns the low dimensional representations for the nodes (i.e., code snippets) in the HIN where both the HIN structures and semantics are maximally preserved, while CodeHin2Vec utilizes HIN to depict relatedness over code snippets to generate code-to-code sequences, based on which the sequence-to-sequence (seq2seq) concept from machine translation is further leveraged to learn representations of code snippets. Accordingly, we developed the systems ICSD and iTrustSO, which respectively integrate our proposed methods for insecure code snippet detection in Stack Overflow. Comprehensive experiments on the data collections from Stack Overflow are conducted to validate the effectiveness of our developed systems by comparisons with the state-of-the-art baseline methods.
- Conference Article
274
- 10.1109/sp.2016.25
- May 1, 2016
Vulnerabilities in Android code -- including but not limited to insecure data storage, unprotected inter-component communication, broken TLS implementations, and violations of least privilege -- have enabled real-world privacy leaks and motivated research cataloguing their prevalence and impact. Researchers have speculated that appification promotes security problems, as it increasingly allows inexperienced laymen to develop complex and sensitive apps. Anecdotally, Internet resources such as Stack Overflow are blamed for promoting insecure solutions that are naively copy-pasted by inexperienced developers. In this paper, we for the first time systematically analyzed how the use of information resources impacts code security. We first surveyed 295 app developers who have published in the Google Play market concerning how they use resources to solve security-related problems. Based on the survey results, we conducted a lab study with 54 Android developers (students and professionals), in which participants wrote security- and privacy-relevant code under time constraints. The participants were assigned to one of four conditions: free choice of resources, Stack Overflow only, official Android documentation only, or books only. Those participants who were allowed to use only Stack Overflow produced significantly less secure code than those using the official Android documentation or books, while participants using the official Android documentation produced significantly less functional code than those using Stack Overflow. To assess the quality of Stack Overflow as a resource, we surveyed the 139 threads our participants accessed during the study, finding that only 25% of them were helpful in solving the assigned tasks and only 17% of them contained secure code snippets.
In order to obtain ground truth concerning the prevalence of the secure and insecure code our participants wrote in the lab study, we statically analyzed a random sample of 200,000 apps from Google Play, finding that 93.6% of the apps used at least one of the API calls our participants used during our study. We also found that many of the security errors made by our participants also appear in the wild, possibly also originating in the use of Stack Overflow to solve programming problems. Taken together, our results confirm that API documentation is secure but hard to use, while informal documentation such as Stack Overflow is more accessible but often leads to insecurity. Given time constraints and economic pressures, we can expect that Android developers will continue to choose those resources that are easiest to use; therefore, our results firmly establish the need for secure-but-usable documentation.
- Research Article
47
- 10.1109/tse.2021.3058985
- Feb 23, 2021
- IEEE Transactions on Software Engineering
Stack Overflow hosts millions of solutions that aim to solve developers' programming issues. In this crowdsourced question answering process, Stack Overflow becomes a code hosting website where developers actively share their code. However, code snippets on Stack Overflow may contain security vulnerabilities, and if shared carelessly, such snippets can introduce security problems in software systems. In this paper, we empirically study the prevalence of Common Weakness Enumeration (CWE) instances in code snippets of C/C++ related answers. We explore the characteristics of $Code_w$, i.e., code snippets that have CWE instances, in terms of the types of weaknesses, the evolution of $Code_w$, and who contributed such code snippets. We find that: 1) 36 percent (i.e., 32 out of 89) of CWE types are detected in $Code_w$ on Stack Overflow. Particularly, CWE-119, i.e., improper restriction of operations within the bounds of a memory buffer, is common in both answer code snippets and real-world software systems. Furthermore, the proportion of $Code_w$ doubled from 2008 to 2018 after normalizing by the total number of C/C++ snippets in each year. 2) In general, code revisions are associated with a reduction in the number of code weaknesses. However, the majority of $Code_w$ had weaknesses introduced in the first version of the code, and these $Code_w$ have not been revised since. Only 7.5 percent of users who contributed C/C++ code snippets posted or edited code with weaknesses.
Users contributed less code with CWE weaknesses when they were more active (i.e., they either revised more code snippets or had a higher reputation). We also find that some users tended to have the same CWE type repeatedly in their various code snippets. Our empirical study provides insights to users who share code snippets on Stack Overflow so that they are aware of the potential security issues. To understand the community feedback about improving code weaknesses by answer revisions, we also conduct a qualitative study and find that 62.5 percent of our suggested revisions are adopted by the community. Stack Overflow can perform CWE scanning for all the code that is hosted on its platform. Further research is needed to improve the quality of the crowdsourced knowledge on Stack Overflow.
- Research Article
54
- 10.1109/tse.2020.3023664
- Sep 4, 2020
- IEEE Transactions on Software Engineering
Software developers share programming solutions in Q&A sites like Stack Overflow, Stack Exchange, Android forum, and so on. The reuse of crowd-sourced code snippets can facilitate rapid prototyping. However, recent research shows that the shared code snippets may be of low quality and can even contain vulnerabilities. This paper aims to understand the nature and the prevalence of security vulnerabilities in crowd-sourced code examples. To achieve this goal, we investigate security vulnerabilities in the C++ code snippets shared on Stack Overflow over a period of 10 years. In collaborative sessions involving multiple human coders, we manually assessed each code snippet for security vulnerabilities following CWE (Common Weakness Enumeration) guidelines. From the 72,483 reviewed code snippets used in at least one project hosted on GitHub, we found a total of 99 vulnerable code snippets categorized into 31 types. Many of the investigated code snippets are still not corrected on Stack Overflow. The 99 vulnerable code snippets found in Stack Overflow were reused in a total of 2859 GitHub projects. To help improve the quality of code snippets shared on Stack Overflow, we developed a browser extension that notifies Stack Overflow users of vulnerabilities in code snippets when they see them on the platform.
- Conference Article
18
- 10.1145/3468264.3473114
- Aug 18, 2021
Stack Overflow is one of the most popular technical Q&A sites used by software developers. Seeking help from Stack Overflow has become an essential part of software developers’ daily work for solving programming-related questions. Although the Stack Overflow community has provided quality assurance guidelines to help users write better questions, we observed that a significant number of questions submitted to Stack Overflow are of low quality. In this paper, we introduce a new web-based tool, Code2Que, which can help developers in writing higher quality questions for a given code snippet. Code2Que consists of two main stages: offline learning and online recommendation. In the offline learning phase, we first collect a set of good quality ⟨code snippet, question⟩ pairs as training samples. We then train our model on these training samples via a deep sequence-to-sequence approach, enhanced with an attention mechanism, a copy mechanism and a coverage mechanism. In the online recommendation phase, for a given code snippet, we use the offline trained model to generate question titles to assist less experienced developers in writing questions more effectively. To evaluate Code2Que, we first sampled 50 low quality ⟨code snippet, question⟩ pairs from the Python and Java datasets on Stack Overflow. Then we conducted a user study to evaluate the question titles generated by our approach as compared to human-written ones using three metrics: Clearness, Fitness and Willingness to Respond. Our experimental results show that for a large number of low-quality questions in Stack Overflow, Code2Que can improve the question titles in terms of Clearness, Fitness and Willingness measures.
- Research Article
10
- 10.1016/j.jss.2024.111964
- Jan 8, 2024
- Journal of Systems and Software
An empirical study of code reuse between GitHub and stack overflow during software development
- Conference Article
3
- 10.1145/3341161.3343524
- Aug 27, 2019
Despite the apparent benefits of modern social coding paradigms such as Stack Overflow, their potential security risks have been largely overlooked (e.g., insecure code could be easily embedded and distributed). To address this imminent issue, in this paper, we bring a significant insight to leverage both social coding properties and code content for automatic detection of insecure code snippets in Stack Overflow. To determine if the given code snippets are insecure, we not only analyze the code content, but also utilize various kinds of relations among users, badges, questions, answers and code snippets in Stack Overflow. To model the rich semantic relationships, we first introduce a structured heterogeneous information network (HIN) for representation and then use a meta-path based approach to incorporate higher-level semantics to build up relatedness over code snippets. Later, we propose a novel hierarchical attention-based sequence learning model named CodeHin2Vec to seamlessly integrate node (i.e., code snippet) content with HIN-based relations for representation learning. After that, a classifier is built for insecure code snippet detection. Integrating our proposed method, an intelligent system named iTrustSO is accordingly developed to address the code security issues in modern software coding platforms. Comprehensive experiments on the data collections from Stack Overflow are conducted to validate the effectiveness of our developed system iTrustSO by comparisons with alternative methods.
- Research Article
25
- 10.1016/j.jss.2019.110505
- Dec 23, 2019
- Journal of Systems and Software
SCC++: Predicting the programming language of questions and snippets of Stack Overflow
- Research Article
125
- 10.1109/tse.2019.2900307
- Jun 20, 2018
- IEEE Transactions on Software Engineering
Online code clones are code fragments that are copied from software projects or online sources to Stack Overflow as examples. Due to the absence of a checking mechanism after the code has been copied to Stack Overflow, they can become toxic code snippets, e.g., they suffer from being outdated or violating the original software license. We present a study of online code clones on Stack Overflow and their toxicity by incorporating two developer surveys and a large-scale code clone detection. A survey of 201 high-reputation Stack Overflow answerers (33 percent response rate) showed that 131 participants (65 percent) have been notified of outdated code and 26 of them (20 percent) rarely or never fix the code. 138 answerers (69 percent) never check for licensing conflicts between their copied code snippets and Stack Overflow's CC BY-SA 3.0. A survey of 87 Stack Overflow visitors shows that they experienced several issues from Stack Overflow answers: mismatched solutions, outdated solutions, incorrect solutions, and buggy code. 85 percent of them are not aware of the CC BY-SA 3.0 license enforced by Stack Overflow, and 66 percent never check for license conflicts when reusing code snippets. Our clone detection found online clone pairs between 72,365 Java code snippets on Stack Overflow and 111 open source projects in the curated Qualitas corpus. We analysed 2,289 non-trivial online clone candidates. Our investigation revealed strong evidence that 153 clones have been copied from a Qualitas project to Stack Overflow. We found 100 of them (66 percent) to be outdated, of which 10 were buggy and harmful for reuse. Furthermore, we found 214 code snippets that could potentially violate the license of their original software and appear 7,112 times in 2,427 GitHub projects.
- Conference Article
19
- 10.1109/msr.2019.00047
- May 1, 2019
Stack Overflow (SO) is a popular Q&A forum for software developers, providing a large number of copyable code snippets. While GitHub is an independent code collaboration platform, developers often reuse SO code in their GitHub projects. In this paper, we investigate how often GitHub developers reuse code snippets from the SO forum, as well as what concepts they are more likely to reference in their code. To accomplish our goal, we mine the SOTorrent dataset, which links code snippets in SO posts to software projects hosted on GitHub. We then study the characteristics of GitHub projects that reference SO posts and which popular SO discussions can be found in GitHub projects. Our results demonstrate that on average developers make 45 references to SO posts in their projects, with the highest number of references being made within JavaScript code. We also found that 79% of the SO posts with code snippets that are referenced in GitHub code do change over time (at least once), raising code maintainability and reliability concerns.
- Conference Article
27
- 10.1109/msr.2019.00042
- May 1, 2019
Software developers all over the world use Stack Overflow (SO) to interact and exchange code snippets. Research also uses SO to harvest code snippets for use with recommendation systems. However, previous work has shown that code on SO may have quality issues, such as security or license problems. We analyse Python code on SO to determine its coding style compliance. From 1,962,535 code snippets tagged with 'python', we extracted 407,097 snippets of at least 6 statements of Python code. Surprisingly, 93.87% of the extracted snippets contain style violations, with an average of 0.7 violations per statement and a huge number of snippets with a considerably higher ratio. Researchers and developers should, therefore, be aware that code snippets on SO may not be representative of good coding style. Furthermore, while user reputation seems to be unrelated to coding style compliance, for posts with vote scores in the range between -10 and 20, we found a strong correlation (r = -0.87, p < 10^-7) between the vote score a post received and the average number of violations per statement for snippets in such posts.
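The "violations per statement" ratio can be illustrated with a small sketch. The abstract does not name the style checker used, so the single line-length rule below is a deliberately toy stand-in, and all function names are hypothetical. Only the six-statement minimum comes from the abstract.

```python
import ast

MIN_STATEMENTS = 6    # minimum snippet size reported in the abstract
MAX_LINE_LENGTH = 79  # PEP 8 line limit; the only rule this toy check applies


def count_statements(code: str) -> int:
    """Count statement nodes in Python source (raises SyntaxError if unparsable)."""
    tree = ast.parse(code)
    return sum(isinstance(node, ast.stmt) for node in ast.walk(tree))


def count_violations(code: str) -> int:
    """Toy stand-in for a real style checker: counts over-long lines only."""
    return sum(len(line) > MAX_LINE_LENGTH for line in code.splitlines())


def is_large_enough(code: str) -> bool:
    """Apply the six-statement size filter to a snippet."""
    return count_statements(code) >= MIN_STATEMENTS


def violations_per_statement(code: str) -> float:
    """Ratio of style violations to statements; 0.0 for empty snippets."""
    statements = count_statements(code)
    if statements == 0:
        return 0.0
    return count_violations(code) / statements
```

A real analysis would swap `count_violations` for a full style checker and would have to handle snippets that fail to parse, for example Python 2 code or incomplete fragments.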
- Research Article
36
- 10.1016/j.scico.2020.102516
- Jul 10, 2020
- Science of Computer Programming
Understanding stack overflow code quality: A recommendation of caution
- Conference Article
149
- 10.1145/3196398.3196430
- May 28, 2018
Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and by collecting references from GitHub files to SO posts. In this paper, we describe how we built SOTorrent, and in particular how we evaluated 134 different string similarity metrics regarding their applicability for reconstructing the version history of text and code blocks. Based on a first analysis using the dataset, we present insights into the evolution of SO posts, e.g., that post edits are usually small, happen soon after the initial creation of the post, and that code is rarely changed without also updating the surrounding text. Further, our analysis revealed a close relationship between post edits and comments. Our vision is that researchers will use SOTorrent to investigate and understand the evolution of SO posts and their relation to other platforms such as GitHub.
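Reconstructing a block-level version history amounts to deciding which block in a new post version descends from which block in the previous one. The sketch below shows one way this could work, assuming a greedy best-match strategy and using `difflib.SequenceMatcher` as the similarity metric. Both choices are illustrative stand-ins: the abstract does not say which of the 134 evaluated metrics was selected, and the 0.6 threshold is arbitrary.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for the metrics SOTorrent evaluated."""
    return SequenceMatcher(None, a, b).ratio()


def match_blocks(old_blocks, new_blocks, threshold=0.6):
    """Greedily map each new block index to its most similar unused old block index.

    Blocks whose best similarity stays below `threshold` are left unmatched
    and can be treated as newly added content.
    """
    matches = {}
    used = set()
    for new_idx, new in enumerate(new_blocks):
        best_idx, best_score = None, 0.0
        for old_idx, old in enumerate(old_blocks):
            if old_idx in used:
                continue
            score = similarity(old, new)
            if score >= threshold and score > best_score:
                best_idx, best_score = old_idx, score
        if best_idx is not None:
            matches[new_idx] = best_idx
            used.add(best_idx)
    return matches
```

Greedy matching is simple but order-dependent. An approach that also considers block type (text or code) and position in the post would likely be more robust.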
- Conference Article
18
- 10.1109/secdev.2019.00016
- Sep 1, 2019
Research demonstrates that code snippets listed on programming-oriented online forums (e.g., Stack Overflow) – including snippets containing security mistakes – make their way into production code. Prior work also shows that software developers who reference Stack Overflow in their development cycle produce less secure code. While there are many plausible explanations for why developers propagate insecure code in this manner, there is little or no empirical evidence. To address this question, we identify Stack Overflow code snippets that contain security errors and find clones of these snippets in open source GitHub repositories. We then survey (n=133) and interview (n=15) the authors of these GitHub repositories to explore how and why these errors were introduced. We find that some developers (perhaps mistakenly) trust their security skills to validate the code they import, but the majority admit they would need to learn more about security before they could properly perform such validation. Further, although some prioritize functionality over security, others believe that ensuring security is not, or should not be, their responsibility. Our results have implications for attempts to ameliorate the propagation of this insecure code.