Stack Overflow: A code laundering platform?
Developers use Question and Answer (Q&A) websites to exchange knowledge and expertise. Stack Overflow is a popular Q&A website where developers discuss coding problems and share code examples. Although all Stack Overflow posts are free to access, code examples on Stack Overflow are governed by the Creative Commons Attribution-ShareAlike 3.0 Unported license that developers should obey when reusing code from Stack Overflow or posting code to Stack Overflow. In this paper, we conduct a case study with 399 Android apps, to investigate whether developers respect license terms when reusing code from Stack Overflow posts (and the other way around). We found 232 code snippets in 62 Android apps from our dataset that were potentially reused from Stack Overflow, and 1,226 Stack Overflow posts containing code examples that are clones of code released in 68 Android apps, suggesting that developers may have copied the code of these apps to answer Stack Overflow questions. We investigated the licenses of these pieces of code and observed 1,279 cases of potential license violations (related to code posting to Stack Overflow or code reuse from Stack Overflow). This paper aims to raise the awareness of the software engineering community about potential unethical code reuse activities taking place on Q&A websites like Stack Overflow.
- Research Article
31
- 10.1145/3550150
- Apr 26, 2023
- ACM Transactions on Software Engineering and Methodology
Stack Overflow has been heavily used by software developers to seek programming-related information. More and more developers use Community Question and Answer forums, such as Stack Overflow, to search for code examples of how to accomplish a certain coding task. This is often considered to be more efficient than working from source documentation, tutorials, or full worked examples. However, due to the complexity of these online Question and Answer forums and the very large volume of information they contain, developers can be overwhelmed by the sheer volume of available information. This makes it hard to find and/or even be aware of the most relevant code examples to meet their needs. To alleviate this issue, in this work, we present a query-driven code recommendation tool, named Que2Code, that identifies the best code snippets for a user query from Stack Overflow posts. Our approach has two main stages: (i) semantically equivalent question retrieval and (ii) best code snippet recommendation. During the first stage, for a given query question formulated by a developer, we first generate paraphrase questions for the input query as a way of query boosting and then retrieve the relevant Stack Overflow posted questions based on these generated questions. In the second stage, we collect all of the code snippets within questions retrieved in the first stage and develop a novel scheme to rank code snippet candidates from Stack Overflow posts via pairwise comparisons. To evaluate the performance of our proposed model, we conduct a large-scale experiment to evaluate the effectiveness of the semantically equivalent question retrieval task and best code snippet recommendation task separately on Python and Java datasets in Stack Overflow. We also perform a human study to measure how real-world developers perceive the results generated by our model.
Both the automatic and human evaluation results demonstrate the promising performance of our model, and we have released our code and data to assist other researchers.
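The second stage described above ranks candidates through pairwise comparisons. The following is a minimal sketch of that general idea only, not the Que2Code implementation: it assumes a comparator `prefer(a, b)` is available (in practice this would be a learned model) and orders candidates by how many comparisons each one wins. The names `rank_by_pairwise_wins` and `toy_prefer` are hypothetical.

```python
from itertools import combinations
from typing import Callable, List


def rank_by_pairwise_wins(candidates: List[str],
                          prefer: Callable[[str, str], bool]) -> List[str]:
    """Order candidates by the number of pairwise comparisons each one wins."""
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        # prefer(a, b) is True when a is judged better than b.
        if prefer(candidates[i], candidates[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    # sorted() is stable, so tied candidates keep their original order.
    order = sorted(range(len(candidates)), key=lambda i: -wins[i])
    return [candidates[i] for i in order]


def toy_prefer(a: str, b: str) -> bool:
    # Hypothetical stand-in for a learned comparator: prefer the shorter snippet.
    return len(a) < len(b)


snippets = ["for x in xs: print(x)", "print(*xs)", "list(map(print, xs))"]
ranked = rank_by_pairwise_wins(snippets, toy_prefer)
print(ranked)
```

Counting wins is only one way to aggregate pairwise judgments; it needs a comparison for every pair, so the cost grows quadratically with the number of candidates.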
- Conference Article
183
- 10.1145/3180155.3180260
- May 27, 2018
Programmers often consult an online Q&A forum such as Stack Overflow to learn new APIs. This paper presents an empirical study on the prevalence and severity of API misuse on Stack Overflow. To reduce manual assessment effort, we design ExampleCheck, an API usage mining framework that extracts patterns from over 380K Java repositories on GitHub and subsequently reports potential API usage violations in Stack Overflow posts. We analyze 217,818 Stack Overflow posts using ExampleCheck and find that 31% may have potential API usage violations that could produce unexpected behavior such as program crashes and resource leaks. Such API misuse has three main causes: missing control constructs, missing or incorrect order of API calls, and incorrect guard conditions. Even the posts that are accepted as correct answers or upvoted by other programmers are not necessarily more reliable than other posts in terms of API misuse. This study result calls for a new approach to augment Stack Overflow with alternative API usage details that are not typically shown in curated examples.
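One of the misuse categories named above is a missing or incorrectly ordered API call. The sketch below illustrates that single category under strong simplifying assumptions: a snippet is reduced to a flat list of call names, and a pattern is an ordered list of calls that must all appear, in order, once the first call is used. This is not ExampleCheck's algorithm; the functions `follows_pattern` and `find_violations` and the call names are hypothetical.

```python
from typing import List, Sequence


def follows_pattern(calls: Sequence[str], pattern: Sequence[str]) -> bool:
    """Return True if `pattern` occurs in `calls` as an ordered subsequence."""
    remaining = iter(calls)
    # `step in remaining` advances the iterator to the first match,
    # so later steps can only match calls that come afterwards.
    return all(step in remaining for step in pattern)


def find_violations(calls: Sequence[str],
                    patterns: List[List[str]]) -> List[List[str]]:
    """Report patterns whose first call is used but whose sequence is incomplete."""
    return [p for p in patterns
            if p[0] in calls and not follows_pattern(calls, p)]


# Hypothetical pattern: a stream that is opened and read should also be closed.
patterns = [["FileInputStream.new", "read", "close"]]

leaky_snippet = ["FileInputStream.new", "read"]
safe_snippet = ["FileInputStream.new", "read", "println", "close"]

print(find_violations(leaky_snippet, patterns))
print(find_violations(safe_snippet, patterns))
```

A realistic checker would also need control-flow information to cover the other two categories (missing control constructs and incorrect guard conditions), which a flat call list cannot express.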
- Conference Article
24
- 10.1109/aswec.2018.00027
- Nov 1, 2018
Solutions provided in Question and Answer (Q&A) websites such as Stack Overflow are regularly used in Open Source Software (OSS). However, many developers are unaware that both Stack Overflow and OSS are governed by licenses. Hence, developers reusing code from Stack Overflow for their OSS projects may violate licensing agreements if their attributions are not correct. Additionally, if code migrates from one OSS through Stack Overflow to another OSS, then complex licensing issues are likely to exist. Such forms of software reuse also have implications for future software maintenance, particularly where developers have poor understanding of copied code. This paper investigates code reuse between these two platforms (i.e., Stack Overflow and OSS), with the aim of providing insights into this issue. This study mined 151,946 Java code snippets from Stack Overflow, 16,617 Java files from 12 of the top weekly listed projects on SourceForge and GitHub, and 39,616 Java files from the top 20 most popular Java projects on SourceForge. Our analyses were aimed at finding the number of clones (indicating reuse) (a) within Stack Overflow posts, (b) between Stack Overflow and popular Java OSS projects, and (c) between the projects. Outcomes reveal that there was up to 3.3% code reuse within Stack Overflow, while 1.0% of Stack Overflow code was reused in recent popular Java projects and 2.3% in those projects that were more established. Reuse across projects was much higher, accounting for as much as 77.2%. Our outcomes have implications for strategies aimed at introducing strict quality assurance measures to ensure the appropriateness of code reuse, and for awareness of licensing requirements.
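The abstract above does not name the clone detector it used. As a rough illustration of how clones between a snippet and a source file might be flagged, the sketch below compares token 3-grams with Jaccard similarity. This is a swapped-in, simplified technique, not the study's method; the tokenizer regex, the shingle size `k`, and any similarity threshold are assumptions.

```python
import re
from typing import List, Set, Tuple


def tokens(code: str) -> List[str]:
    """Split code into identifiers, numbers, and single punctuation characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def shingles(code: str, k: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of overlapping k-token windows in the code."""
    toks = tokens(code)
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the two snippets' token shingles (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Whitespace differences vanish after tokenization.
print(jaccard("int sum = a + b;", "int  sum=a+b ;"))
# Unrelated statements share no shingles.
print(jaccard("int sum = a + b;", "return x;"))
```

Token-level comparison like this catches exact and whitespace-level copies; detecting clones with renamed identifiers would require normalizing identifiers before shingling.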
- Research Article
24
- 10.1145/3635711
- Mar 15, 2024
- ACM Transactions on Software Engineering and Methodology
The tremendous success of Stack Overflow has accumulated an extensive corpus of software engineering knowledge, thus motivating researchers to propose various solutions for analyzing its content. The performance of such solutions hinges significantly on the selection of representation models for Stack Overflow posts. As the volume of literature on Stack Overflow continues to burgeon, it highlights the need for a powerful Stack Overflow post representation model and drives researchers’ interest in developing specialized representation models that can adeptly capture the intricacies of Stack Overflow posts. The state-of-the-art (SOTA) Stack Overflow post representation models are Post2Vec and BERTOverflow, which are built upon neural networks such as convolutional neural network and transformer architecture (e.g., BERT). Despite their promising results, these representation methods have not been evaluated in the same experimental setting. To fill the research gap, we first empirically compare the performance of the representation models designed specifically for Stack Overflow posts (Post2Vec and BERTOverflow) in a wide range of related tasks (i.e., tag recommendation, relatedness prediction, and API recommendation). The results show that Post2Vec cannot further improve the SOTA techniques of the considered downstream tasks, and BERTOverflow shows surprisingly poor performance. To find more suitable representation models for the posts, we further explore a diverse set of transformer-based models, including (1) general domain language models (RoBERTa, Longformer, and GPT2) and (2) language models built with software engineering related textual artifacts (CodeBERT, GraphCodeBERT, seBERT, CodeT5, PLBart, and CodeGen). This exploration shows that models like CodeBERT and RoBERTa are suitable for representing Stack Overflow posts. However, it also illustrates the “No Silver Bullet” concept, as none of the models consistently wins against all the others. 
Inspired by the findings, we propose SOBERT, which employs a simple yet effective strategy to improve the representation models of Stack Overflow posts by continuing the pre-training phase with the textual artifact from Stack Overflow. The overall experimental results demonstrate that SOBERT can consistently outperform the considered models and increase the SOTA performance significantly for all the downstream tasks.
- Research Article
47
- 10.1016/j.infsof.2020.106367
- Jun 25, 2020
- Information and Software Technology
PostFinder: Mining Stack Overflow posts to support software developers
- Conference Article
10
- 10.1109/msr52588.2021.00040
- May 1, 2021
Stack Overflow is a popular Q&A forum for software developers, providing a large number of copyable code snippets. While GitHub is a collaborative development platform, developers often reuse Stack Overflow code in their GitHub projects. These snippets get revised or edited on each platform. In this work, we study Stack Overflow posts and the code snippets that are reused from these posts in GitHub projects. We investigate and compare the change history of SO snippets with the change history of GitHub snippets. We have applied a stratified random sampling when mining 440,000 GitHub projects to create a dataset representing the change history of the reused snippets; this dataset contains 22,900 GitHub projects, 33,765 Stack Overflow references mapped to 4,634 Stack Overflow posts, and a total of 73,322 commits. We analyze the evolution patterns of snippets on each platform, compare key trends and explore the co-change of these snippets. Our results demonstrate that 76% of snippets evolve on Stack Overflow, while only 22% of the reused code snippets evolve in GitHub. Stack Overflow snippets undergo fewer and smaller changes compared to their evolving counterparts on GitHub. The evolution of snippets on both platforms is driven by the original author of the content. Finally, we found that a small percentage of snippets is co-changing across two platforms, while snippets in GitHub and Stack Overflow evolve independently of one another.
- Research Article
8
- 10.1145/3691628
- Jan 20, 2025
- ACM Transactions on Software Engineering and Methodology
Internet of Things (IoT) is defined as the connection between places and physical objects (i.e., things) over the internet/network via smart computing devices. IoT is a rapidly emerging paradigm that now encompasses almost every aspect of our modern life. As these devices differ from traditional computing, it is important to understand the challenges IoT developers face while implementing proper security measures in their IoT devices. We observed that IoT software developers share solutions to programming questions as code examples on three Stack Exchange Q&A sites: Stack Overflow (SO), Arduino, and Raspberry Pi. Previous research studies found vulnerabilities/weaknesses in C/C++ code examples shared in SO. However, the studies did not investigate C/C++ code examples related to IoT. The studies investigated SO code examples only. In this article, we conduct a large-scale empirical study of all IoT C/C++ code examples shared in the three Stack Exchange sites, i.e., SO, Arduino, and Raspberry Pi. From the 11,329 obtained code snippets from the three sites, we identify 29 distinct Common Weakness Enumeration (CWE) types in 609 snippets. These CWE types can be categorized into eight general weakness categories, and we observe that evaluation, memory, and initialization-related weaknesses are the most common to be introduced by users when posting programming solutions. Furthermore, we find that 39.58% of the vulnerable code snippets contain instances of CWE types that can be mapped to real-world occurrences of those CWE types (i.e., CVE instances). The largest number of vulnerable IoT code examples was found in Arduino, followed by SO and Raspberry Pi. Memory-type vulnerabilities are on the rise in the sites. For example, from the 3,595 mapped CVE instances, we find that 28.99% result in Denial of Service (DoS) errors, which is particularly harmful for network-reliant IoT devices such as smart cars.
Our study results can make various IoT stakeholders aware of such vulnerable IoT code examples and can inform IoT researchers as they develop tools that help prevent developers from sharing such vulnerable code examples on the sites.
- Conference Article
19
- 10.1109/msr.2019.00047
- May 1, 2019
Stack Overflow (SO) is a popular Q&A forum for software developers, providing a large number of copyable code snippets. While GitHub is an independent code collaboration platform, developers often reuse SO code in their GitHub projects. In this paper, we investigate how often GitHub developers reuse code snippets from the SO forum, as well as what concepts they are more likely to reference in their code. To accomplish our goal, we mine the SOTorrent dataset, which links code snippets in SO posts to software projects hosted on GitHub. We then study the characteristics of GitHub projects that reference SO posts and what popular SO discussions can be found in GitHub projects. Our results demonstrate that on average developers make 45 references to SO posts in their projects, with the highest number of references being made within the JavaScript code. We also found that 79% of the SO posts with code snippets that are referenced in GitHub code do change over time (at least once), raising code maintainability and reliability concerns.
- Conference Article
42
- 10.1145/3338906.3341174
- Aug 12, 2019
Application Programming Interfaces (APIs) in software libraries play an important role in modern software development. Although most libraries provide API documentation as a reference, developers may find it difficult to directly search for appropriate APIs in documentation using the natural language description of the programming tasks. We call this phenomenon the knowledge gap, which refers to the fact that API documentation mainly describes API functionality and structure but lacks other types of information like concepts and purposes. In this paper, we propose a Java API recommendation tool named BIKER (Bi-Information source based KnowledgE Recommendation) to bridge the knowledge gap. We implement BIKER as a search engine website. Given a query in natural language, instead of directly searching API documentation, BIKER first searches for similar API-related questions on Stack Overflow to extract candidate APIs. Then, BIKER ranks them by considering the query’s similarity with both Stack Overflow posts and API documentation. Finally, to help developers better understand why each API is recommended and how to use it in practice, BIKER summarizes and presents supplementary information (e.g., API description, code examples in Stack Overflow posts) for each recommended API. Our quantitative evaluation and user study demonstrate that BIKER can help developers find appropriate APIs more efficiently and precisely.
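The abstract above says candidate APIs are ranked using the query's similarity with both Stack Overflow posts and API documentation, but it does not say how the two similarities are combined. The sketch below assumes a harmonic mean, which rewards candidates that score reasonably on both sources; that choice, the function name `combined_score`, and the similarity values are illustrative assumptions rather than BIKER's documented behavior.

```python
from typing import Dict, List, Tuple


def combined_score(sim_so: float, sim_doc: float) -> float:
    """Harmonic mean of the two similarities (assumed combination rule)."""
    if sim_so + sim_doc == 0:
        return 0.0
    return 2 * sim_so * sim_doc / (sim_so + sim_doc)


# Candidate API -> (similarity to SO posts, similarity to API documentation).
# The numbers are invented for illustration only.
candidates: Dict[str, Tuple[float, float]] = {
    "java.util.Collections.sort": (0.9, 0.6),
    "java.util.Arrays.asList": (0.8, 0.1),
    "java.util.List.size": (0.3, 0.3),
}

ranked: List[str] = sorted(
    candidates,
    key=lambda api: combined_score(*candidates[api]),
    reverse=True,
)
print(ranked)
```

With a harmonic mean, a candidate that matches Stack Overflow posts strongly but the documentation weakly (the second entry) falls below one with moderate agreement on both sources.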
- Conference Article
2
- 10.1109/scam.2019.00025
- Sep 1, 2019
Stack Overflow is the most popular question and answer website on computer programming with more than 2.5M users, 16M questions, and a new answer posted, on average, every five seconds. This wide availability of data led researchers to develop techniques to mine Stack Overflow posts. The aim is to find and recommend posts with information useful to developers. However, and not surprisingly, not every Stack Overflow post is useful from a developer's perspective. We empirically investigate what the characteristics of "useful" Stack Overflow posts are. The underlying assumption of our study is that posts that were used (referenced in the source code) in the past by developers are likely to be useful. We refer to these posts as leveraged posts. We study the characteristics of leveraged posts as opposed to the non-leveraged ones, focusing on community aspects (e.g., the reputation of the user who authored the post), the quality of the included code snippets (e.g., complexity), and the quality of the post's textual content (e.g., readability). Then, we use these features to build a prediction model to automatically identify posts that are likely to be leveraged by developers. Results of the study indicate that post meta-data (e.g., the number of comments received by the answer) is particularly useful to predict whether it has been leveraged or not, whereas code readability appears to be less useful. A classifier can classify leveraged posts with a precision of 65% and recall of 49% and non-leveraged ones with a precision of 95% and recall of 97%. This opens the road towards an automatic identification of "high-quality content" in Stack Overflow.
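For readers less familiar with the metrics reported above, the sketch below shows how precision and recall follow from confusion-matrix counts. The counts are invented so that the results land near the reported 65% and 49% for leveraged posts; they are not the study's data.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Compute precision and recall from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts for the "leveraged" class:
# 49 leveraged posts found, 26 false alarms, 51 leveraged posts missed.
precision, recall = precision_recall(tp=49, fp=26, fn=51)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Read this way, a recall of 49% means roughly half of the leveraged posts are missed, even though about two thirds of the posts flagged as leveraged are correct.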
- Conference Article
4
- 10.1109/icst53961.2022.00030
- Apr 1, 2022
Runtime Exceptions (REs) are an important class of bugs that occur frequently during code development. Traditional Automatic Program Repair (APR) tools are of limited use in this “in-development” use case, since they require a test-suite to be available as a patching oracle. Thus, developers typically tend to manually resolve their in-development REs, often by referring to technical forums, such as Stack Overflow (SO). To automate this manual process we extend our previous work, MaesTro, to provide real-time assistance to developers for repairing Java REs by recommending a relevant patch-suggesting SO post and synthesizing a repair patch from this post to fix the RE in the developer's code. Maestro exploits a library of Runtime Exception Patterns (REPs) semi-automatically mined from SO posts, through a relatively inexpensive, one-time, incremental process. An REP is an abstracted sequence of statements that triggers a given RE. REPs are used to index SO posts, retrieve a post most relevant to the RE instance exhibited by a developer's code and then mediate the process of extracting a concrete repair from the SO post, abstracting out post-specific details, and concretizing the repair to the developer's buggy code. We evaluate MaesTro on a published RE benchmark comprised of 78 instances. Maestro is able to generate a correct repair patch at the top position in 27% of the cases, within the top-3 in 40% of the cases and overall return a useful artifact in 81% of the cases. Further, the use of REPs proves instrumental to all aspects of Maestro's performance, from ranking and searching of SO posts to synthesizing patches from a given post. In particular, 45% of correct patches generated by MaesTro could not be produced by a baseline technique not using REPs, even when provided with Maestro's SO-post ranking. Maestro is also fast, needing around 1 second, on average, to generate its output. 
Overall, these results indicate that Maestro can provide effective real-time assistance to developers in repairing REs.
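The "top position" and "top-3" figures above are top-k success rates. As a small sketch of how such a rate is computed, assume each benchmark instance records the 1-based rank of the first correct patch, or `None` when no correct patch was produced. The function name `top_k_rate` and the ranks below are made up for illustration and are not the paper's results.

```python
from typing import List, Optional


def top_k_rate(ranks: List[Optional[int]], k: int) -> float:
    """Fraction of instances whose first correct patch is ranked within the top k."""
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


# Hypothetical ranks for eight instances; None means no correct patch was found.
ranks = [1, 3, None, 2, 1, 5, None, 1]
print(top_k_rate(ranks, 1))
print(top_k_rate(ranks, 3))
```

By construction the rate can only stay the same or grow as k increases, which is why a top-3 figure is never lower than the top-1 figure.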
- Conference Article
3
- 10.1109/saner53432.2022.00035
- Mar 1, 2022
Selecting an appropriate library for reuse within a vast software ecosystem can be a daunting task. A list of features for each library, i.e., a short description of the functionality that can be reused with code examples that illustrate its usage, may alleviate this problem. In this paper, we propose a data-driven approach that uses both the code snippets and the accompanying natural language descriptions from Stack Overflow posts to produce a list of features of a given library. Each extracted feature corresponds to a cluster of API classes and methods considered related based on attributes of the Stack Overflow posts in which they appear. We evaluated the approach considering seven Maven libraries and compared the resulting features against library descriptions from cookbook-like tutorials. The approach achieves an average accuracy of 67% across the seven libraries for the tutorial-like features. For at least 73% of the features extracted by the approach but missing from the documentation, we found a matching library usage in a corpus of GitHub projects. These results suggest that our clusters represent library features, which paves the way to better tool support for documenting software libraries and for selecting a library in an ecosystem.
- Research Article
29
- 10.1109/tse.2021.3093761
- Sep 1, 2022
- IEEE Transactions on Software Engineering
Past studies have proposed solutions that analyze Stack Overflow content to help users find desired information or aid various downstream software engineering tasks. A common step performed by those solutions is to extract suitable representations of posts; typically, in the form of meaningful vectors. These vectors are then used for different tasks, for example, tag recommendation, relatedness prediction, post classification, and API recommendation. Intuitively, the quality of the vector representations of posts determines the effectiveness of the solutions in performing the respective tasks. In this work, to aid existing studies that analyze Stack Overflow posts, we propose a specialized deep learning architecture Post2Vec which extracts distributed representations of Stack Overflow posts. Post2Vec is aware of different types of content present in Stack Overflow posts, i.e., title, description, and code snippets, and integrates them seamlessly to learn post representations. Tags provided by Stack Overflow users that serve as a common vocabulary that captures the semantics of posts are used to guide Post2Vec in its task. To evaluate the quality of Post2Vec's deep learning architecture, we first investigate its end-to-end effectiveness in tag recommendation task. The results are compared to those of state-of-the-art tag recommendation approaches that also employ deep neural networks. We observe that Post2Vec achieves 15-25 percent improvement in terms of F1-score@5 at a lower computational cost. Moreover, to evaluate the value of representations learned by Post2Vec, we use them for three other tasks, i.e., relatedness prediction, post classification, and API recommendation. We demonstrate that the representations can be used to boost the effectiveness of state-of-the-art solutions for the three tasks by substantial margins (by 10, 7, and 10 percent in terms of F1-score, F1-score, and correctness, respectively). 
We release our replication package at https://github.com/maxxbw/Post2Vec.
- Conference Article
57
- 10.1109/icsme.2014.88
- Sep 1, 2014
While many tutorials, code examples, and documentation about Android APIs exist, developers still face various problems with the implementation of Android Apps. Many of these issues are discussed on Q&A sites, such as Stack Overflow. In this paper we present a manual categorization of 450 Android-related Stack Overflow posts concerning their question and problem types. The idea is to find dependencies between certain problems and question types to get better insights into issues of Android App development. The categorization is developed using card sorting with three experienced Android App developers. An initial approach to automate the classification of Stack Overflow posts using Lucene is also presented. The study highlights that the most common question types are 'How to?' and 'What is the problem?'. The problems that are discussed most often are related to 'User Interface' and 'Core Elements'. In particular, the problem category 'Layout' is often related to 'What is the problem?' and 'Frameworks' issues often come with 'Is it possible?' questions.
- Research Article
46
- 10.1016/j.infsof.2020.106277
- Feb 8, 2020
- Information and Software Technology
Mining API usage scenarios from Stack Overflow