Code Reuse in Stack Overflow and Popular Open Source Java Projects
Solutions provided in Question and Answer (Q&A) websites such as Stack Overflow are regularly used in Open Source Software (OSS). However, many developers are unaware that both Stack Overflow and OSS are governed by licenses. Hence, developers reusing code from Stack Overflow for their OSS projects may violate licensing agreements if their attributions are not correct. Additionally, if code migrates from one OSS through Stack Overflow to another OSS, then complex licensing issues are likely to exist. Such forms of software reuse also have implications for future software maintenance, particularly where developers have poor understanding of copied code. This paper investigates code reuse between these two platforms (i.e., Stack Overflow and OSS), with the aim of providing insights into this issue. This study mined 151,946 Java code snippets from Stack Overflow, 16,617 Java files from 12 of the top weekly listed projects on SourceForge and GitHub, and 39,616 Java files from the top 20 most popular Java projects on SourceForge. Our analyses were aimed at finding the number of clones (indicating reuse) (a) within Stack Overflow posts, (b) between Stack Overflow and popular Java OSS projects, and (c) between the projects. Outcomes reveal that there was up to 3.3% code reuse within Stack Overflow, while 1.0% of Stack Overflow code was reused in recent popular Java projects and 2.3% in those projects that were more established. Reuse across projects was much higher, accounting for as much as 77.2%. Our outcomes have implications for strategies aimed at introducing strict quality assurance measures to ensure the appropriateness of code reuse, and for awareness of licensing requirements.
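The abstract does not say which clone detector the study used, so the following is only a minimal sketch of how a reuse percentage of this kind could be computed. It assumes a simple token-set Jaccard comparison; the function names and the 0.8 threshold are hypothetical choices for illustration, not the authors' method.

```python
import re


def tokenize(code: str) -> set[str]:
    """Split Java-like source into a set of identifier, number, and symbol tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two code fragments (0.0 to 1.0)."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def reuse_rate(snippets: list[str], files: list[str], threshold: float = 0.8) -> float:
    """Fraction of snippets that have at least one file at or above the threshold.

    The threshold is an assumed value; a real clone detector would be tuned
    and would typically compare fragments rather than whole files.
    """
    if not snippets:
        return 0.0
    reused = sum(
        1 for s in snippets if any(jaccard(s, f) >= threshold for f in files)
    )
    return reused / len(snippets)


if __name__ == "__main__":
    snippets = ["int a = 1;", "String s;"]
    files = ["int a = 1;"]
    print(reuse_rate(snippets, files))
```

A set-based comparison ignores token order, so it overestimates similarity for reordered code; it is used here only because it keeps the idea of "clone implies reuse" easy to follow.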
- Research Article
6
- 10.17485/ijst/2016/v9i21/89198
- Jun 22, 2016
- Indian Journal of Science and Technology
Objectives: A cross-repository analysis between Open Source Software (OSS) and a Community Question Answering (CQA) site is presented in order to speed up the OSS development process. Methods/Analysis: OSS development is becoming popular nowadays due to the fact that source code, developer specifications, and bug lists are made available online to the public. Anyone can contribute to the development of software by referring to these files. Similarly, Stack Overflow is an interactive CQA site that caters to programming-related questions and their answers online and has turned into a repository of software engineering knowledge. In order to track the correlation of such sites with software development tasks, we employ the two repositories to find the semantic similarity between bugs and Question and Answer (Q&A) posts posted on OSS projects and Stack Overflow, respectively. The semantic similarity is analyzed by integrating the contents of the repositories based on a text mining approach. The relationship between a bug and a Q&A post is established through the semantic similarity and metadata features. Findings: The statistics of our analysis are presented for five OSS projects in terms of number of bugs and average bug fix time. The statistical results show that the bug fix time can be reduced by posting the bugs to Stack Overflow. Application/Improvement: The presented approach can be utilized to find similar Q&A posts for a reported OSS bug and helps developers of OSS projects resolve bugs quickly by leveraging users' programming skills in the form of Q&A posts. Keywords: Open Source Software, Community Question Answering, Stack Overflow, Cross Repository Analysis, Bug Tracking System, Bug Fixing
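To make the idea of matching a bug report to Q&A posts concrete, here is a minimal sketch. It assumes a bag-of-words cosine similarity, which is a lexical stand-in and not the semantic similarity measure of the paper (the abstract does not specify it); the function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def most_similar_post(bug: str, posts: list[str]) -> str:
    """Return the post with the highest similarity to the bug text.

    Assumes `posts` is non-empty.
    """
    return max(posts, key=lambda p: cosine(bug, p))


if __name__ == "__main__":
    bug = "NullPointerException in parser"
    posts = ["How to center a div", "parser throws NullPointerException"]
    print(most_similar_post(bug, posts))
```

A fuller pipeline would also weigh the metadata features the abstract mentions; those are omitted here because the abstract does not describe them.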
- Conference Article
90
- 10.1109/saner.2017.7884629
- Feb 1, 2017
Developers use Question and Answer (Q&A) websites to exchange knowledge and expertise. Stack Overflow is a popular Q&A website where developers discuss coding problems and share code examples. Although all Stack Overflow posts are free to access, code examples on Stack Overflow are governed by the Creative Commons Attribution-ShareAlike 3.0 Unported license that developers should obey when reusing code from Stack Overflow or posting code to Stack Overflow. In this paper, we conduct a case study with 399 Android apps, to investigate whether developers respect license terms when reusing code from Stack Overflow posts (and the other way around). We found 232 code snippets in 62 Android apps from our dataset that were potentially reused from Stack Overflow, and 1,226 Stack Overflow posts containing code examples that are clones of code released in 68 Android apps, suggesting that developers may have copied the code of these apps to answer Stack Overflow questions. We investigated the licenses of these pieces of code and observed 1,279 cases of potential license violations (related to code posting to Stack Overflow or code reuse from Stack Overflow). This paper aims to raise the awareness of the software engineering community about potential unethical code reuse activities taking place on Q&A websites like Stack Overflow.
- Research Article
1
- 10.1109/tse.2025.3572027
- Jul 1, 2025
- IEEE Transactions on Software Engineering
Developers reuse programming-related knowledge (e.g., code snippets) on Q&A sites (e.g., Stack Overflow) that functionally matches the programming problems they encounter in their development. Despite extensive research on Q&A sites, being a high-level and important type of development-related knowledge, architectural solutions (e.g., architecture tactics) and their reuse are rarely explored. To fill this gap, we conducted a mixed-methods study that includes a mining study and a survey study. For the mining study, we mined 984 commits and issues (i.e., 821 commits and 163 issues) from 893 Open-Source Software (OSS) projects on GitHub that explicitly referenced architectural solutions from Stack Overflow (SO) and Software Engineering Stack Exchange (SWESE). For the survey study, we identified practitioners involved in the reuse of these architectural solutions and surveyed 227 of them to further understand how practitioners reuse architectural solutions from Q&A sites in their OSS development. 
Our main findings are that: (1) OSS practitioners reuse architectural solutions from Q&A sites to solve a large variety (15 categories) of architectural problems, wherein Component design issue, Architectural anti-pattern, and Security issue are dominant; (2) Seven categories of architectural solutions from Q&A sites have been reused to solve those problems, among which Architectural refactoring, Use of frameworks, and Architectural tactic are the three most reused architectural solutions; (3) OSS developers often rely on ad hoc ways (e.g., informal, improvised, or unstructured approaches) to reuse architectural solutions from SO, drawing on personal experience and intuition rather than standardized or systematic practices; (4) Reusing architectural solutions from SO comes with a variety of challenges, e.g., OSS practitioners complain that they need to spend significant time to adapt such architectural solutions to address design concerns raised in their OSS development, and it is challenging to reuse architectural solutions that are not tailored to the design context of their OSS projects.
Our findings pave the way for future research directions, including the design and development of approaches and tools (such as IDE plugin tools) to facilitate the reuse of architectural solutions from Q&A sites, and could also be used to offer guidelines to practitioners when they contribute architectural solutions to Q&A sites. Our dataset is publicly available at https://doi.org/10.5281/zenodo.10936098.
- Conference Article
2
- 10.1109/fie.2010.5673437
- Oct 1, 2010
This panel will present several experiences in involving students in Open Source Software (OSS) projects from the perspectives of both the instructor and a member of the OSS community. OSS is growing rapidly and gaining market share in both industry (e.g., Linux and Mozilla) and academia (e.g., Moodle, Greenfoot, and Drupal). OSS projects have a culture built on volunteer participation to support software development. Computing degree programs desire to involve students in large-scale software projects to provide students with real-world experience and an understanding of the issues found in large, complex software projects. Involving computing students in OSS projects serves the OSS community by providing development resources for the project while also serving the academic community by providing access to large software projects in which students can gain experience. However, the marriage of student and OSS project presents some challenges including identification of approachable OSS projects, creation of appropriate educational infrastructure, evaluation and grading, and more. Panelists will address the factors that contribute to student success in an OSS project.
- Research Article
28
- 10.1007/s12525-012-0088-0
- May 13, 2012
- Electronic Markets
This paper studies the success of Open Source Software (OSS) projects in attracting developer interest and achieving project efficiency. The focus of our study is on examining the relationship between the four sets of capabilities proposed in the Theory of Competency Rallying (TCR) and the success of OSS projects. The data collected from 607 OSS projects mainly confirm that the capabilities proposed in the TCR are necessary for the success of OSS projects. The results of this study show that in order to succeed, OSS projects should constantly identify their market’s quality and functionality needs. The ability of OSS project managers to know which developers possess certain skills required to meet a particular market need is also found to be critical. Another capability that is recognised to be crucial in predicting project success is the ability of OSS developers to effectively address market needs and continuously learn from such experiences. Finally, the ability of stakeholders involved in addressing a particular market need to efficiently collaborate and fulfil that specific market need is found to be another essential capability required for OSS projects to succeed. Implications of the results for practitioners and the research community are presented.
- Research Article
24
- 10.1145/3635711
- Mar 15, 2024
- ACM Transactions on Software Engineering and Methodology
The tremendous success of Stack Overflow has accumulated an extensive corpus of software engineering knowledge, thus motivating researchers to propose various solutions for analyzing its content. The performance of such solutions hinges significantly on the selection of representation models for Stack Overflow posts. As the volume of literature on Stack Overflow continues to burgeon, it highlights the need for a powerful Stack Overflow post representation model and drives researchers’ interest in developing specialized representation models that can adeptly capture the intricacies of Stack Overflow posts. The state-of-the-art (SOTA) Stack Overflow post representation models are Post2Vec and BERTOverflow, which are built upon neural networks such as convolutional neural network and transformer architecture (e.g., BERT). Despite their promising results, these representation methods have not been evaluated in the same experimental setting. To fill the research gap, we first empirically compare the performance of the representation models designed specifically for Stack Overflow posts (Post2Vec and BERTOverflow) in a wide range of related tasks (i.e., tag recommendation, relatedness prediction, and API recommendation). The results show that Post2Vec cannot further improve the SOTA techniques of the considered downstream tasks, and BERTOverflow shows surprisingly poor performance. To find more suitable representation models for the posts, we further explore a diverse set of transformer-based models, including (1) general domain language models (RoBERTa, Longformer, and GPT2) and (2) language models built with software engineering related textual artifacts (CodeBERT, GraphCodeBERT, seBERT, CodeT5, PLBart, and CodeGen). This exploration shows that models like CodeBERT and RoBERTa are suitable for representing Stack Overflow posts. However, it also illustrates the “No Silver Bullet” concept, as none of the models consistently wins against all the others. 
Inspired by the findings, we propose SOBERT, which employs a simple yet effective strategy to improve the representation models of Stack Overflow posts by continuing the pre-training phase with the textual artifact from Stack Overflow. The overall experimental results demonstrate that SOBERT can consistently outperform the considered models and increase the SOTA performance significantly for all the downstream tasks.
- Conference Article
2
- 10.1109/scam.2019.00025
- Sep 1, 2019
Stack Overflow is the most popular question and answer website on computer programming with more than 2.5M users, 16M questions, and a new answer posted, on average, every five seconds. This wide availability of data led researchers to develop techniques to mine Stack Overflow posts. The aim is to find and recommend posts with information useful to developers. However, and not surprisingly, not every Stack Overflow post is useful from a developer's perspective. We empirically investigate what the characteristics of "useful" Stack Overflow posts are. The underlying assumption of our study is that posts that were used (referenced in the source code) in the past by developers are likely to be useful. We refer to these posts as leveraged posts. We study the characteristics of leveraged posts as opposed to the non-leveraged ones, focusing on community aspects (e.g., the reputation of the user who authored the post), the quality of the included code snippets (e.g., complexity), and the quality of the post's textual content (e.g., readability). Then, we use these features to build a prediction model to automatically identify posts that are likely to be leveraged by developers. Results of the study indicate that post meta-data (e.g., the number of comments received by the answer) is particularly useful to predict whether it has been leveraged or not, whereas code readability appears to be less useful. A classifier can classify leveraged posts with a precision of 65% and recall of 49% and non-leveraged ones with a precision of 95% and recall of 97%. This opens the road towards an automatic identification of "high-quality content" in Stack Overflow.
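The per-class precision and recall figures above can be read off a confusion matrix. The sketch below shows the arithmetic, assuming hypothetical counts chosen only so that the resulting rates land close to the reported ones; they are not the study's data, and the function name is illustrative.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) for one class from its confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical counts, with "leveraged" as the positive class.
    tp, fn = 49, 51    # leveraged posts: correctly found vs. missed
    fp, tn = 26, 969   # non-leveraged posts: wrongly flagged vs. correctly rejected

    # Leveraged class: positives are leveraged posts.
    lev_p, lev_r = precision_recall(tp=tp, fp=fp, fn=fn)

    # Non-leveraged class: roles swap, so TN acts as the "true positive",
    # FN as the "false positive", and FP as the "false negative".
    non_p, non_r = precision_recall(tp=tn, fp=fn, fn=fp)

    print(f"leveraged:     precision={lev_p:.2f} recall={lev_r:.2f}")
    print(f"non-leveraged: precision={non_p:.2f} recall={non_r:.2f}")
```

With these assumed counts the leveraged class is a small minority, which illustrates why the classifier can score very high on non-leveraged posts while still missing about half of the leveraged ones.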
- Conference Article
10
- 10.1109/msr52588.2021.00040
- May 1, 2021
Stack Overflow is a popular Q&A forum for software developers, providing a large number of copyable code snippets. While GitHub is a collaborative development platform, developers often reuse Stack Overflow code in their GitHub projects. These snippets get revised or edited on each platform. In this work, we study Stack Overflow posts and the code snippets that are reused from these posts in GitHub projects. We investigate and compare the change history of SO snippets with the change history of GitHub snippets. We have applied a stratified random sampling when mining 440,000 GitHub projects to create a dataset representing the change history of the reused snippets; this dataset contains 22,900 GitHub projects, 33,765 Stack Overflow references mapped to 4,634 Stack Overflow posts, and a total of 73,322 commits. We analyze the evolution patterns of snippets on each platform, compare key trends and explore the co-change of these snippets. Our results demonstrate that 76% of snippets evolve on Stack Overflow, while only 22% of the reused code snippets evolve in GitHub. Stack Overflow snippets undergo fewer and smaller changes compared to their evolving counterparts on GitHub. The evolution of snippets on both platforms is driven by the original author of the content. Finally, we found that a small percentage of snippets is co-changing across two platforms, while snippets in GitHub and Stack Overflow evolve independently of one another.
- Research Article
- 10.2139/ssrn.3790616
- Feb 22, 2021
- SSRN Electronic Journal
Downsides of Using Inadequate Open Source Software Processes and Licenses within Standard Development Organizations
- Research Article
2
- 10.4018/joeuc.2017040103
- Apr 1, 2017
- Journal of Organizational and End User Computing
The open-source software (OSS) movement is often analogized as a commons, where products are developed by and consumed in an open community. However, does a larger commons automatically beget success or does the phenomenon fall prey to the tragedy of the commons? This research forwards and empirically investigates the curvilinear relationship between developers and OSS project quality and a project's download volume. Using segmented regression on over 12,000 SourceForge OSS projects, findings suggest an inflection point in the effect of the number of contributing developers on download volume, suggesting increasing and then diminishing returns to scale from adding developers to OSS projects. Findings support the economic principle of the tragedy of the commons, a concept where an over-allocation (large number) of developers, even in an open-source environment, can lead to resource mismanagement and reduce the benefit of a public good, i.e., the OSS project.
- Research Article
14
- 10.1016/j.jss.2021.111113
- Oct 14, 2021
- Journal of Systems and Software
An analysis of open source software licensing questions in Stack Exchange sites
- Book Chapter
- 10.4018/978-1-4666-6485-2.ch005
- Jan 1, 2015
Whereas there are several instances of Open Source Software (OSS) projects that have achieved huge success in the market, a high failure rate has been reported for OSS projects. This study conducts a literature survey to gain insight into existing studies on the success of OSS projects. More specifically, this study seeks to extract the critical success factors for OSS projects. Based on the literature survey in this study, the authors found determinants of success in OSS projects and classified them into three broad categories of project traits, product traits, and network structure. These findings have important implications for both the OSS research community and OSS practitioners.
- Research Article
17
- 10.1016/j.infsof.2022.106849
- May 1, 2022
- Information and Software Technology
Collaboration in software ecosystems: A study of work groups in open environment
- Research Article
2
- 10.5753/jserd.2023.1977
- Jan 18, 2023
- Journal of Software Engineering Research and Development
Software Engineering is a crucial topic in undergraduate computing-related courses and provides the basic knowledge and skills necessary for professional practice in the software industry. Teaching Software Engineering principles, concepts, and practices and relating them to real-world scenarios are challenging tasks, and the adoption of Open Source Software (OSS) projects can help to face these challenges. On the other hand, adopting OSS projects as a didactic resource may introduce additional challenges to instructors who are not familiar with the OSS ecosystem. Objective: In this paper, we identified and mapped the profiles of instructors of Software Engineering courses concerning their classroom practices and use of OSS projects in Software Engineering Education. Method: We surveyed 90 higher education instructors in Brazil to collect data regarding their familiarity with the Software Engineering knowledge areas, pedagogical methods and resources used, and familiarity with and use of OSS projects in the classroom. Then, we resorted to data mining techniques, for instance, K-modes and Decision Tree algorithms, to identify instructors’ characteristics according to their classroom practices and use of OSS projects in the course activities. Results: Our findings include the characterization of instructors who use and instructors that do not use OSS projects in Software Engineering Education and the grouping of instructors after the application of the K-modes algorithm, and after the application of the Decision Tree algorithm, with similar characteristics of the pedagogical practices. The main result of this work is that the familiarity with OSS projects and the use of active learning methods were characteristics present in the application of the K-modes and Decision Tree algorithms, that distinguished instructors who used OSS projects from those that did not use them in Software Engineering Education. 
Finally, we confirmed that familiarity with OSS projects could have a positive influence on the instructors’ interest and potential for adopting this approach in Software Engineering Education.
- Research Article
10
- 10.1016/j.jss.2024.111964
- Jan 8, 2024
- Journal of Systems and Software
An empirical study of code reuse between GitHub and Stack Overflow during software development