Fixseeker: An Empirical Driven Graph-based Approach for Detecting Silent Vulnerability Fixes in Open Source Software
Fixseeker is a graph-based approach that detects silent vulnerability-fixing commits in open source software by leveraging correlations between code changes across multiple hunks, outperforming state-of-the-art methods with an average F1 score of 0.818 and significant improvements in other metrics across various languages and datasets.
Open source software (OSS) vulnerabilities pose significant security risks to downstream applications. While vulnerability databases provide valuable information for mitigation, many security patches are released silently in new commits of OSS repositories without explicit indications of their security impact. This makes it challenging for software maintainers and users to detect and address these vulnerabilities. A few approaches exist for detecting vulnerability-fixing commits (VFCs), but most of them rely on commit messages and therefore miss silent VFCs. Other approaches detect silent VFCs based on code change patterns, but they often fail to characterize vulnerability fix patterns and thus lack effectiveness. For example, some approaches analyze each hunk in known VFCs in isolation to learn vulnerability fix patterns; but vulnerability fixes are often associated with multiple hunks, in which case correlations of code changes across those hunks are essential for characterizing the vulnerability fixes. To address these problems, we first conduct a large-scale empirical study on 11,900 VFCs across six programming languages, in which we found that over 70% of VFCs involve multiple hunks with various types of correlations. Based on our findings, we propose Fixseeker, a graph-based approach that extracts the various correlations between code changes at the hunk level to detect silent vulnerability fixes. Our evaluation demonstrates that Fixseeker outperforms state-of-the-art approaches across multiple programming languages, achieving a high F1 score of 0.818 on average in balanced datasets and consistently improving F2 score, AUC-ROC, and AUC-PR scores by 10.64%, 5.34%, and 10.34% on imbalanced datasets compared to the best baseline methods. Our evaluation also indicates the generality of Fixseeker across different vulnerability types and repository sizes.
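The F1 and F2 scores reported above are both instances of the general F-beta measure, which combines precision and recall and weights recall more heavily as beta grows. The following is a minimal sketch of that standard formula, not code from the Fixseeker evaluation; the function name `f_beta` and the sample precision and recall values are illustrative assumptions.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R).

    beta = 1 gives F1 (precision and recall weighted equally);
    beta = 2 gives F2 (recall weighted more heavily than precision).
    """
    if precision == 0.0 and recall == 0.0:
        # Avoid division by zero when the classifier finds nothing correct.
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Hypothetical detector with high recall but modest precision.
    p, r = 0.5, 1.0
    print(f"F1 = {f_beta(p, r, beta=1.0):.4f}")
    print(f"F2 = {f_beta(p, r, beta=2.0):.4f}")
```

On imbalanced data, where missing a true vulnerability fix is usually costlier than a false alarm, F2 is a natural choice because it rewards recall more than F1 does.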
- Conference Article
7
- 10.1145/3650212.3680305
- Sep 11, 2024
Open-source software (OSS) vulnerabilities are increasingly prevalent, emphasizing the importance of security patches. However, in widely used security platforms like NVD, a substantial number of CVE records still lack trace links to patches. Although rank-based approaches have been proposed for security patch tracing, they heavily rely on handcrafted features in a single-step framework, which limits their effectiveness. In this paper, we propose PatchFinder, a two-phase framework with end-to-end correlation learning for better tracing of security patches. In the initial retrieval phase, we employ a hybrid patch retriever that accounts for both lexical and semantic matching based on the code changes and the description of a CVE, narrowing the search space by extracting as candidates those commits that are similar to the CVE descriptions. Afterwards, in the re-ranking phase, we design an end-to-end architecture under the supervised fine-tuning paradigm for learning the semantic correlations between CVE descriptions and commits. In this way, we can automatically rank the candidates based on their correlation scores while maintaining low computation overhead. We evaluated our system against 4,789 CVEs from 532 OSS projects. The results are highly promising: PatchFinder achieves a Recall@10 of 80.63% and a Mean Reciprocal Rank (MRR) of 0.7951. Moreover, the Manual Effort@10 required is curtailed to 2.77, marking a 1.94 times improvement over current leading methods. When applying PatchFinder in practice, we initially identified 533 patch commits and submitted them to the official, 482 of which have been confirmed by CVE Numbering Authorities.
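The ranking metrics named above can be illustrated with a short sketch. It assumes the common definitions: Recall@K as the fraction of CVEs with at least one true patch commit among the top K ranked candidates, and MRR as the mean of the reciprocal rank of the first true patch. The PatchFinder paper may define them slightly differently, and the CVE and commit identifiers below are made up for illustration.

```python
from typing import Dict, List, Set


def recall_at_k(rankings: Dict[str, List[str]],
                gold: Dict[str, Set[str]], k: int) -> float:
    """Fraction of queries (CVEs) with a true patch in the top-k candidates."""
    hits = sum(
        1 for cve, ranked in rankings.items()
        if any(commit in gold[cve] for commit in ranked[:k])
    )
    return hits / len(rankings)


def mean_reciprocal_rank(rankings: Dict[str, List[str]],
                         gold: Dict[str, Set[str]]) -> float:
    """Mean of 1/rank of the first true patch; a query with no hit adds 0."""
    total = 0.0
    for cve, ranked in rankings.items():
        for rank, commit in enumerate(ranked, start=1):
            if commit in gold[cve]:
                total += 1.0 / rank
                break
    return total / len(rankings)


if __name__ == "__main__":
    # Hypothetical ranked candidate commits per CVE and their true patches.
    rankings = {"CVE-A": ["c1", "c2", "c3"], "CVE-B": ["c4", "c5", "c6"]}
    gold = {"CVE-A": {"c2"}, "CVE-B": {"c9"}}
    print(recall_at_k(rankings, gold, k=2))
    print(mean_reciprocal_rank(rankings, gold))
```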
- Research Article
21
- 10.4018/jossp.2010040104
- Apr 1, 2010
- International Journal of Open Source Software and Processes
Programmers often develop software in multiple languages. In an effort to study the effects of programming language fragmentation on productivity—and ultimately on a developer’s problem-solving abilities—the authors present a metric, language entropy, for characterizing the distribution of a developer’s programming efforts across multiple programming languages. This paper presents an observational study examining the project contributions of a random sample of 500 SourceForge developers. Using a random coefficients model, the authors find a statistically (alpha level of 0.001) and practically significant correlation between language entropy and the size of monthly project contributions. Results indicate that programming language fragmentation is negatively related to the total amount of code contributed by developers within SourceForge, an open source software (OSS) community.
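As a rough illustration of the metric, the sketch below assumes that language entropy is the Shannon entropy (in bits) of the share of a developer's contributions written in each language. The abstract does not spell out the formula, so this is an assumption, and the function name and sample line counts are hypothetical.

```python
import math
from typing import Dict


def language_entropy(lines_per_language: Dict[str, int]) -> float:
    """Shannon entropy (bits) of a developer's contribution shares by language.

    0.0 means all work is in one language; higher values mean the effort is
    spread more evenly across more languages.
    """
    total = sum(lines_per_language.values())
    if total <= 0:
        return 0.0
    entropy = 0.0
    for count in lines_per_language.values():
        if count > 0:
            share = count / total
            entropy -= share * math.log2(share)
    return entropy


if __name__ == "__main__":
    # Hypothetical monthly contributions (lines of code) for two developers.
    print(language_entropy({"Java": 1000}))
    print(language_entropy({"Java": 500, "C": 500}))
```

Under this definition, a developer working in a single language has entropy 0, and one splitting work evenly across n languages has entropy log2(n).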
- Conference Article
29
- 10.1109/milcom52596.2021.9652940
- Nov 29, 2021
With the increasing usage of open-source software (OSS) components, vulnerabilities embedded within them are propagated to a huge number of underlying applications. In practice, the timely application of security patches in downstream software is challenging. The main reason is that such patches do not explicitly indicate their security impacts in the documentation, which would be difficult to recognize for software maintainers and users. However, attackers can still identify these "secret" security patches by analyzing the source code and generate corresponding exploits to compromise not only unpatched versions of the current software, but also other similar software packages that may contain the same vulnerability due to code cloning or similar design/implementation logic. Therefore, it is critical to identify these secret security patches to enable timely fixes. To this end, we propose a deep learning-based defense system called PatchRNN to automatically identify secret security patches in OSS. Besides considering descriptive keywords in the commit message (i.e., at the text level), we leverage both syntactic and semantic features at the source-code level. To evaluate the performance of our system, we apply it on a large-scale real-world patch dataset and conduct a case study on a popular open-source web server software - NGINX. Experimental results show that the PatchRNN can successfully detect secret security patches with a low false positive rate.
- Supplementary Content
2
- 10.25903/5c3eb27776753
- Jan 1, 2018
Open source software (OSS) is a collaborative effort. Getting affordable high-quality software with a lower probability of errors or failures is not far away. Thousands of open-source projects (termed repos) are alternatives to proprietary software development. More than two-thirds of companies are contributing to open source. Open source technologies like OpenStack, Docker and KVM are being used to build the next generation of digital infrastructure. An iconic example of OSS is 'GitHub' - a successful social site. GitHub is a hosting platform that hosts repositories (repos) based on the Git version control system. GitHub is a knowledge-based workspace. It has several features that facilitate user communication and work integration. Through this thesis I employ data extracted from GitHub, and seek to better understand the OSS ecosystem, and to what extent each of its deployed elements affects the successful development of the OSS ecosystem. In addition, I investigate a repo's growth over different time periods to test the changing behavior of the repo. From our observations, developers do not follow one development methodology when developing and growing their projects; instead, they tend to cherry-pick from differing available software methodologies. The GitHub API remains the main OSS location engaged to extract the metadata for this thesis's research. This extraction process is time-consuming due to restrictive access limitations (even with authentication). I apply Structural Equation Modelling (SEM) to investigate the relative path relationships between the GitHub-deployed OSS elements, and I determine the path strength contributions of each element to determine the OSS repo's activity level. SEM is a multivariate statistical analysis technique used to analyze structural relationships. This technique is the combination of factor analysis and multiple regression analysis. 
It is used to analyze the structural relationship between measured variables and/or latent constructs. This thesis bridges the research gap around longitudinal OSS studies. It engages large sample-size OSS repo metadata sets, data-quality control, and multiple programming language comparisons. Querying GitHub is not direct (nor simple) yet querying for all valid repos remains important, as sometimes illegal or unrepresentative outlier repos (which may even be quite popular) do arise, and these then need to be removed from each initial OSS's language-specific metadata set. Eight top GitHub programming languages (selected as the most forked repos) are separately engaged in this thesis's research. This thesis observes these eight metadata sets of GitHub repos. Over time, it measures the different repo contributions of the deployed elements of each metadata set. The number of stars provided to the repo delivers a weaker contribution to its software development processes. Sometimes forks work against the repo's progress by generating very minor negative total effects on its commit (activity) level, and by sometimes diluting the focus of the repo's software development strategies. Here, a fork may generate new ideas, create a new repo, and then draw some original repo developers off into this new software development direction, thus retarding the original repo's commit (activity) level progression. Multiple intermittent and minor version releases exert lesser GitHub JavaScript repo commit (or activity) changes because they often involve only slight OSS improvements, and because they only require minimal commit contributions. More commits also bring more changes to documentation, and again the GitHub OSS repo's commit (activity) level rises. There are both direct and indirect drivers of the repo's OSS activity. Pulls and commits are the strongest drivers. 
This suggests creating higher levels of pull requests is likely a preferred prime target consideration for the repo creator's core team of developers. This study offers a big data direction for future work. It allows for the deployment of more sophisticated statistical comparison techniques. It offers further indications around the internal and broad relationships that likely exist between GitHub's OSS big data. Its data extraction ideas suggest a link through to business/consumer consumption, and possibly how these may be connected using improved repo search algorithms that release individual business value components.
- Conference Article
- 10.1109/apsec57359.2022.00032
- Dec 1, 2022
During software maintenance, it is often important to understand the reasons for code changes, so tools are being developed to automatically detect changes due to refactoring. Among these, RefDiff supports multiple programming languages by representing code changes (so-called diffs) by means of a language-independent abstract syntax tree containing nodes for the code parts removed and added during the change. Corresponding nodes, i.e., nodes participating in a refactoring, are matched based on text similarity, which leads to good precision, but the algorithmic limitation of computing text similarity also entails a large number of false negatives. To overcome this, we trained a neural network to classify features in diffs to be used for identifying refactorings. The main contribution of this paper is an approach for encoding differences between nodes in the syntax trees into image data for neural network matching. We have shown that the diff feature matching network not only improves the precision of RefDiff 2.0 to 98.6% and recall to 93.2%, but is also able to support detection tasks in multiple programming languages, with excellent robustness.
- Conference Article
5
- 10.1109/gcce.2016.7800475
- Oct 1, 2016
In the early years of college, students need to learn how to program in any programming language. It is important to see the differences between programming languages and their rules. We propose an interactive self-study system for students to acquire the knowledge they need, from data structures to algorithms, using multiple programming languages. The system gives students exercises whose solutions output an image. The student's image is compared with the correct image stored in the system, and the system judges whether the student's image is correct. The students program the exercises in the selected programming languages. Eventually the system will help students learn multiple programming languages and, especially, how to solve problems regardless of programming language.
- Conference Article
2
- 10.1109/icdsba51020.2020.00098
- Sep 1, 2020
In view of the diversified characteristics of programming languages used by many current data processing algorithms, we have built an integrated platform based on Javaweb whose methods are developed in multiple programming languages. In this article, we discuss the application of some programming languages in data processing algorithms, and how to build a mixed programming environment of Java and other languages. By adopting mixed programming methods that combine Java with other languages, we have designed the business logic layer of the integrated platform to support multiple programming languages, which overcomes the Java programming language's limited support for data processing algorithm programming. It meets the needs of users who only need to use the integrated platform to quickly run different methods written in multiple languages. Application examples and results show that the method of using Java and other languages to build integrated Web projects is efficient and feasible.
- Conference Article
61
- 10.1145/3533767.3534219
- Jul 18, 2022
Automatic Program Repair (APR) aims at fixing buggy source code with less manual debugging efforts, which plays a vital role in improving software reliability and development productivity. Recent APR works have achieved remarkable progress via applying deep learning (DL), particularly neural machine translation (NMT) techniques. However, we observe that existing DL-based APR models suffer from at least two severe drawbacks: (1) Most of them can only generate patches for a single programming language, as a result, to repair multiple languages, we have to build and train many repairing models. (2) Most of them are developed offline. Therefore, they won’t function when there are new-coming requirements. To address the above problems, a T5-based APR framework equipped with continual learning ability across multiple programming languages is proposed, namely ContInual Repair aCross Programming LanguagEs (CIRCLE). Specifically, (1) CIRCLE utilizes a prompting function to narrow the gap between natural language processing (NLP) pre-trained tasks and APR. (2) CIRCLE adopts a difficulty-based rehearsal strategy to achieve lifelong learning for APR without access to the full historical data. (3) An elastic regularization method is employed to strengthen CIRCLE’s continual learning ability further, preventing it from catastrophic forgetting. (4) CIRCLE applies a simple but effective re-repairing method to revise generated errors caused by crossing multiple programming languages. We train CIRCLE for four languages (i.e., C, JAVA, JavaScript, and Python) and evaluate it on five commonly used benchmarks. The experimental results demonstrate that CIRCLE not only effectively and efficiently repairs multiple programming languages in continual learning settings, but also achieves state-of-the-art performance (e.g., fixes 64 Defects4J bugs) with a single repair model.
- Conference Article
29
- 10.1109/cns48642.2020.9162237
- Jun 1, 2020
With the increasing usage of open source software (OSS) in both free and proprietary applications, vulnerabilities embedded in OSS are also propagated to the underlying applications. It is critical to find security patches to fix these vulnerabilities, especially those essential to reduce security risk. Unfortunately, given a security patch, currently there does not exist a way to automatically recognize the vulnerability that is fixed. In this paper, we first conduct an empirical study on security patches by type (i.e., corresponding vulnerability type), using a large-scale dataset collected from the National Vulnerability Database (NVD). Based on analysis results, we develop a machine learning-based system to help identify the vulnerability type of a given security patch. The evaluation results show that our system achieves good performance.
- Research Article
15
- 10.1177/0037549707084490
- Jun 1, 2007
- SIMULATION
Flexible yet efficient execution of heterogeneous simulations benefits from concepts and methods that can support distributed simulation execution and independent model development. To enable formal model specification with submodels implemented in multiple programming languages, we propose a novel approach called the Shared Abstract Model (SAM) approach, which supports simulation interoperability for the class of Parallel Discrete Event System Specification (DEVS) compliant simulation models. Using this approach, models written in multiple programming languages can be executed together using alternative implementations of the Parallel DEVS abstract simulator. In this paper, we describe the SAM concept, detail its specification and exemplify its implementation with two disparate DEVS-simulation engines. We demonstrate the simplicity of integrating simulation of component models written in the programming languages Java, C++ and Visual Basic. We describe a set of illustrative examples that are developed in an integrated DEVSJAVA and Adevs environment. Further, we stage simulation experiments to investigate the execution performance of the proposed approach and compare it with alternatives. We conclude that application domains, in which independently-developed heterogeneous component models consistent with the Parallel DEVS formalism, benefit from a rigorous foundation and are also interoperable across different simulation engines.
- Conference Article
20
- 10.1109/saner53432.2022.00076
- Mar 1, 2022
Nowadays, vulnerabilities in open source software (OSS) are constantly emerging, posing a great threat to application security. Security patches are crucial in reducing the risk of OSS vulnerabilities. However, many of the vulnerabilities disclosed by CVE/NVD are not accompanied by security patches. Previous research has shown that the auxiliary information in CVE/NVD can aid in the matching of a vulnerability to appropriate commits. The state-of-the-art research proposed a rank-based approach based on the multiple dimensions of features extracted from the auxiliary information in CVE/NVD. However, this approach ignores the semantic features in the vulnerability descriptions and commit messages, leaving room for improvement. In this paper, we propose a novel ranking-based approach VCMATCH (Vulnerability-Commit Match). In addition to extracting the shallow statistical features between the vulnerability and the patch commit, VCMATCH extracts the deep semantic features of the vulnerability descriptions and commit messages. Besides, VCMATCH applies three classification models (i.e., XGBoost, LightGBM, CNN) and uses a voting-based rank fusion method to combine the results of the three models to generate a better result. We evaluate VCMATCH with 1,669 CVEs from 10 OSS projects. The experiment results show that VCMATCH can effectively identify security patches for OSS vulnerabilities in terms of Recall@K and Manual Effort@K, and outperforms the state-of-the-art model by a statistically significant margin.
- Conference Article
76
- 10.1109/saner.2016.112
- Mar 1, 2016
Nowadays, most software uses multiple programming languages to implement certain functionalities based on the strengths and weaknesses of different languages. Researchers in the past have studied the impact of independent programming languages on software quality; however, there has been little or no research on the impact of multiple languages on the quality of software. Does the use of multiple languages cause more bugs? Do certain languages, when used with other languages, make software more bug prone? What are the relationships between multi-language usage and various bug categories? In this study, we perform a large scale empirical investigation to provide some answers to these questions. We gather a large dataset consisting of popular projects from GitHub (628 projects, 85 million SLOC, 134 thousand authors, 3 million commits, in 17 languages) to understand the impact of using multiple languages on software quality. We build multiple regression models to study the effects of using different languages on the number of bug fixing commits while controlling for factors such as project age, project size, team size, and the number of commits. Our results show that in general implementing a project with more languages has a significant effect on project quality, as it increases defect proneness. Moreover, we find specific languages that are statistically significantly more defect prone when they are used in a multi-language setting. These include popular languages like C++, Objective-C, and Java. Furthermore, we note that the use of more languages significantly increases bug proneness across all bug categories. The effect is strongest for memory, concurrency, and algorithm bugs.
- Conference Article
12
- 10.1109/issrew.2014.95
- Nov 1, 2014
During the last decade, a paradigm shift has taken place in the software development process. Advances in internet technology have eased software development in distributed environments, irrespective of geographical location. As a result, open source software systems, which serve as key components of critical infrastructures in society, are still expanding. Open source software evolves through the active participation of users, who report bugs and request new features and feature improvements. These active users are distributed across different geographical locations and work towards the evolution of open source software. The code changes due to bug fixes, new features and feature improvements for a given time period are used to predict the possible code changes in the software over the long run (the potential complexity of code changes). It is evident that open source software evolves through these modifications, but the empirical relationship among bug fixes, new features, feature improvements and modifications in the files has remained unexplored until now. In this paper, we have predicted the potential of bugs that can be detected/fixed and of new features and improvements that can be diffused in the software over a period of time. We have quantified the complexity of code changes (entropy) and then predicted it by applying Cobb-Douglas and extended Cobb-Douglas (two-dimensional and three-dimensional) diffusion models. The developed models can be used to determine the quantitative value of the complexity of code changes for reported bugs, new features and feature improvements, in addition to their potential values. This empirical study mathematically models the interaction of a system (the debugging and code change system) with the external open world, which will assist support managers in software maintenance activities and software evolution.
- Research Article
58
- 10.1145/3468854
- Sep 28, 2021
- ACM Transactions on Software Engineering and Methodology
Security patches in open source software, providing security fixes to identified vulnerabilities, are crucial in protecting against cyber attacks. Security advisories and announcements are often publicly released to inform users about potential security vulnerabilities. Although the National Vulnerability Database (NVD) publishes identified vulnerabilities, a vast majority of vulnerabilities and their corresponding security patches remain beyond public exposure, e.g., in the open source libraries that are heavily relied on by developers. As many of these patches exist in open source projects, the problem of curating and gathering security patches can be difficult due to their hidden nature. An extensive and complete security patches dataset could help end-users such as security companies, e.g., in building a security knowledge base, or researchers, e.g., in aiding vulnerability research. To efficiently curate security patches including undisclosed patches at large scale and low cost, we propose a deep neural-network-based approach built upon commits of open source repositories. First, we design and build security patch datasets that include 38,291 security-related commits and 1,045 Common Vulnerabilities and Exposures (CVE) patches from four large-scale C programming language libraries. We manually verify each commit, among the 38,291 security-related commits, to determine if they are security related. We devise and implement a deep learning-based security patch identification system that consists of two composite neural networks: one commit-message neural network that utilizes pretrained word representations learned from our commits dataset and one code-revision neural network that takes code before revision and after revision and learns the distinction on the statement level. Our system leverages the power of the two networks for Security Patch Identification. Evaluation results show that our system significantly outperforms SVM and K-fold stacking algorithms. 
On the combined dataset, our system achieves an F1-score as high as 87.93% and a precision of 86.24%. We deployed our pipeline and learned model in an industrial production environment to evaluate the generalization ability of our approach. The industrial dataset consists of 298,917 commits from 410 new libraries that span a wide range of functionalities. Our experiment results and observations on the industrial dataset show that our approach can identify security patches effectively among open source projects.
- Conference Article
61
- 10.1109/dsn.2019.00056
- Jun 1, 2019
Security patches in open source software (OSS) not only provide security fixes to identified vulnerabilities, but also make the vulnerable code public to the attackers. Therefore, armored attackers may misuse this information to launch N-day attacks on unpatched OSS versions. The best practice for preventing this type of N-day attacks is to keep upgrading the software to the latest version in no time. However, due to the concerns on reputation and easy software development management, software vendors may choose to secretly patch their vulnerabilities in a new version without reporting them to CVE or even providing any explicit description in their change logs. When those secretly patched vulnerabilities are being identified by armored attackers, they can be turned into powerful 0-day attacks, which can be exploited to compromise not only unpatched version of the same software, but also similar types of OSS (e.g., SSL libraries) that may contain the same vulnerability due to code clone or similar design/implementation logic. Therefore, it is critical to identify secret security patches and downgrade the risk of those 0-day attacks to at least n-day attacks. In this paper, we develop a defense system and implement a toolset to automatically identify secret security patches in open source software. To distinguish security patches from other patches, we first build a security patch database that contains more than 4700 security patches mapping to the records in CVE list. Next, we identify a set of features to help distinguish security patches from non-security ones using machine learning approaches. Finally, we use code clone identification mechanisms to discover similar patches or vulnerabilities in similar types of OSS. The experimental results show our approach can achieve good detection performance. A case study on OpenSSL, LibreSSL, and BoringSSL discovers 12 secret security patches.