Pre-trained Model-based Actionable Warning Identification: A Feasibility Study

Abstract

Actionable Warning Identification (AWI) plays a pivotal role in improving the usability of Static Code Analyzers (SCAs). Currently, Machine Learning (ML)-based AWI approaches, which mainly learn an AWI classifier from labeled warnings, are notably common. However, these approaches still face the problem of restricted performance due to their direct reliance on a limited number of labeled warnings to develop the classifier. Very recently, Pre-Trained Models (PTMs), which are trained on billions of text/code tokens and have been applied with substantial success to various code-related tasks, have emerged as a potential way to address this problem. Nevertheless, the performance of PTMs on AWI has not been systematically investigated, leaving a gap in understanding their pros and cons. In this paper, we are the first to explore the feasibility of applying various PTMs to AWI. By conducting an extensive evaluation on 12K+ warnings involving four commonly used SCAs (i.e., SpotBugs, Infer, CppCheck, and CSA) and three typical programming languages (i.e., Java, C, and C++), we (1) investigate the overall PTM-based AWI performance compared to the state-of-the-art ML-based AWI approach, (2) analyze the impact of three primary aspects (i.e., data preprocessing, model training, and model prediction) in the typical PTM-based AWI workflow, and (3) identify the reasons for the current underperformance of PTMs on AWI, thereby obtaining a series of findings. Based on these findings, we further provide several potential directions for enhancing PTM-based AWI.
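The three-stage PTM-based AWI workflow the abstract mentions (data preprocessing, model training, model prediction) can be sketched roughly as follows. This is an illustrative toy only: a bag-of-words token scorer stands in for the pre-trained model so the sketch stays self-contained, and the warning fields (`rule`, `snippet`) are hypothetical names, not the paper's actual data schema.

```python
# Sketch of the three-stage AWI workflow: (1) data preprocessing,
# (2) model training, (3) model prediction. A real setup would fine-tune
# a pre-trained code model; a toy bag-of-words classifier stands in here.
from collections import Counter

def preprocess(warning):
    """Turn an SCA warning (rule id + source snippet) into a token list."""
    return (warning["rule"] + " " + warning["snippet"]).lower().split()

def train(labeled_warnings):
    """Learn per-token scores: tokens seen in actionable warnings vote +1,
    tokens seen in false alarms vote -1."""
    scores = Counter()
    for warning, actionable in labeled_warnings:
        for tok in preprocess(warning):
            scores[tok] += 1 if actionable else -1
    return scores

def predict(scores, warning):
    """Classify a new warning as actionable iff its tokens score positive."""
    return sum(scores[t] for t in preprocess(warning)) > 0
```

In a real PTM-based pipeline, `preprocess` would extract the warning's surrounding code context, and `train`/`predict` would be replaced by fine-tuning and inference with the pre-trained model.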

Similar Papers
  • Research Article
  • 10.36994/2788-5518-2023-01-05-20
Combined Approaches to Static Code Analysis Using Neural Networks (original title: КОМБІНОВАНІ ПІДХОДИ ДО СТАТИЧНОГО АНАЛІЗУ КОДУ З ВИКОРИСТАННЯМ НЕЙРОННИХ МЕРЕЖ)
  • Jan 1, 2023
  • Інфокомунікаційні та комп’ютерні технології
  • Illia Vokhranov + 1 more

This article presents an overview of possible approaches to applying neural networks in static code analysis. It explores the current state of existing approaches to improving program analysis with machine learning methods, including post-processing of static analysis alerts, preprocessing of source code, and the direct use of machine learning for analyzing source code, and it examines the main directions for applying approaches from each category. Both classical approaches and machine learning methods in program analysis possess distinct strengths and weaknesses that should be considered when implementing them in practice. One of the main theses of this research is that understanding how to combine these approaches, leveraging the flexibility offered by neural networks while maintaining the level of reliability provided by classical algorithms, is crucial for building a high-quality system. The article covers three basic directions for applying neural networks to static source code analysis. The first is specification tuning: refining the output of a 'classic' static code analyzer (removal, clustering, or ranking of warnings, or assistance in manual warning analysis). The second is specification inference: finding specifications hidden in code (feature extraction, feature selection, or behavior-preserving code transformation, e.g., to make the code more suitable for 'classic' static analysis tools). The third is black-box analysis: using only a machine learning model, trained directly on source code, to discover and fix code defects (syntactic, semantic, or security vulnerabilities), assist in manual code review, format code automatically, or find code smells.
The article outlines directions for the future research which will focus on the development and combining of the approaches covered here.

  • Research Article
  • Cited by 1
  • 10.18429/jacow-icalepcs2017-thpha160
JACoW : Experience with static PLC code analysis at CERN
  • Feb 20, 2018
  • Christina Tsiplaki Spiliopoulou + 2 more

The large number of industrial control systems based on PLCs (Programmable Logic Controllers) available at CERN implies a huge number of programs and lines of code. The software quality assurance becomes a key point to ensure the reliability of the control systems. Static code analysis is a relatively easy-to-use, simple way to find potential faults or error-prone parts in the source code. While static code analysis is widely used for general purpose programming languages (e.g. Java, C), this is not the case for PLC program languages. We have analyzed the possibilities and the gains to be expected from applying static analysis to the PLC code used at CERN, based on the UNICOS framework. This paper reports on our experience with the method and the available tools and sketches an outline for future work to make this analysis method practically applicable.

  • Conference Article
  • Cited by 2
  • 10.1109/icodse53690.2021.9648519
Static Code Analysis Tool for Laravel Framework Based Web Application
  • Nov 3, 2021
  • Ranindya Paramitha + 1 more

To increase and maintain web application security, developers can use several methods, one of which is static code analysis. This method can find security vulnerabilities in source code without running the program, and it can be automated with tools, which is considered more efficient than manual review. One method commonly used in static code analysis is taint analysis, which usually relies on source code modeling to prepare the code for detecting untrusted data flows into security-sensitive computations. While this kind of analysis can be very helpful, static code analysis tools for Laravel-based web applications are still quite rare, despite the framework's popularity. Therefore, in this research, we investigate how static (taint) analysis can be used to detect security vulnerabilities and how Laravel-based projects should be modeled to facilitate this analysis. We developed a static analysis tool that models the application's source code using an AST and a dictionary as the base of the taint analysis. The tool first parses the route file of a Laravel project to get a list of controller files; each file in that list is then parsed to build the source code representation before being analyzed with the taint analysis method. Experiments with this tool show that it could detect 13 security vulnerabilities across 6 Laravel-based projects, with one false negative; an ineffective sanitizer was the suspected cause. This also shows that the proposed modeling technique can facilitate taint analysis in Laravel-based projects. For future development and studies, the tool should be tested with more Laravel-based, and even other framework-based, web applications covering a wider range of security vulnerabilities.
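The source-to-sink propagation that taint analysis performs can be illustrated with a minimal sketch. This is not the authors' tool: the straight-line IR, the source/sink/sanitizer names, and the statement forms are all hypothetical simplifications of the general technique.

```python
# Minimal taint propagation over a toy straight-line IR, in the spirit of
# taint analysis: untrusted data from sources flows through assignments
# and is reported when it reaches a security-sensitive sink.
# Statement forms: ("assign", dst, src) and ("call", func, arg).
SOURCES = {"request_input"}    # where untrusted data enters (hypothetical)
SINKS = {"run_query"}          # security-sensitive computations (hypothetical)
SANITIZERS = {"escape"}        # calls assumed to clear taint (hypothetical)

def find_taint_flows(statements):
    tainted, flows = set(), []
    for op, a, b in statements:
        if op == "assign":
            if b in SOURCES or b in tainted:
                tainted.add(a)           # taint propagates to the destination
        elif op == "call":
            if a in SANITIZERS:
                tainted.discard(b)       # simplification: sanitizing in place
            elif a in SINKS and b in tainted:
                flows.append((a, b))     # untrusted data reached a sink
    return flows
```

A real analyzer would work on a control-flow graph rather than a straight-line statement list, but the core propagate-and-check loop is the same idea.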

  • Research Article
  • Cited by 1
  • 10.24193/subbi.2023.1.03
Detecting Programming Flaws in Student Submissions with Static Source Code Analysis
  • Jul 20, 2023
  • Studia Universitatis Babeș-Bolyai Informatica
  • Péter Kaszab + 1 more

Static code analyzer tools can detect several programming mistakes that would lead to run-time errors. Such tools can also detect violations of the conventions and guidelines of a given programming language. Thus, the feedback provided by these tools can be valuable for both students and instructors in computer science education. In our paper, we evaluated over 5000 student submissions from the last two years, written in the C++ and C# programming languages at Eötvös Loránd University Faculty of Informatics (Budapest, Hungary), by executing various static code analyzers on them. From the findings of the analyzers, we highlight some of the most typical and serious issues. Based on these results, we argue for including static analysis of programming submissions in automated and semi-automatic evaluation and grading systems at universities, as it could increase the quality of programming assignments and draw students' attention to otherwise missed bugs and other programming errors. 2010 Mathematics Subject Classification: 68U99, 68Q55, 97Q70. 1998 CR Categories and Descriptors: F.3.2 [Theory of Computation]: Logics and Meanings of Programs – Semantics of Programming Languages; D.3.4 [Software]: Programming Languages – Processors; K.3.2 [Computing Milieux]: Computers and Education – Computer and Information Science Education. Key words and phrases: static code analysis, C++, C#, student submission, computer science education, programming flaw.

  • Research Article
  • 10.11591/ijeecs.v35.i1.pp665-672
Method level static source code analysis on behavioral change impact analysis in software regression testing
  • Jul 1, 2024
  • Indonesian Journal of Electrical Engineering and Computer Science
  • Fredrick Mugambi Muthengi + 3 more

Though a myriad of changes take place in a software system during maintenance, behavioral changes carry the bulk of the reasons for software modifications. In assessing the impact of the changes made in software, static source code analysis plays a key role, yet it can be complex depending on the reason for the expedition. Despite the work done so far, little focus has been directed at the potential of changed-method analysis during static source code analysis in assessing the impact of the changes made in a software system. This study investigates a method-level static source code analysis technique that generates information on the methods affected by changes made in the software. The work analyzed three Java projects. The results indicate an improvement from leveraging knowledge of edited methods in change impact assessment during regression testing: an analysis of the changed methods reveals the level of regression testing that ought to be conducted for the changes made. The approach also enhances code review efforts when assessing operational behavior impacted by the changes.
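The core step the abstract describes, identifying which methods changed between two revisions, can be sketched as follows. This is a generic illustration, not the authors' technique: method extraction is stubbed out as a name-to-body dict, where a real tool would parse the Java source.

```python
# Sketch of method-level change detection: compare the method bodies of
# two revisions and report which methods changed, i.e. the candidates
# that regression testing should target.
import hashlib

def fingerprint(body):
    # Collapse whitespace runs so formatting-only edits are not flagged.
    canon = " ".join(body.split())
    return hashlib.sha256(canon.encode()).hexdigest()

def changed_methods(old, new):
    """Return names of methods whose normalized body differs between the
    revisions, plus methods added in the new revision."""
    return sorted(
        name for name, body in new.items()
        if name not in old or fingerprint(old[name]) != fingerprint(body)
    )
```

The returned names would then feed a change impact analysis, e.g. by intersecting them with the call graph of the test suite.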

  • Conference Article
  • Cited by 1
  • 10.1109/ai4i51902.2021.00030
Efficient Binary Static Code Data Flow Analysis Using Unsupervised Learning
  • Sep 1, 2021
  • James Obert + 1 more

The ever-increasing need to ensure that code is reliably, efficiently and safely constructed has fueled the evolution of popular static binary code analysis tools. In identifying potential coding flaws in binaries, tools such as IDA Pro are used to disassemble the binaries into an opcode/assembly language format in support of manual static code analysis. Because of the highly manual and resource-intensive nature of analyzing large binaries, the probability of overlooking potential coding irregularities and inefficiencies is quite high. In this paper, a lightweight, unsupervised data flow methodology is described which uses highly correlated data flow graphs (CDFGs) to identify coding irregularities such that analysis time and required computing resources are minimized. These accuracy and efficiency gains are achieved through a combination of graph analysis and unsupervised machine learning techniques, which allows an analyst to focus on the most statistically significant flow patterns while performing binary static code analysis.

  • Research Article
  • Cited by 2
  • 10.1142/s1793351x2220001x
Efficient Binary Static Code Data Flow Analysis Using Unsupervised Learning
  • Aug 15, 2022
  • International Journal of Semantic Computing
  • James Obert + 1 more

The ever-increasing need to ensure that code is reliably, efficiently and safely constructed has fueled the evolution of popular static binary code analysis tools. In identifying potential coding flaws in binaries, tools such as IDA Pro are used to disassemble the binaries into an opcode/assembly language format in support of manual static code analysis. Because of the highly manual and resource-intensive nature of analyzing large binaries, the probability of overlooking potential coding irregularities and inefficiencies is quite high. In this paper, a lightweight, unsupervised data flow methodology is described which uses highly correlated data flow graphs (CDFGs) to identify coding irregularities such that analysis time and required computing resources are minimized. These accuracy and efficiency gains are achieved through a combination of graph analysis and unsupervised machine learning techniques, which allows an analyst to focus on the most statistically significant flow patterns while performing binary static code analysis.

  • Conference Article
  • Cited by 3
  • 10.13016/m2qqac-ik0h
Analyzing False Positive Source Code Vulnerabilities Using Static Analysis Tools
  • Dec 1, 2018
  • Foteini Cheirdari + 1 more

Static source code analysis for the detection of vulnerabilities may generate a huge number of results, making it difficult to verify all of them manually. In addition, static code analysis yields a large number of false positives; consequently, software developers may ignore its results. This paper analyzes the results of static code analysis tools to identify false positive trends per tool. The novel idea is to help developers and analysts assess the likelihood that a finding is an actual true positive. The paper proposes an algorithm that makes use of a new critical feature, a personal identifier, which assists in labeling findings correctly as true or false. Experiments verified the identification of true positives with a higher level of accuracy.
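The idea of exploiting per-tool false-positive trends can be sketched as a simple triage ranker. This is not the paper's algorithm (in particular, its "personal identifier" feature is not modeled here); it only illustrates how historical labels per (tool, rule) pair could prioritize new findings.

```python
# Sketch of ranking new findings by historical false-positive rates:
# findings from (tool, rule) pairs that were mostly true positives in the
# past are surfaced first for manual review.
from collections import defaultdict

def fp_rates(history):
    """history: iterable of (tool, rule, was_true_positive) triples."""
    tp, total = defaultdict(int), defaultdict(int)
    for tool, rule, is_tp in history:
        total[(tool, rule)] += 1
        tp[(tool, rule)] += is_tp
    return {k: 1 - tp[k] / total[k] for k in total}

def rank_findings(findings, rates):
    """Order (tool, rule, id) findings so likely true positives come first;
    unseen (tool, rule) pairs get a neutral 0.5 false-positive rate."""
    return sorted(findings, key=lambda f: rates.get((f[0], f[1]), 0.5))
```

A developer reviewing a budget of N findings would then start from the front of the ranked list instead of an arbitrary tool ordering.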

  • Book Chapter
  • 10.1007/978-3-642-28305-5_12
A Set of Java Metrics for Software Quality Tree Based on Static Code Analyzers
  • Jan 1, 2012
  • Ciprian-Bogdan Chirilă + 1 more

Assessing software quality allows cost cuts from the early development stages. Software quality information helps in making development decisions, checking the effect of fault corrections, and estimating maintenance effort. Our fault-density-based quality model relies on static source code analyzers and a set of language-specific metrics. We compute the fault ratio for each static analyzer rule. By assigning user-defined weights to the fault ratios, we can quantify quality as a single number. We identified, described informally, and implemented in a prototype a set of Java metrics in order to fulfill our model and accomplish our quality assessment goal.
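The weighted fault-ratio aggregation described above can be sketched in a few lines. The mapping of the weighted penalty into a (0, 1] score is an assumption for illustration; the chapter's actual quality formula is not reproduced here.

```python
# Sketch of a fault-density quality model: each static analyzer rule
# yields a fault ratio (violations per unit of code), and user-defined
# weights per rule combine the ratios into a single quality number.
def fault_ratios(violations_per_rule, code_size):
    """violations_per_rule: rule -> violation count; code_size: e.g. kLOC."""
    return {rule: n / code_size for rule, n in violations_per_rule.items()}

def quality_score(ratios, weights):
    """Higher weighted fault density means lower quality. Unlisted rules
    default to weight 1. The 1/(1+penalty) mapping is an assumption made
    here to squeeze the unbounded penalty into (0, 1]."""
    penalty = sum(r * weights.get(rule, 1.0) for rule, r in ratios.items())
    return 1.0 / (1.0 + penalty)
```

With this shape, a defect-free project scores 1.0 and the score decays as weighted fault density grows.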

  • Research Article
  • Cited by 60
  • 10.1016/j.procs.2020.04.217
A Comparative Study of Static Code Analysis tools for Vulnerability Detection in C/C++ and JAVA Source Code
  • Jan 1, 2020
  • Procedia Computer Science
  • Arvinder Kaur + 1 more


  • Conference Article
  • Cited by 10
  • 10.1145/3590777.3590780
New Tricks to Old Codes: Can AI Chatbots Replace Static Code Analysis Tools?
  • Jun 14, 2023
  • Omer Said Ozturk + 4 more

The prevalence and significance of web services in our daily lives make it imperative to ensure that they are – as much as possible – free from vulnerabilities. However, developing a complex piece of software free from any security vulnerabilities is hard, if not impossible. One way to progress towards achieving this holy grail is by using static code analysis tools to root out any common or known vulnerabilities that may accidentally be introduced during the development process. Static code analysis tools have significantly contributed to addressing the problem above, but are imperfect. It is conceivable that static code analysis can be improved by using AI-powered tools, which have recently increased in popularity. However, there is still very little work in analysing both types of tools’ effectiveness, and this is a research gap that our paper aims to fill. We carried out a study involving 11 static code analysers, and one AI-powered chatbot named ChatGPT, to assess their effectiveness in detecting 92 vulnerabilities representing the top 10 known vulnerability categories in web applications, as classified by OWASP. We particularly focused on PHP vulnerabilities since it is one of the most widely used languages in web applications. However, it has few security mechanisms to help its software developers. We found that the success rate of ChatGPT in terms of finding security vulnerabilities in PHP is around 62-68%. At the same time, the best traditional static code analyser tested has a success rate of 32%. Even combining several traditional static code analysers (with the best features on certain aspects of detection) would only achieve a rate of 53%, which is still significantly lower than ChatGPT’s success rate. Nonetheless, ChatGPT has a very high false positive rate of 91%. In comparison, the worst false positive rate of any traditional static code analyser is 82%. 
These findings highlight the promising potential of ChatGPT for improving the static code analysis process but reveal certain caveats (especially regarding accuracy) in its current state. Our findings suggest that one interesting possibility to explore in future works would be to pick the best of both worlds by combining traditional static code analysers with ChatGPT to find security vulnerabilities more effectively.

  • Book Chapter
  • Cited by 28
  • 10.1007/978-3-540-85563-7_50
Ontology-Based Design Pattern Recognition
  • Sep 3, 2008
  • Damir Kirasić + 1 more

This paper presents ontology-based architecture for pattern recognition in the context of static source code analysis. The proposed system has three subsystems: parser, OWL ontologies and analyser. The parser subsystem translates the input code to AST that is constructed as an XML tree. The OWL ontologies define code patterns and general programming concepts. The analyser subsystem constructs instances of the input code as ontology individuals and asks the reasoner to classify them. The experience gained in the implementation of the proposed system and some practical issues are discussed. The recognition system successfully integrates the knowledge representation field and static code analysis, resulting in greater flexibility of the recognition system.

  • Conference Article
  • 10.1109/compsac.2019.00080
Integrating Static Code Analysis Toolchains
  • Jul 1, 2019
  • Matthias Kern + 6 more

This paper proposes an approach for a tool-agnostic and heterogeneous static code analysis toolchain in combination with an exchange format. This approach enhances both traceability and comparability of analysis results. State-of-the-art toolchains support features for either test execution and build automation or traceability between tests, requirements and design information. Our approach combines all those features and extends traceability to the source code level, incorporating static code analysis. As part of our approach we introduce the "ASSUME Static Code Analysis tool exchange format" that facilitates the comparability of different static code analysis results. We demonstrate how this approach enhances the usability and efficiency of static code analysis in a development process. On the one hand, our approach enables the exchange of results and evaluations between static code analysis tools. On the other hand, it enables complete traceability between requirements, designs, implementation, and the results of static code analysis. Within our approach we also propose an OSLC specification for static code analysis tools and an OSLC communication framework.
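The core idea of a common exchange format can be sketched as normalizing tool-specific findings into one shared record schema. The field names below are hypothetical and this is emphatically not the actual ASSUME format, only an illustration of why such normalization makes results from different tools comparable and traceable.

```python
# Hypothetical minimal exchange record for static-analysis results.
# Normalizing every tool's findings into one schema makes them
# comparable across tools and linkable to requirements.
import json

def to_common_record(tool, finding):
    """Map a tool-specific finding dict onto the shared schema."""
    return {
        "tool": tool,
        "rule": finding["rule"],
        "file": finding["file"],
        "line": finding["line"],
        "severity": finding.get("severity", "unknown"),
        "trace_ids": finding.get("requirements", []),  # requirement links
    }

def merge_reports(reports):
    """reports: tool name -> list of findings. Returns one sorted JSON blob
    in which findings from all tools sit side by side."""
    records = [to_common_record(t, f) for t, fs in reports.items() for f in fs]
    records.sort(key=lambda r: (r["file"], r["line"], r["tool"]))
    return json.dumps(records, indent=2)
```

Sorting by location rather than by tool is what lets a reviewer see at a glance when two analyzers flag the same line.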

  • Research Article
  • Cited by 11
  • 10.3390/technologies9010003
Enhanced Bug Prediction in JavaScript Programs with Hybrid Call-Graph Based Invocation Metrics
  • Dec 30, 2020
  • Technologies
  • Gábor Antal + 3 more

Bug prediction aims at finding source code elements in a software system that are likely to contain defects. Being aware of the most error-prone parts of the program, one can efficiently allocate the limited testing and code review resources. Therefore, bug prediction can support software maintenance and evolution to a great extent. In this paper, we propose a function-level JavaScript bug prediction model based on static source code metrics, with the addition of a hybrid (static and dynamic) code analysis based metric of the number of incoming and outgoing function calls (HNII and HNOI). Our motivation is that JavaScript is a highly dynamic scripting language for which static code analysis might be very imprecise; therefore, using purely static source code features for bug prediction might not be enough. Based on a study in which we extracted 824 buggy and 1943 non-buggy functions from the publicly available BugsJS dataset for the ESLint JavaScript project, we can confirm the positive impact of hybrid code metrics on the prediction performance of the ML models. Depending on the ML algorithm, applied hyper-parameters, and target measures we consider, hybrid invocation metrics bring a 2–10% increase in model performance (i.e., precision, recall, F-measure). Interestingly, replacing the static NOI and NII metrics with their hybrid counterparts HNOI and HNII in itself improves model performance; however, using them all together yields the best results.
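The hybrid invocation metrics can be sketched as merging statically and dynamically observed call edges before counting incoming and outgoing calls per function. This is an assumption-laden illustration of the idea behind HNII/HNOI, not the paper's extraction pipeline: the toy edge lists stand in for output from a static analyzer and an instrumented test run.

```python
# Sketch of hybrid invocation metrics in the spirit of HNII/HNOI:
# union the statically and dynamically observed call-graph edges,
# then count incoming and outgoing calls per function.
def hybrid_metrics(static_edges, dynamic_edges):
    """Edges are (caller, callee) pairs. Returns func -> (n_in, n_out)."""
    edges = set(static_edges) | set(dynamic_edges)
    funcs = {f for edge in edges for f in edge}
    return {
        f: (sum(1 for _, callee in edges if callee == f),   # hybrid NII
            sum(1 for caller, _ in edges if caller == f))   # hybrid NOI
        for f in sorted(funcs)
    }
```

The union captures calls a static analyzer misses in dynamic JavaScript (e.g. through `eval` or dynamic property access) while keeping statically visible calls that a particular test run never exercises.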

  • Research Article
  • Cited by 1
  • 10.18523/2617-3808.2020.3.27-30
Building and Storing in Graph Database Neo4j Abstract Semantic Graph of PHP Applications Source Code
  • Dec 28, 2020
  • NaUKMA Research Papers. Computer Science
  • Trokhym Babych + 1 more

Static code analysis is a very important stage in the development and implementation of software, and it needs to be used to obtain better code. The most complicated part of static analysis is the analysis of the source code itself, which can be done multiple times while expanding the set of necessary grammar. The main goal was, and remains, to provide a solution that reduces the time needed for a global re-evaluation of code-level static analysis after a change. The main problem when using static code analysis is building the abstract semantic graph, because each software solution is provided with a separate data warehouse. The proposed solution uses a graph database as the repository. Thus, the storage mechanisms for the created abstract semantic graph have been simplified, which, in addition to increasing the clarity of the stored information, provides convenient ways to work further with it. The developed solution can transform a rather large source repository into a graph representation and maintain it later. The approach is found to be suitable for carrying out code compliance checks and for performing static analysis tests on a graph representation; it also uses advanced file-level detailing, accelerating static analysis.
Based on our measurements, the frameworks are fast enough to help their users quickly change the repository of code. This article confirms the thesis that an abstract semantic graph can be stored in a graph database and, after refinement, if it contains sufficient transformations and requests for language processing, can become a complete transport for communication between various static analysis tools, which usually perform one of two functions (verification either for quality or for vulnerabilities), thereby unifying the creation of the abstract semantic graph. As an improvement, incremental analysis should be considered, i.e., analyzing only the changes in the code, in order to minimize the resource costs of the rather resource-intensive construction of the abstract semantic graph. Manuscript received 09.06.2020.
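Turning a parse tree into the node-and-edge form a graph database expects can be sketched as follows. As an assumption for runnability, Python's own `ast` module stands in for a PHP parser, and the output is plain node/edge tuples rather than actual Neo4j write queries.

```python
# Sketch of converting source code into a property graph of AST nodes,
# the shape one would bulk-insert into a graph database such as Neo4j.
import ast

def ast_to_graph(source):
    """Return (nodes, edges): nodes as (id, node type) tuples and edges as
    (parent id, child id) tuples, ready for a graph store."""
    tree = ast.parse(source)
    nodes, edges, ids = [], [], {}
    for node in ast.walk(tree):          # assign a stable integer id per node
        ids[id(node)] = len(nodes)
        nodes.append((len(nodes), type(node).__name__))
    for node in ast.walk(tree):          # record parent -> child edges
        for child in ast.iter_child_nodes(node):
            edges.append((ids[id(node)], ids[id(child)]))
    return nodes, edges
```

Once stored, the same structure supports the graph queries the article mentions, e.g. matching rule-violation patterns as subgraph queries instead of re-parsing the sources.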
