Advancing Static Code Analysis With Language-Agnostic Component Identification

Vincent Bushong, Tomas Cerny, Jacob Curtis, Micah Schiewe

doi:10.1109/access.2022.3160485

Abstract

Static code analysis of software systems has proven beneficial for a broad range of domains, including security assessments, coding practice, error detection, and others. However, as modern systems have grown in complexity and heterogeneity over the past few decades, advances in development frameworks have come to dominate how such systems are built. Rather than involving low-level language constructs, these frameworks typically focus on software components, including data entities, controllers, and endpoints. As a result, current code analysis approaches have become unsuitable for analyzing these modern systems due to their focus on low-level constructs in a single language. Thus, code analysis has become a far more complicated endeavor owing to the plethora of languages, frameworks, and design approaches in modern software development. This paper presents a novel approach that removes the dependence on a single language and its low-level constructs. The system’s source code is transformed into an intermediate representation called a language-agnostic abstract-syntax tree. This system representation is then assessed by generalized component parsers that extract relevant high-level information, such as components, from low-level structures. The design of the approach is presented here in detail, along with its evaluation in a case study involving two large, heterogeneous, cloud-native system benchmarks (Java and C++ microservices). The study demonstrates a unified identification approach to determine system data entities and endpoints. Utilizing higher-level constructs, such as components, can advance the current practice of system analysis to better face the broader problems introduced by modern system development practices.

Full Text