TypeFSL: Type Prediction from Binaries via Inter-procedural Data-flow Analysis and Few-shot Learning

Abstract

Type recovery in stripped binaries is a critical and challenging task in reverse engineering, as it underpins many security applications (e.g., vulnerability detection). Traditional analysis methods are limited by software complexity and by emerging types in real-world projects. To address these limitations, machine learning methods have been explored. However, existing supervised learning approaches struggle to analyze complicated and uncommon types due to the limited availability of samples, and none of the existing works capture fine-grained, inter-procedural features in binaries. In this paper, we present TypeFSL, a framework that addresses the challenge of imbalanced type distributions by incorporating few-shot learning and captures inter-procedural semantics through program slicing. On a dataset of 3,003,117 functions, TypeFSL achieves an average of 77.9% and 84.6% accuracy across all architectures and optimization levels in 20-way 5-shot and 10-shot classification tasks, respectively. Our prototype outperforms existing techniques in prediction accuracy and obfuscation resistance. Finally, case studies demonstrate how TypeFSL predicts uncommon and complicated types in practical analysis.
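The N-way K-shot setup mentioned above can be illustrated with a nearest-prototype classifier, a common few-shot scheme. This is a sketch only: the abstract does not describe TypeFSL's actual model, so toy feature vectors stand in for learned slice embeddings, and the type labels are hypothetical.

```python
# Sketch of N-way K-shot nearest-prototype classification, the episodic
# setup used in few-shot learning. Toy 2-d vectors replace whatever
# embeddings the real system would learn from program slices.
import math

def prototype(vectors):
    """Mean of the K support embeddings for one type class."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def classify(query, support):
    """Assign the query to the class with the nearest prototype."""
    protos = {label: prototype(vs) for label, vs in support.items()}
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(protos, key=lambda label: dist(query, protos[label]))

# 2-way 2-shot toy episode with two hypothetical type classes.
support = {
    "struct sockaddr*": [[0.9, 0.1], [1.0, 0.2]],
    "uint64_t":         [[0.1, 0.8], [0.0, 1.0]],
}
print(classify([0.95, 0.15], support))  # prints "struct sockaddr*"
```

With more classes per episode (e.g., 20-way) and more support samples per class (5- or 10-shot), the same nearest-prototype rule applies unchanged, which is why few-shot methods cope with rare types that supervised classifiers lack samples for.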

Similar Papers
  • Research Article
  • Cited by 7
  • 10.1145/258916.258927
Interprocedural dataflow analysis in an executable optimizer
  • May 1, 1997
  • ACM SIGPLAN Notices
  • David W Goodwin

Interprocedural dataflow information enables link-time and post-link-time optimizers to perform analyses and code transformations that are not possible in a traditional compiler. This paper describes the interprocedural dataflow analysis techniques used by Spike, a post-link-time optimizer for Alpha/NT executables. Spike uses dataflow analysis to summarize the register definitions, uses, and kills that occur external to each routine, allowing Spike to perform a variety of optimizations that require interprocedural dataflow information. Because Spike is designed to optimize large PC applications, the time required to perform interprocedural dataflow analysis could potentially be unacceptably long, limiting Spike's effectiveness and applicability. To decrease dataflow analysis time, Spike uses a compact representation of a program's intraprocedural and interprocedural control flow that efficiently summarizes the register definitions and uses that occur in the program. Experimental results are presented for the SPEC95 integer benchmarks and eight large PC applications. The results show that the compact representation allows Spike to compute interprocedural dataflow information in less than 2 seconds for each of the SPEC95 integer benchmarks. Even for the largest PC application containing over 1.7 million instructions in 340 thousand basic blocks, interprocedural dataflow analysis requires just 12 seconds.
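The register-summary idea can be sketched as a worklist fixed point over the call graph: each routine's summary absorbs its callees' definitions until nothing changes. Routine names and register sets here are invented for illustration; the real analysis also tracks uses and kills over a compact flow representation.

```python
# Sketch: compute, for each routine, the registers it may define,
# including definitions made by its (transitive) callees.
local_defs = {"main": {"r0"}, "helper": {"r1"}, "leaf": {"r2", "r3"}}
calls = {"main": {"helper"}, "helper": {"leaf"}, "leaf": set()}

# Reverse call graph: who calls f (needed to propagate changes upward).
callers = {f: set() for f in calls}
for f, callees in calls.items():
    for c in callees:
        callers[c].add(f)

summary = {f: set(d) for f, d in local_defs.items()}
worklist = list(calls)
while worklist:
    f = worklist.pop()
    merged = set(summary[f])
    for c in calls[f]:
        merged |= summary[c]          # absorb callee definitions
    if merged != summary[f]:
        summary[f] = merged
        worklist.extend(callers[f])   # callers must see the change

print(sorted(summary["main"]))  # prints ['r0', 'r1', 'r2', 'r3']
```

At the fixed point, `summary["main"]` conservatively contains every register that calling `main` might define, which is exactly the kind of external-effect summary a link-time optimizer consults before moving code across a call.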

  • Conference Article
  • Cited by 106
  • 10.1145/567446.567455
Interprocedural data flow analysis in the presence of pointers, procedure variables, and label variables
  • Jan 1, 1980
  • William E Weihl

Interprocedural data flow analysis is complicated by the use of procedure and label variables in programs and by the presence of aliasing among variables. In this paper we present an algorithm for computing possible values for procedure and label variables, thus providing a call graph and a control flow graph. The algorithm also computes the possible aliasing relationships in the program being analyzed. We assume that control flow information is not available to the algorithm; hence, this type of analysis may be termed flow-free analysis. Given this assumption, we demonstrate the correctness of the algorithm, in the sense that the information it produces is conservative, and show that it is as precise as possible in certain cases. We also show that the problem of determining possible values for procedure variables is PSPACE-hard. This fact indicates that any algorithm which is precise in all cases must also run very slowly for some programs.
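The flow-free computation of procedure-variable values can be sketched as a fixed point over the program's assignments, ignoring control flow entirely. Variable and procedure names below are hypothetical, and the real algorithm also handles aliasing and label variables.

```python
# Sketch of flow-free analysis: repeatedly propagate "may hold" sets
# across assignments (lhs := rhs) until nothing changes. Because control
# flow is ignored, the result is conservative: every value a procedure
# variable could hold on some path is included.
assignments = [("pv", "p"), ("qv", "pv"), ("qv", "q")]
procedures = {"p", "q"}  # procedure constants

values = {}
changed = True
while changed:
    changed = False
    for lhs, rhs in assignments:
        rhs_vals = {rhs} if rhs in procedures else values.get(rhs, set())
        if not rhs_vals <= values.setdefault(lhs, set()):
            values[lhs] |= rhs_vals
            changed = True

print(sorted(values["qv"]))  # prints ['p', 'q']
```

The resulting sets (`qv` may call `p` or `q`) are precisely what is needed to build a conservative call graph before any flow-sensitive analysis can run.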

  • Research Article
  • 10.1145/3786763
Scaling Inter-procedural Dataflow Analysis on the Cloud
  • Dec 26, 2025
  • ACM Transactions on Programming Languages and Systems
  • Zewen Sun + 12 more

Apart from forming the backbone of compiler optimization, static dataflow analysis has been widely applied in a vast variety of applications, such as bug detection, privacy analysis, and program comprehension. Despite its importance, performing interprocedural dataflow analysis on large-scale programs is well known to be challenging. In this paper, we propose a novel distributed analysis framework supporting general interprocedural dataflow analysis. Inspired by large-scale graph processing, we devise dedicated distributed worklist algorithms for both whole-program analysis and incremental analysis. We implement these algorithms and develop a distributed framework called BigDataflow running on a large-scale cluster. The experimental results validate the promising performance of BigDataflow: it can finish analyzing programs with millions of lines of code in minutes and is substantially more efficient than the state of the art.

  • Conference Article
  • Cited by 69
  • 10.1145/1190216.1190266
Interprocedural analysis of asynchronous programs
  • Jan 17, 2007
  • Ranjit Jhala + 1 more

An asynchronous program is one that contains procedure calls which are not immediately executed from the callsite, but stored and “dispatched” in a non-deterministic order by an external scheduler at a later point. We formalize the problem of interprocedural dataflow analysis for asynchronous programs as AIFDS problems, a generalization of the IFDS problems for interprocedural dataflow analysis. We give an algorithm for computing the precise meet-over-valid-paths solution for any AIFDS instance, as well as a demand-driven algorithm for solving the corresponding demand AIFDS instances. Our algorithm can be easily implemented on top of any existing interprocedural dataflow analysis framework. We have implemented the algorithm on top of BLAST, thereby obtaining the first safety verification tool for unbounded asynchronous programs. Though the problem of solving AIFDS instances is EXPSPACE-hard, we find that in practice our technique can efficiently analyze programs by exploiting standard optimizations of interprocedural dataflow analyses.
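The execution model under analysis can be sketched as follows: posted calls are buffered rather than executed at the call site, and a scheduler later dispatches them in an order of its choosing. Handler names are hypothetical, and a seeded RNG stands in for the scheduler's nondeterminism.

```python
# Sketch of the asynchronous execution model: calls are *posted* to a
# pending buffer and dispatched later, in nondeterministic order.
import random

pending = []
log = []

def post(handler, arg):
    pending.append((handler, arg))   # not executed at the call site

def dispatch_all(rng):
    while pending:
        i = rng.randrange(len(pending))    # scheduler picks any pending call
        handler, arg = pending.pop(i)
        handler(arg)

def h(x):
    log.append(x)

post(h, "a"); post(h, "b"); post(h, "c")
dispatch_all(random.Random(0))
# `log` now holds a, b, c in *some* order
```

A sound analysis must account for every dispatch order the scheduler could choose, which is why the paper generalizes IFDS to AIFDS rather than analyzing one interleaving.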

  • Book Chapter
  • Cited by 7
  • 10.1007/978-3-642-00596-1_31
Interprocedural Dataflow Analysis over Weight Domains with Infinite Descending Chains
  • Jan 1, 2009
  • Morten Kühnrich + 3 more

We study generalized fixed-point equations over idempotent semirings and provide an efficient algorithm for detecting whether a sequence of Kleene's iterations stabilizes after a finite number of steps. Previously known approaches considered only bounded semirings where there are no infinite descending chains. The main novelty of our work is that we deal with semirings without the boundedness restriction. Our study is motivated by several applications from interprocedural dataflow analysis. We demonstrate how the reachability problem for weighted pushdown automata can be reduced to solving equations in the framework mentioned above and we describe a few applications to demonstrate its usability.
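A minimal sketch of the setting, using the (min, +) semiring, whose chains can descend forever along a negative cycle: iterate Kleene steps and report whether a fixed point was reached within a step budget. Note the hedge: the paper's contribution is deciding stabilization exactly, not the bounded iteration shown here.

```python
# Kleene iteration over the (min, +) semiring: ⊕ = min, ⊗ = +.
# Shortest distances from a source stabilize iff no reachable negative
# cycle keeps driving values down an infinite descending chain.
INF = float("inf")

def kleene(edges, n, source, max_steps):
    dist = [INF] * n
    dist[source] = 0                 # semiring "one" at the source
    for _ in range(max_steps):
        new = dist[:]
        for u, v, w in edges:
            new[v] = min(new[v], dist[u] + w)   # one Kleene step
        if new == dist:
            return dist, True        # fixed point: iteration stabilized
        dist = new
    return dist, False               # still descending after the budget

edges = [(0, 1, 4), (0, 2, 1), (2, 1, 1)]
dist, stable = kleene(edges, 3, 0, 10)
print(dist, stable)  # prints [0, 2, 1] True
```

Adding a negative cycle (e.g., an edge `(1, 2, -5)`) would make the same iteration return `stable == False` for any budget, which is exactly the case bounded semirings exclude and this paper handles.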

  • Research Article
  • Cited by 21
  • 10.1145/3363525
Faster Algorithms for Dynamic Algebraic Queries in Basic RSMs with Constant Treewidth
  • Nov 13, 2019
  • ACM Transactions on Programming Languages and Systems
  • Krishnendu Chatterjee + 4 more

Interprocedural analysis is at the heart of numerous applications in programming languages, such as alias analysis, constant propagation, and so on. Recursive state machines (RSMs) are standard models for interprocedural analysis. We consider a general framework with RSMs where the transitions are labeled from a semiring and path properties are algebraic with semiring operations. RSMs with algebraic path properties can model interprocedural dataflow analysis problems, the shortest path problem, the most probable path problem, and so on. The traditional algorithms for interprocedural analysis focus on path properties where the starting point is fixed as the entry point of a specific method. In this work, we consider possible multiple queries as required in many applications such as alias analysis. The study of multiple queries allows us to bring in an important algorithmic distinction between the resource usage of the one-time preprocessing and that of each individual query. The second aspect we consider is that the control flow graphs for most programs have constant treewidth. Our main contributions are simple and implementable algorithms that support multiple queries for algebraic path properties for RSMs that have constant treewidth. Our theoretical results show that our algorithms have small additional one-time preprocessing but can answer subsequent queries significantly faster as compared to the current algorithmic solutions for interprocedural dataflow analysis. We have also implemented our algorithms and evaluated their performance for on-demand interprocedural dataflow analysis on various domains, such as live variable analysis and reaching definitions, on a standard benchmark set. Our experimental results align with our theoretical statements and show that after a lightweight preprocessing, on-demand queries are answered much faster than with the standard existing algorithmic approaches.

  • Research Article
  • Cited by 42
  • 10.1007/bf03036473
On the sequential nature of interprocedural program-analysis problems
  • Aug 1, 1996
  • Acta Informatica
  • Thomas Reps

In this paper, we study two interprocedural program-analysis problems, interprocedural slicing and interprocedural dataflow analysis, and present results which provide evidence that there do not exist fast (NC-class) parallel algorithms for interprocedural slicing and precise interprocedural dataflow analysis (unless P = NC). That is, it is unlikely that there are algorithms for interprocedural slicing and precise interprocedural dataflow analysis for which the number of processors is bounded by a polynomial in the size of the input, and whose running time is bounded by a polynomial in the logarithm of the size of the input. This suggests that there are limitations on the ability to use parallelism to overcome compiler bottlenecks due to expensive interprocedural-analysis computations.

  • Research Article
  • Cited by 124
  • 10.1145/960116.53995
The program summary graph and flow-sensitive interprocedural data flow analysis
  • Jun 1, 1988
  • ACM SIGPLAN Notices
  • D Callahan

This paper discusses a method for interprocedural data flow analysis which is powerful enough to express flow-sensitive problems but fast enough to apply to very large programs. While such information could be applied toward standard program optimizations, the research described here is directed toward software tools for parallel programming, in which it is crucial. Many of the recent “supercomputers” can be roughly characterized as shared memory multi-processors. These include top-of-the-line systems from Cray Research and IBM, as well as multi-processor computers developed and successfully marketed by many younger companies. Development of efficient, correct programs on these machines presents new challenges to the designers of compilers, debuggers, and programming environments. Powerful analysis mechanisms have been developed for understanding the structure of programs. One such mechanism, data dependence analysis, has been evolving for many years. The product of data dependence analysis is a data dependence graph, a directed multi-graph that describes the interactions of program components through shared memory. Such a graph has been shown useful for a variety of applications from vectorization and parallelization to compiler management of locality. Another application of the data dependence graph is as an aid to static debugging of parallel programs. PTOOL [4] is a software system developed at Rice University to help programmers understand parallel programs. It is within this context that we at Rice have learned of the importance of interprocedural data flow analysis. I will briefly describe the PTOOL system and explain the kind of interprocedural information valuable in such an environment. PTOOL is designed to help locate interactions between

  • Research Article
  • Cited by 12
  • 10.1145/1286821.1286829
An improved bound for call strings based interprocedural analysis of bit vector frameworks
  • Oct 1, 2007
  • ACM Transactions on Programming Languages and Systems
  • Bageshri Karkare + 1 more

Interprocedural data flow analysis extends the scope of analysis across procedure boundaries in search of increased optimization opportunities. The call strings based approach is a general approach for performing flow- and context-sensitive interprocedural analysis. It maintains a history of calls along with the data flow information in the form of call strings, which are sequences of unfinished calls. Recursive programs may need infinite call strings for interprocedural data flow analysis. For bit vector frameworks this method is believed to require all call strings of lengths up to 3K, where K is the maximum number of distinct call sites in any call chain. We combine the nature of information flows in bit-vector data flow analysis with the structure of interprocedurally valid paths to bound the call strings. Instead of bounding the length of call strings, we bound the number of occurrences of any call site in a call string. We show that call strings in which a call site appears at most three times are sufficient for convergence on the interprocedural maximum fixed point solution. Though this results in the same worst case length of call strings, it does not require constructing all call strings up to length 3K. Our empirical measurements on recursive programs show that our bound significantly reduces the lengths and the number of call strings, and hence the analysis time.
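The occurrence bound can be sketched directly: rather than capping the length of a call string, refuse to extend it once any single call site already appears three times. Call-site names below are hypothetical.

```python
# Sketch of the occurrence bound on call strings: cap how many times any
# one call site may appear (three, per the paper's result for bit-vector
# frameworks), instead of capping overall length.
from collections import Counter

MAX_OCCURRENCES = 3

def extend(call_string, site):
    """Append a call site, or refuse if it would exceed the bound."""
    if Counter(call_string)[site] >= MAX_OCCURRENCES:
        return None  # further copies cannot change a bit-vector fact
    return call_string + (site,)

s = ()
for site in ["c1", "c1", "c1", "c1"]:  # recursion through one call site
    nxt = extend(s, site)
    if nxt is None:
        break
    s = nxt
print(s)  # prints ('c1', 'c1', 'c1')
```

A string like `(c1, c2, c1, c2, c1)` is still allowed under this rule even though it is longer than three, which is the point: long but occurrence-bounded strings survive while unbounded recursion through one site is cut off.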

  • Conference Article
  • 10.1109/ispa-bdcloud-socialcom-sustaincom52081.2021.00184
Accelerating Data-Flow Analysis with Full-Partitioning
  • Sep 1, 2021
  • Yuantong Zhang + 5 more

Data-flow analysis is a classical way to deal with program optimization and program analysis issues. However, classical iterative data-flow analysis is prone to low efficiency when applied to vulnerability detection, because more exhaustive information is required. Therefore, we propose full-partitioned interprocedural data-flow analysis, in which all analysis work on a program is strictly partitioned by procedure. We also introduce a novel Pointee Objects Intermediate Representation object to replace the real pointees during interprocedural pointer analysis; it solves the representation of pointee objects when interprocedural pointer analysis is full-partitioned. The interprocedural data-flow analysis is realized using function summaries. We have observed a significant increase in efficiency and a good capability to support use-after-free detection.

  • Book Chapter
  • Cited by 47
  • 10.1007/11688839_2
Interprocedural Dataflow Analysis in the Presence of Large Libraries
  • Jan 1, 2006
  • Atanas Rountev + 2 more

Interprocedural dataflow analysis has a large number of uses for software optimization, maintenance, testing, and verification. For software built with reusable components, the traditional approaches for whole-program analysis cannot be used directly. This paper considers component-level analysis of a main component which is built on top of a pre-existing library component. We propose an approach for computing summary information for the library and for using it to analyze the main component. The approach defines a general theoretical framework for dataflow analysis of programs built with large extensible library components, using pre-computed summary functions for library-local execution paths. Our experimental results indicate that the cost of component-level analysis could be substantially lower than the cost of the corresponding whole-program analysis, without any loss of precision. These results present a promising step towards practical analysis techniques for large-scale software systems built with reusable components.

  • Book Chapter
  • Cited by 5
  • 10.1007/978-3-540-88140-7_19
Automatic Transformation for Overlapping Communication and Computation
  • Jan 1, 2008
  • Changjun Hu + 3 more

Message-passing is a predominant programming paradigm for distributed memory systems. RDMA networks like InfiniBand and Myrinet reduce communication overhead by overlapping communication with computation. To make the overlap more effective, we propose a source-to-source transformation scheme that automatically restructures message-passing codes. Extensions to the control-flow graph can accurately analyze the message-passing program and help perform data-flow analysis effectively. This analysis identifies the minimal region between producer and consumer which contains message-passing function calls. Using inter-procedural data-flow analysis, the transformation scheme enables the overlap of communication with computation. Experiments on the well-known NAS Parallel Benchmarks show that for distributed memory systems, versions employing communication-computation overlap are faster than the original programs.
