Symbolic automata for representing big code

Hila Peleg, Sharon Shoham, Hongseok Yang, Eran Yahav

doi:10.1007/s00236-015-0234-1

Abstract

Analysis of massive codebases (big code) presents an opportunity for drawing insights about programming practice and enabling code reuse. One of the main challenges in analyzing big code is finding a representation that captures sufficient semantic information, can be constructed efficiently, and is amenable to meaningful comparison operations. We present a formal framework for representing code in large codebases. In our framework, the semantic descriptor for each code snippet is a partial temporal specification that captures the sequences of method invocations on an API. The main idea is to represent partial temporal specifications as symbolic automata: automata where transitions may be labeled by variables, and a variable can be substituted by a letter, a word, or a regular language. Using symbolic automata, we construct an abstract domain for static analysis of big code, capturing both the partialness of a specification and the precision of a specification. We show interesting relationships between lattice operations of this domain and common operators for manipulating partial temporal specifications, such as building a more informative specification by consolidating two partial specifications, and comparing partial temporal specifications.
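As an informal illustration of the central idea, the following minimal sketch models an automaton whose transitions carry either a concrete API method name or a variable, and shows a variable being substituted by a letter or by a word. It is not the paper's formal construction: the class names `Var` and `SymbolicAutomaton`, the methods `substitute` and `accepts`, and the API methods `open`, `read`, `write`, and `close` are all hypothetical, and substitution by a regular language is left out for brevity.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Var:
    """A transition label standing for unknown behavior (hypothetical name)."""
    name: str


@dataclass
class SymbolicAutomaton:
    initial: int
    accepting: set
    transitions: list  # (source, label, target); label is a str (method) or a Var

    def substitute(self, assignment):
        """Replace each assigned variable edge by a chain of edges spelling a word.

        `assignment` maps variable names to non-empty lists of method names;
        a one-element list is substitution by a single letter. Variables that
        are not in `assignment` are left in place.
        """
        used = {self.initial, *self.accepting}
        for source, _, target in self.transitions:
            used.update((source, target))
        fresh = count(max(used) + 1)  # supply of unused state identifiers

        result = []
        for source, label, target in self.transitions:
            if isinstance(label, Var) and label.name in assignment:
                word = assignment[label.name]
                if not word:
                    raise ValueError("this sketch only supports non-empty words")
                current = source
                for letter in word[:-1]:
                    nxt = next(fresh)
                    result.append((current, letter, nxt))
                    current = nxt
                result.append((current, word[-1], target))
            else:
                result.append((source, label, target))
        return SymbolicAutomaton(self.initial, set(self.accepting), result)

    def accepts(self, word):
        """Simulate the automaton on concrete letters; variable edges never match."""
        states = {self.initial}
        for letter in word:
            states = {t for s, label, t in self.transitions
                      if s in states and label == letter}
        return bool(states & self.accepting)


# Partial specification: open, then some unknown behavior X, then close.
spec = SymbolicAutomaton(
    initial=0,
    accepting={3},
    transitions=[(0, "open", 1), (1, Var("X"), 2), (2, "close", 3)],
)
by_letter = spec.substitute({"X": ["read"]})
by_word = spec.substitute({"X": ["read", "write"]})
```

In this sketch the partial specification accepts no concrete call sequence until `X` is assigned, which is one simple way to picture how a variable marks the unknown part of a specification.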

Full Text