In-Memory Database Support for Source Code Search and Analytics

Oleksandr Panchenko

doi:10.1109/wcre.2011.60

Abstract

Software engineers have to deal with a large amount of information about source code. Appropriate tools could help them handle it, but existing tools cannot process and present such a large amount of information adequately. With the advent of in-memory column-oriented databases, the performance of some data-intensive applications has been significantly improved. This has resulted in a completely new user experience for those applications and has enabled new use cases. This PhD thesis investigates the applicability of in-memory column-oriented databases to supporting daily software engineering activities. The major research question addressed in this thesis is as follows: does in-memory column-oriented database technology provide the necessary performance advantages for working interactively with large amounts of fine-grained structural information about source code? To investigate this research question, two scenarios have been selected that particularly suffer from low performance.

The first scenario is source code search. Existing source code repositories contain a large amount of structural data, such as interface definitions, abstract syntax trees, and call graphs. Existing tools have addressed the performance problems by reducing the amount of data through a coarse-grained representation, by preparing answers to developers' questions in advance, or by reducing the scope of the search. All of these alternatives reduce developers' productivity.

The second scenario is source code analytics. To complete reverse engineering tasks, software engineers often have to analyze a number of atomic facts that have been extracted from source code. Examples of such atomic facts are occurrences of certain syntactic patterns in code, software product metrics, and violations of development guidelines. Each fact typically has several characteristics, such as the type of the fact, the location in the code where it was found, and some attributes. The analysis of large software systems in particular requires the ability to process a large number of such facts efficiently.

Industrial experiments conducted for this thesis provided evidence that in-memory technology delivers performance gains that improve developers' productivity and enable scenarios that were previously not possible. This thesis spans both software engineering and database technology. From the viewpoint of software engineering, it seeks a way to support developers in dealing with a large amount of structural data. From the viewpoint of database technology, source code search and analytics are domains for studying fundamental issues of storing and querying structural data.
