Machine Learning as a Mean to Uncover Latent Knowledge from Source Code

Ervina Çergani

doi:10.25534/tuprints-00011658

Abstract

As software grows increasingly complex, its development relies heavily on the reuse of existing libraries. Such libraries expose their functionality through Application Programming Interfaces (APIs), which serve as an effective means of code reuse. However, developers using an API must know how to use it efficiently and correctly in order to deliver simple, clear, comprehensive, and correct software. To help developers work with APIs more efficiently, a family of developer-assistance tools known as Recommender Systems for Software Engineering (RSSEs) has been shown to increase programmers’ productivity. RSSEs are based on API usage patterns learned by analyzing source code, and many approaches have therefore been proposed for learning such patterns from code repositories. However, a major challenge for these approaches is the discovery of latent knowledge in source code. Current approaches rely heavily on program analyses that predefine the learning process, and then use different algorithms to aggregate the detailed information extracted from source code. In this thesis, we aim to redirect the focus toward using advanced machine learning tools to uncover latent knowledge in source code. Machine learning algorithms are known to use general input formats, to be fully automated, and to work well across different domains. Therefore, to investigate the advantages of machine learning approaches and their potential in software engineering, we consider two different dimensions. First, we use the same program analyses as a state-of-the-art method call recommender, and investigate whether replacing its existing learning approach (canopy clustering) with a more powerful machine learning algorithm (Boolean Matrix Factorization, BMF) uncovers additional knowledge that the previous approach could not.
We find that BMF is indeed able to automatically discover the number of clusters needed to represent the object usage space, and that it identifies corner cases (noise) in the data, while reducing model size and improving inference speed without compromising prediction quality. Second, we use an event stream mining algorithm that can automatically learn different code representations (pattern types) without requiring complex domain knowledge to be encoded a priori. We evaluate the quality of the learned patterns in the application context of misuse detection, and compare their performance with that of five state-of-the-art misuse detectors. Our evaluation results show that the learned patterns perform better in terms of precision, by ranking true positives higher among the top findings, and in terms of recall, by detecting more misuses in the source code. Our results provide practical evidence of the positive impact that machine learning tools can have on the field of software engineering, both in automatically discovering latent knowledge in source code and in their comparable (or even better) performance with respect to state-of-the-art approaches.
