Incorporating Code Structure and Quality in Deep Code Search

Hao Yu, Yin Zhang, Bin Zhang, Yuli Zhao

doi:10.3390/app12042051

Abstract

Developers usually search for reusable code snippets to improve software development efficiency. Existing code search methods, whether based on full-text matching or on deep learning, have two disadvantages: (1) they ignore structural information of code snippets, such as conditional statements and loop statements, and (2) they ignore quality information of code snippets, such as naming clarity and logical correctness. These disadvantages limit the performance of existing code search methods. In this paper, we propose a novel code search method named Structure and Quality based Deep Code Search (SQ-DeepCS). SQ-DeepCS introduces a code representation method called program slice to represent structural information as well as API usage of code snippets. Meanwhile, SQ-DeepCS introduces a novel deep neural network named Method-Description-Joint Embedding Neural Network (MD-JEnn) to weight the quality of code snippets. To evaluate the proposed methods, we train MD-JEnn and evaluate SQ-DeepCS by searching for code snippets with respect to the top-rated questions from Stack Overflow. We use four evaluation indicators to measure the effectiveness of SQ-DeepCS: FRank, SuccessRate@k, PrecisionRate@k, and Mean Reciprocal Rank (MRR). The experimental results show that our approach can provide better results than existing techniques when searching for relevant code snippets.
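The four indicators named above can be illustrated with a minimal sketch. It assumes the definitions commonly used in the code search literature, which this section does not itself spell out: FRank is the 1-based rank of the first relevant result for a query, SuccessRate@k is the fraction of queries with at least one relevant result in the top k, PrecisionRate@k is the average fraction of relevant results in the top k, and MRR is the mean of 1/FRank over all queries (a query with no relevant result contributes 0). The function names and the boolean relevance-list input format are hypothetical and are not taken from SQ-DeepCS.

```python
from typing import Optional, Sequence

# Each query is represented by a ranked list of booleans:
# relevance[i] is True if the result at rank i + 1 is relevant.
# This input format is an assumption made for illustration.


def frank(relevance: Sequence[bool]) -> Optional[int]:
    """Return the 1-based rank of the first relevant result, or None if there is none."""
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return position
    return None


def success_rate_at_k(queries: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of queries with at least one relevant result in the top k."""
    hits = sum(1 for relevance in queries if any(relevance[:k]))
    return hits / len(queries)


def precision_rate_at_k(queries: Sequence[Sequence[bool]], k: int) -> float:
    """Average over queries of (number of relevant results in the top k) / k."""
    per_query = [sum(relevance[:k]) / k for relevance in queries]
    return sum(per_query) / len(queries)


def mean_reciprocal_rank(queries: Sequence[Sequence[bool]]) -> float:
    """Mean of 1 / FRank over all queries; queries without a hit contribute 0."""
    total = 0.0
    for relevance in queries:
        rank = frank(relevance)
        if rank is not None:
            total += 1.0 / rank
    return total / len(queries)


if __name__ == "__main__":
    # Three made-up queries with five ranked results each.
    example = [
        [False, True, False, True, False],
        [True, False, False, False, False],
        [False, False, False, False, False],
    ]
    print("FRank per query:", [frank(r) for r in example])
    print("SuccessRate@5:", success_rate_at_k(example, 5))
    print("PrecisionRate@5:", precision_rate_at_k(example, 5))
    print("MRR:", mean_reciprocal_rank(example))
```

A lower FRank is better, while higher values are better for the other three indicators. The example data is invented purely to exercise the functions and does not reflect any result reported in the paper.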

Highlights

To develop software efficiently, software developers often find and reuse existing code snippets by searching over professional codebases, such as GitHub [1,2,3,4]
We propose a novel code search method named SQ-DeepCS
The experimental results show that our approach can provide better results than existing techniques when searching for relevant code snippets

Summary

Introduction

Software developers often find and reuse existing code snippets by searching over professional codebases, such as GitHub [1,2,3,4]. Developers submit queries expressed in natural language and expect code snippets that satisfy their needs. Because code snippets and natural language queries are heterogeneous, it is hard to locate code snippets that meet the user's intent [5]. Traditional information retrieval methods for code search are usually based on text vocabulary matching [6]. Lv et al. [7] combined text similarity and API sequence matching and proposed an extended Boolean model named CodeHow. Linstead et al. [8] proposed Sourcerer, a code search tool that combines structural information with text vocabulary information using information retrieval techniques. The performance of traditional text vocabulary-based code search methods is greatly limited.

Objectives

Methods

Results

Conclusion