Abstract

Code search is a common activity in software development, and code-to-code search is useful in a wide range of scenarios. Code-to-code search uses a code fragment as the query to find similar code fragments in large corpora. The search results can support software engineering tasks such as search-based code recommendation, data-driven program repair, and software plagiarism detection. To be practical for daily use, code-to-code search must find similar code fragments accurately and efficiently in a large dataset. Some search engines can locate exact copies of code but cannot find syntactic clones. We therefore propose ASTENS-BWA, a novel approach for finding syntactically similar regions between code fragments via tree-based sequence alignment. Source code is transformed into a tree-based sequence that preserves structural information, and a sequence alignment algorithm is then applied to find similar regions. We evaluate ASTENS-BWA on three different tasks. The results demonstrate that our approach can find syntactically similar regions in program code and retrieve similar code fragments quickly and with high accuracy. As a code clone detection tool, ASTENS-BWA reports clone pairs with high recall, but its output requires manual checking to reduce false alarms. ASTENS-BWA is scalable and can report cloned code fragments in seconds for a code corpus on the scale of a million lines of code.

• Novel code clone search approach that scales to large source code repositories.
• Quick search for syntactic clones between code fragments via a tree-based representation.
• Code clone search approach that can outperform state-of-the-art techniques.
• Application of the proposed approach to code recommendation and clone detection.
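
The two steps named in the abstract (turning code into a tree-based sequence, then aligning sequences) can be illustrated with a minimal sketch. This is not the ASTENS-BWA implementation: the abstract does not specify the encoding or the aligner, so the sketch assumes a pre-order walk over Python's `ast` node types as the sequence and swaps in a plain Smith-Waterman local alignment. The function names `tree_sequence` and `local_align` and the scoring parameters are hypothetical.

```python
import ast


def tree_sequence(source):
    """Pre-order list of AST node type names (assumed encoding, not the paper's).

    Identifiers and literal values are dropped, so renamed variables
    yield the same sequence.
    """
    tokens = []

    def visit(node):
        # Load/Store/Del context markers carry no structural information here.
        if isinstance(node, ast.expr_context):
            return
        tokens.append(type(node).__name__)
        for child in ast.iter_child_nodes(node):
            visit(child)

    visit(ast.parse(source))
    return tokens


def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment (a stand-in for the paper's aligner).

    Returns (score, (a_start, a_end), (b_start, b_end)) with half-open ranges.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + step,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the local alignment starts.
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + step:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, best_pos[0]), (j, best_pos[1])


query = "def f(a, b):\n    return a + b\n"
clone = "def g(x, y):\n    return x + y\n"       # same structure, renamed identifiers
other = "for i in range(3):\n    print(i)\n"     # structurally different

q_seq = tree_sequence(query)
clone_score, q_region, c_region = local_align(q_seq, tree_sequence(clone))
other_score, _, _ = local_align(q_seq, tree_sequence(other))
```

In this sketch the renamed clone aligns over its whole sequence, while the unrelated loop scores lower. The quadratic dynamic program is only meant to show the idea and would not scale to the corpus sizes the abstract mentions.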
