Abstract

Finding similar code snippets is a fundamental task in software engineering. Several approaches to this task use statistical language models, which focus on the syntax and structure of code rather than on the deeper semantic information underlying it. In this paper, we propose a Siamese Neural Network that maps code into continuous vector space and tries to capture its semantic meaning. First, an unsupervised pre-training method models each code snippet as a weighted series of word vectors, with the weights fitted by Term Frequency-Inverse Document Frequency (TF-IDF). Then, a Siamese Neural Network is trained to learn semantic vector representations of code snippets. Finally, cosine similarity is used to compute the similarity score between pairs of code snippets. We evaluate our approach on a dataset of functionally similar code. The experimental results show that our method yields some improvement in performance over a method that uses word embeddings alone.
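The pre-training representation and the final scoring step can be illustrated with a minimal sketch. It covers only the TF-IDF-weighted combination of word vectors and the cosine score, not the Siamese network itself. The smoothed IDF formula, the weighted-average pooling, and the function names (`tfidf_weights`, `snippet_vector`, `cosine_similarity`) are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_weights(corpus):
    """Return one {token: tf-idf weight} dict per tokenized snippet.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1. The paper
    does not state which TF-IDF variant it uses.
    """
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            tok: (cnt / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, cnt in counts.items()
        })
    return weights


def snippet_vector(weights, embeddings, dim):
    """TF-IDF-weighted average of the word vectors of one snippet."""
    vec = [0.0] * dim
    total = 0.0
    for tok, w in weights.items():
        if tok not in embeddings:
            continue  # skip tokens without a word vector
        total += w
        for i, x in enumerate(embeddings[tok]):
            vec[i] += w * x
    return [v / total for v in vec] if total > 0 else vec


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

In the approach described above, vectors of this kind would serve as input to the Siamese network, and the cosine score would be computed on the vectors the network learns rather than on the raw weighted averages.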

Highlights

  • Code similarity is often used to measure the degree of similarity between code snippets in terms of text, syntax and semantics

  • We propose a Siamese neural network that extracts semantic features by exploiting the similarity among source code snippets, so that snippets with similar functions are mapped to similar vectors

  • We evaluate our approach on a dataset collected from a programming open judge (OJ) system


Summary

Introduction

Code similarity is often used to measure the degree of similarity between code snippets in terms of text, syntax and semantics. Several efforts have been made to find code similar to a given code snippet. For small code snippets, manually defined or hand-crafted features are used, e.g., by analyzing the overlap among identifiers, operators, operands, lines of code, functions, types, constants and other attributes, or by comparing the abstract syntax trees of two code snippets. This method is a coarse-grained measurement with low accuracy [8]. Code similarity has also been measured by string matching [9], suffix tree matching [10], graph matching [11] and other algorithms.
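To make the surface-level measures mentioned above concrete, the following is a rough sketch of two of them: token-set overlap and sequence matching over tokens. The tokenizer regex and the function names (`tokenize`, `jaccard_overlap`, `string_match_ratio`) are hypothetical, and the sketch is not the method of any of the cited works.

```python
import re
from difflib import SequenceMatcher

# Assumption: a crude lexer that splits code into identifiers/keywords,
# integer literals, and single punctuation or operator characters.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def tokenize(code):
    """Split a code string into a flat list of tokens."""
    return TOKEN_RE.findall(code)


def jaccard_overlap(code_a, code_b):
    """Share of distinct tokens that the two snippets have in common."""
    a, b = set(tokenize(code_a)), set(tokenize(code_b))
    if not a and not b:
        return 1.0  # two empty snippets are treated as identical
    return len(a & b) / len(a | b)


def string_match_ratio(code_a, code_b):
    """Similarity in [0, 1] from matching blocks of the token sequences."""
    matcher = SequenceMatcher(None, tokenize(code_a), tokenize(code_b))
    return matcher.ratio()
```

Both scores depend only on surface tokens. Two snippets that compute the same result with different identifiers score low on token overlap, for example `jaccard_overlap("a = b + c", "x = y + z")`, where only `=` and `+` are shared. This is the kind of limitation that motivates learning semantic representations.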
