A Source Code Similarity Based on Siamese Neural Network

Chunli Xie, Mengqi Wang, Cheng Qian, Xia Wang

doi:10.3390/app10217519

Abstract

Finding similar code snippets is a fundamental task in software engineering. Several approaches to this task use statistical language models, which focus on the syntax and structure of code rather than the deeper semantic information underlying it. In this paper, we propose a Siamese Neural Network that maps code into continuous vector space and tries to capture its semantic meaning. First, an unsupervised pre-training method models each code snippet as a weighted series of word vectors, with the weights fitted by Term Frequency-Inverse Document Frequency (TF-IDF). Then, a Siamese Neural Network is trained to learn semantic vector representations of code snippets. Finally, cosine similarity is used to measure the similarity score between pairs of code snippets. We have implemented our approach on a dataset of functionally similar code. The experimental results show that our method yields some improvement in performance over a single word embedding method.
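The first and last steps of the abstract (TF-IDF-weighted word vectors, then cosine similarity) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the smoothed IDF formula, the toy 3-dimensional vectors in `TOY_VECTORS`, and the pre-tokenized snippets are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {token: tf-idf weight} dict per tokenized snippet.

    Uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, which is one common
    variant; the source does not state which formula the paper uses.
    """
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            tok: (cnt / len(doc)) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, cnt in counts.items()
        })
    return weights


def weighted_embedding(weights, word_vectors, dim):
    """TF-IDF-weighted average of the word vectors of one snippet."""
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, w in weights.items():
        vec = word_vectors.get(tok)
        if vec is None:  # skip tokens without a word vector
            continue
        for i in range(dim):
            total[i] += w * vec[i]
        weight_sum += w
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical word vectors; a real setup would load pre-trained embeddings.
TOY_VECTORS = {
    "for": [1.0, 0.0, 0.0],
    "while": [0.9, 0.1, 0.0],
    "sum": [0.0, 1.0, 0.0],
    "+=": [0.0, 0.9, 0.1],
    "i": [0.5, 0.5, 0.0],
    "print": [0.0, 0.0, 1.0],
    "hello": [0.0, 0.1, 0.9],
}

# Hypothetical pre-tokenized snippets: two loops and one print statement.
snippets = [
    ["for", "i", "in", "range", "sum", "+=", "i"],
    ["while", "i", "<", "n", "sum", "+=", "i"],
    ["print", "hello"],
]

embeddings = [weighted_embedding(w, TOY_VECTORS, 3) for w in tfidf_weights(snippets)]
loop_vs_loop = cosine_similarity(embeddings[0], embeddings[1])
loop_vs_print = cosine_similarity(embeddings[0], embeddings[2])
```

With these toy vectors the two loop snippets score higher against each other than the loop scores against the print snippet, which is the behavior the weighted representation is meant to support.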

Highlights

Code similarity is often used to measure the degree of similarity between code snippets in terms of text, syntax, and semantics
We propose a Siamese neural network that extracts semantic features by exploiting the similarity among source codes, so that code snippets with similar functions are mapped to similar vectors
We evaluate our approach on a dataset collected from a programming open judge (OJ)
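The weight-sharing idea behind a Siamese network can be illustrated with a forward pass only. The text here does not give the paper's architecture, so the single dense layer with `tanh`, the layer sizes, and the random untrained weights in the hypothetical `SiameseEncoder` class below are assumptions chosen purely for illustration.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SiameseEncoder:
    """One dense layer with tanh, applied to both inputs with the same weights."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # Untrained random weights; training is out of scope for this sketch.
        self.weights = [
            [rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
            for _ in range(out_dim)
        ]
        self.bias = [0.0] * out_dim

    def encode(self, x):
        """Map an input vector to its learned representation."""
        return [
            math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.weights, self.bias)
        ]

    def similarity(self, x1, x2):
        """Encode both inputs with the shared weights, then compare by cosine."""
        return cosine(self.encode(x1), self.encode(x2))


model = SiameseEncoder(in_dim=3, out_dim=4)
snippet_a = [0.2, 0.7, 0.1]  # hypothetical snippet vectors
snippet_b = [0.1, 0.1, 0.9]
score_ab = model.similarity(snippet_a, snippet_b)
score_ba = model.similarity(snippet_b, snippet_a)
```

Because both branches use the same parameters, the score is symmetric in its arguments and identical inputs always receive the maximum score. Training would adjust the shared weights so that functionally similar snippets land close together.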

Summary

Introduction

Code similarity is often used to measure the degree of similarity between code snippets in terms of text, syntax, and semantics. Several efforts have been made to find similar code for a given code snippet. For small code snippets, manually defined or hand-crafted features are used, e.g., analyzing the overlap among identifiers, operators, operands, lines of code, functions, types, constants, and other attributes, or comparing the abstract syntax trees of two code snippets. This method is a coarse-grained measurement with low accuracy [8]. The similarity of code is measured by string matching [9], suffix tree matching [10], graph matching [11], and other algorithms.
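To make the hand-crafted overlap idea concrete, here is a small sketch that scores two snippets by the Jaccard overlap of their token sets. The regular-expression tokenizer and the choice of Jaccard are assumptions for illustration; they are not taken from the cited works.

```python
import re

# Identifiers/keywords, integer literals, or any other single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Split source text into a flat list of lexical tokens."""
    return TOKEN_RE.findall(code)


def jaccard_similarity(code_a, code_b):
    """Size of the shared token set divided by the size of the combined set."""
    tokens_a, tokens_b = set(tokenize(code_a)), set(tokenize(code_b))
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Two snippets that compute the same sum but are written differently.
builtin_sum = "total = sum(xs)"
manual_loop = "t = 0\nfor x in xs: t += x"
overlap_score = jaccard_similarity(builtin_sum, manual_loop)
```

The two snippets share only a couple of tokens, so the overlap score is low even though they do the same thing. This is the kind of coarse-grained result that motivates learning semantic representations instead.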



Journal: Applied Sciences
Publication Date: Oct 26, 2020
Citations: 15
License type: CC BY 4.0
