Improve Representation for Cross-Language Clone Detection by Pretrain Using Tree Autoencoder

Huading Ling, Aiping Zhang, Changchun Yin, Dafang Li, Mengyu Chang

doi:10.32604/iasc.2022.027349

Abstract

With the rise of deep learning in recent years, many code clone detection (CCD) methods, including those for cross-language CCD, use deep learning techniques and achieve promising results. However, deep learning techniques require a dataset to train the models. Because datasets for cross-language CCD are difficult to collect, the available dataset is typically small and differs from real-world clones. This creates a data bottleneck problem: limited data scale and quality can prevent even a well-designed model from reaching its full potential. To mitigate this, we propose a tree autoencoder (TAE) architecture. It is pretrained in an unsupervised manner on the abstract syntax trees (ASTs) of a large-scale dataset, and the trained encoder is then fine-tuned on the downstream CCD task. The proposed TAE consists of a tree Long Short-Term Memory (LSTM) encoder and a tree LSTM decoder. We design a novel embedding method for AST nodes that includes type embedding and value embedding. For training the TAE, we present an “encode and decode by layers” strategy and a node-level batch size design. For the CCD dataset, we propose a negative sampling method based on probability distribution. Experimental results on two datasets verify the effectiveness of our embedding method and show that the TAE and its pretraining enhance the performance of the CCD model. The node context information is well captured, and the reconstruction accuracy of node values reaches 95.45%. TAE pretraining improves CCD performance with a 4% increase in F1 score, which alleviates the data bottleneck problem.
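The abstract mentions AST nodes that carry a type and a value, and a strategy that works "by layers". As a rough illustration only, the sketch below groups the nodes of an AST by depth and records a (type, value) pair for each node. It uses Python's standard `ast` module as a stand-in parser; the helper names `node_value` and `nodes_by_layer`, the choice of which nodes carry a value, and the reading of "layer" as depth from the root are all assumptions, not details taken from the paper.

```python
import ast


def node_value(node):
    """Return a printable value for nodes that carry one, else None.

    Which node kinds count as having a value is an assumption of this sketch.
    """
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return repr(node.value)
    if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
        return node.name
    if isinstance(node, ast.arg):
        return node.arg
    return None


def nodes_by_layer(source):
    """Group AST nodes by depth: layer 0 is the root, layer 1 its children, etc.

    Each node is represented as a (type name, value) tuple.
    """
    layers = []
    frontier = [ast.parse(source)]
    while frontier:
        layers.append([(type(n).__name__, node_value(n)) for n in frontier])
        # The next layer is every direct child of every node in this layer.
        frontier = [c for n in frontier for c in ast.iter_child_nodes(n)]
    return layers


if __name__ == "__main__":
    code = "def add(a, b):\n    return a + b\n"
    for depth, layer in enumerate(nodes_by_layer(code)):
        print(depth, layer)
```

Grouping by depth means all nodes in one layer can be handled together before moving to the next, which is one plausible way a layer-wise encoder or decoder could batch its work; the paper's actual scheme may differ.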


Journal: Intelligent Automation & Soft Computing
Publication Date: Jan 1, 2022
License type: cc-by

Similar Papers

Parallel and Distributed Code Clone Detection using Sequential Pattern Mining
Ali El-Matarawy ... Reem Bahgat
International Journal of Computer Applications | VOL. 62
18 Jan 2013

Code Clone Detection Method Based on the Combination of Tree-Based and Token-Based Methods
Ryota Ami ... Hirohide Haga
Journal of Software Engineering and Applications | VOL. 10
01 Jan 2017

Insights into Deep Learning and Non-Deep Learning Techniques for Code Clone Detection
Ajinkya Kunjir
08 May 2024

Java Code Clone Detection by Exploiting Semantic and Syntax Information From Intermediate Code-Based Graph
Dawei Yuan ... Zhou Xu
IEEE Transactions on Reliability | VOL. 72
01 Jun 2023
