Code Clone Detection Research Articles

Background. Code search aims to find the most relevant code snippet in a large codebase for a given natural language query. An accurate code search engine can increase code reuse and improve programming efficiency. The central question in code search is how to represent the semantic similarity between code and query. With the development of pre-trained code models, representing code semantics as numeric feature vectors (embeddings) and measuring semantic similarity by vector distance has replaced traditional string matching. The quality of these semantic representations is critical to downstream tasks such as code search. The current state-of-the-art (SOTA) learning method uses the contrastive learning paradigm, whose objective is to maximize the similarity between matching code and query (positive samples) and minimize the similarity between mismatched pairs (negative samples). To increase the reuse of negative samples, prior contrastive learning approaches store embeddings in a large queue (memory bank).

Problem. There is still considerable room for improvement in how negative examples are used for code search: ① Because negative samples are selected at random, the semantic representations learned by existing models cannot distinguish similar code well. ② Because the semantic vectors in the memory bank are reused from previous inference results and fed directly into the loss calculation without gradient descent, the model cannot effectively learn the semantic information of the negative samples.

Method. To solve these problems, we propose CoCoHaNeRe, a contrastive learning code search model with hard negative mining: ❶ To enable the model to distinguish similar code, we introduce hard negative examples into contrastive training. These are the negative examples in the codebase that are most similar to the positive examples, and are therefore the ones most likely to cause the model to make mistakes. ❷ To improve the learning efficiency of negative samples during training, we include all hard negative examples in the model's gradient descent process.

Result. To verify the effectiveness of CoCoHaNeRe, we conducted experiments on large code search datasets covering six programming languages, as well as on two similar retrieval tasks, code clone detection and code question answering. Experimental results show that our model achieves SOTA performance. In the code search task, the average MRR score of CoCoHaNeRe exceeds that of CodeBERT, GraphCodeBERT, and UniXcoder by 11.25%, 8.13%, and 7.38%, respectively. The model also improves on prior results in code clone detection and code question answering. In addition, our method performs well across different programming languages and pre-trained code models. Furthermore, qualitative analysis shows that our model effectively distinguishes high-order semantic differences between similar code.
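
The two ideas in the Method paragraph can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the cosine similarity measure, the InfoNCE-style loss, and the temperature value are all assumptions chosen for illustration, and the toy vectors stand in for model embeddings. Only the forward loss is computed here; in real training the hard negatives would be re-encoded by the model so that gradients flow through them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_hard_negatives(positive, codebase, k):
    """Return the k codebase embeddings most similar to the positive.

    The positive itself is excluded (by object identity), so the
    result contains only negatives.
    """
    candidates = [c for c in codebase if c is not positive]
    candidates.sort(key=lambda c: cosine(positive, c), reverse=True)
    return candidates[:k]

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss: -log softmax score of the positive pair.

    The temperature default is an illustrative assumption.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]

# Toy embeddings (hypothetical): index 0 is the matching code snippet.
codebase = [
    [1.0, 0.0, 0.0],  # positive
    [0.9, 0.1, 0.0],  # near-duplicate of the positive: a hard negative
    [0.0, 1.0, 0.0],  # unrelated code
    [0.0, 0.0, 1.0],  # unrelated code
]
query = [0.95, 0.05, 0.0]
positive = codebase[0]

hard = mine_hard_negatives(positive, codebase, k=1)
easy = [codebase[3]]  # stands in for a randomly drawn negative

loss_hard = info_nce_loss(query, positive, hard)
loss_easy = info_nce_loss(query, positive, easy)
```

In this toy setup the loss with the mined hard negative is much larger than the loss with the unrelated negative, which is the sense in which hard negatives give the model a stronger training signal than randomly selected ones.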

Abstract. Ethereum, as a leading blockchain platform, has attracted a significant number of practitioners. These practitioners need a platform for communication and collaborative problem-solving, which led to Ethereum Stack Exchange (ESE), a Q&A site dedicated to Ethereum-related issues. While the Q&A site facilitates communication among practitioners, it also introduces new challenges. Practitioners adopt code snippets from Q&A sites to address the problems they encounter, yet the quality of code snippets on ESE remains largely unexplored. Vulnerabilities and gas-inefficient patterns in ESE may spread to code deployed on Ethereum and threaten its regular operation. In this article, we conduct an empirical study of the distribution of vulnerabilities and gas-inefficient patterns in ESE, and we analyze their potential impact on Ethereum. Detecting vulnerabilities and gas-inefficient patterns raises a problem: mainstream smart contract analysis tools require complete source code files for thorough analysis, while code on ESE often consists of incomplete snippets. To address this, we introduce an AST-based code clone detection technique to construct detectable files corresponding to code snippets, which enables us to detect vulnerabilities and gas-inefficient patterns in those snippets. Our findings show that 11.18% of contract-level code snippets and 4.06% of function-level code snippets in ESE have vulnerabilities, and that 27.21% of contract-level code snippets and 17.89% of function-level code snippets contain gas-inefficient patterns. The additional consumption caused by gas-inefficient patterns in ESE is approximately $1,695,002. Based on these findings, we provide recommendations for both ESE and its users, aiming to foster collaborative efforts and create a more reliable Q&A site for practitioners.
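
As a rough illustration of what AST-based clone detection means, the sketch below compares two snippets by their syntax trees after blanking out identifier names. It is only an analogy under stated assumptions: the abstract does not describe the actual algorithm, the study targets Solidity rather than Python, and Python's standard `ast` module is used here purely as a stand-in for a Solidity parser. The names `structural_fingerprint` and `is_clone` are hypothetical.

```python
import ast

class _Normalizer(ast.NodeTransformer):
    """Replace every identifier with a placeholder so only structure remains."""

    def visit_Name(self, node):
        # Variable references: keep the load/store context, drop the name.
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_arg(self, node):
        # Function parameters.
        node.arg = "_"
        self.generic_visit(node)
        return node

    def visit_FunctionDef(self, node):
        # Function names.
        node.name = "_"
        self.generic_visit(node)
        return node

def structural_fingerprint(source):
    """Dump the normalized AST without line/column attributes."""
    tree = _Normalizer().visit(ast.parse(source))
    return ast.dump(tree, include_attributes=False)

def is_clone(a, b):
    """True if two snippets have identical structure up to identifier names.

    This is a coarse check: because all names collapse to one placeholder,
    `a + a` and `a + b` are treated as the same structure.
    """
    return structural_fingerprint(a) == structural_fingerprint(b)

snippet = "def add(a, b):\n    return a + b\n"
renamed = "def plus(x, y):\n    return x + y\n"
different = "def add(a, b):\n    return a * b\n"

same_structure = is_clone(snippet, renamed)      # renamed identifiers only
other_structure = is_clone(snippet, different)   # different operator
```

A matcher of this kind could, in principle, link an incomplete snippet to a complete source file that contains the same structure, which is the role the abstract assigns to clone detection when constructing files that analysis tools can process.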

Articles published on Code Clone Detection

Trained Without My Consent: Detecting Code Inclusion In Language Models Trained on Code

Effective Hard Negative Mining for Contrastive Learning-based Code Search

Evaluating few-shot and contrastive learning methods for code clone detection

Development and benchmarking of multilingual code clone detector

Adaptive Prefix Filtering for Accurate Code Clone Detection in Conjunction with Meta-learning

Semantic Code Clone Detection Based on Community Detection

Compiler-provenance identification in obfuscated binaries using vision transformers

Are the smart contracts on Q&A site reliable?

A Novel Source Code Representation Approach Based on Multi-Head Attention

BinCodex: A comprehensive and multi-level dataset for evaluating binary code similarity detection techniques

A novel code representation for detecting Java code clones using high-level and abstract compiled code representations.

Out of step: Code clone detection for mobile apps across different language codebases

AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual Adaptation for Code Clone Detection

FSD-CLCD: Functional semantic distillation graph learning for cross-language code clone detection

TeReKG: A temporal collaborative knowledge graph framework for software team recommendation

GRRLN: Gated Recurrent Residual Learning Networks for code clone detection

Federated Learning for Software Engineering: A Case Study of Code Clone Detection and Defect Prediction

An empirical study of code reuse between GitHub and stack overflow during software development

Code Clone Detection Based on Bytecode and Twin Neural Networks

A Novel Method for Code Clone Detection Based on Minimally Random Kernel Convolutional Transform

