Code retrieval consists of finding code snippets relevant to a programmer’s query, a task of increasing importance given the ubiquity of software. Although significant progress has been made, current code retrieval solutions still leave room for refinement and improvement. Besides being effective, we argue that solutions to the code retrieval task should be efficient and scalable to ever-growing code repositories. This paper introduces xCoFormer, the first representation-learning-based model in the literature for the code retrieval task that meets all these requirements. Our contributions include: (i) an interactive tag-training method that efficiently places a query closer to its relevant code snippets in the embedding space and (ii) the use of a specialized loss function (N-pair) that is better suited to the code retrieval task. To evaluate our proposal, we conducted experiments with different versions of xCoFormer against several state-of-the-art deep learning models on datasets with different properties. In our experiments, xCoFormer produced the best cost-effectiveness tradeoff: it is as effective as our closest competitor while being thousands of times faster at search time. xCoFormer’s attention mechanism also (partially) meets the desirable requirement of explainability by exposing the “reasoning” behind the model’s predictions. Finally, when used as a “proxy” method and fine-tuned with domain-specific code language models such as Unixcoder and CodeBERT, our proposal achieves effectiveness gains of more than 33% compared to general-purpose language models such as BERT and RoBERTa.
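For context, the N-pair loss mentioned above is typically written, in its standard multi-class formulation, as a softmax-style comparison between a query and one relevant plus several non-relevant code snippets; the sketch below is illustrative and not necessarily the exact variant adopted by xCoFormer:
\[
\mathcal{L}_{\text{N-pair}}\big(q, c^{+}, \{c_i^{-}\}_{i=1}^{N-1}\big) = \log\left(1 + \sum_{i=1}^{N-1} \exp\big(q^{\top} c_i^{-} - q^{\top} c^{+}\big)\right),
\]
where $q$ is the query embedding, $c^{+}$ is the embedding of a relevant code snippet, and $c_i^{-}$ are the embeddings of $N-1$ non-relevant snippets. Minimizing this loss drives $q$ closer to $c^{+}$ than to every $c_i^{-}$ in the embedding space, which is the behavior the training procedure relies on.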