Abstract
Code clone detection refers to the discovery of identical or similar code fragments in the code repository. AST-based, PDG-based, and DL-based tools can achieve good results on detecting near-miss clones (i.e., clones with small differences or gaps) by using syntax and semantic information, but they are difficult to apply to large code repositories due to high time complexity. Traditional token-based tools can rapidly detect clones by the low-cost index (i.e., low frequency or k-lines tokens) on sequential source code, but most of them have the poor capability on detecting near-miss clones because of the lack of semantic information.In this study, we propose a fast yet accurate code clone detection tool with the semantic token, called CCStokener. The idea behind the semantic token is to enhance the detection capability of token-based tool via complementing the traditional token with semantic information such as the structural information around the token and its dependency with other tokens in form of n-gram. Specifically, we extract the type of relevant nodes in the AST path of every token and transform these types into a fixed-dimensional vector, then model its semantic information by applying n-gram on its related tokens. Meanwhile, our tool adopts and improves the location–filtration–verification process also used in CCAligner and LVMapper, during which process we build the low-cost k-tokens index to quickly locate the candidate code blocks and speed up detection efficiency. Our experiments show that CCStokener achieves excellent accuracy on detecting more near-miss clone pairs, which exhibits the best recall on Moderately Type-3 clones and detects more true positive clones on four java open-source projects. Moreover, CCStokener attains the best generalization and transferability compared with two DL-based tools (i.e., ASTNN, TBCCD).
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.