Abstract
Identifying clone code fragments across different languages can enhance the productivity of software developers in several ways. However, the clone detection task is often studied in the context of a single language and less explored for code snippets spanning different languages. In this paper, we present OneSpace, a new cross-language clone detection approach. OneSpace projects different programming languages to the same embedding space using both code and API data. OneSpace, hence, leverages a Siamese Network to infer the similarity of the embedded programs. We evaluate OneSpace by detecting clones across three language pairs; JAVA-Python, Java-C++ and Java-C. We compared OneSpace with the other state-of-art techniques, SupLearn and CLCDSA. In our evaluation, OneSpace provided higher effectiveness than the state of the art. Our ablation study validated some of our intuitions in designing OneSpace, particularly that using a single embedding space (as opposed to separate ones) provides higher effectiveness. Additionally, we designed a variant of OneSpace that uses Word-Mover-Distance Algorithm and provides lower effectiveness, but is much more efficient. We also found that OneSpace provides higher effectiveness than the state of the art, even for: complex implementations, single-method implementations, varying ratios of positive to negative clones in training, varying amounts of training data, and for additional programming languages.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.