Plagiarism Detection of Multi-Threaded Programs via Siamese Neural Networks

Zhenzhou Tian, Cong Gao, Qing Wang, Lingwei Chen, Dinghao Wu

doi:10.1109/access.2020.3021184


Open Access

https://doi.org/10.1109/access.2020.3021184


Abstract

Widespread software plagiarism, whether intentional or unintentional, poses a serious threat to the healthy development of the software industry. To detect such evolving plagiarism, dynamic software birthmark techniques, which offer better resistance to obfuscation, are among the most promising methods. However, because non-deterministic thread scheduling perturbs the execution of multi-threaded programs, existing dynamic approaches optimized for sequential programs can be undermined by this randomness when applied to multi-threaded plagiarism detection. Several thread-aware birthmarking methods have since been proposed to address this issue, but they rely largely on manual feature engineering and empirical observations without any ground-truth training, and thus require domain knowledge, which makes them inflexible to deploy in the wild. Inspired by the success of self-guided optimization using deep neural networks and their superior feature learning ability, in this article we transform multiple execution traces of each multi-threaded program under a specified input into a plain feature matrix and feed it to a deep learning framework, which learns a latent representation that serves as a thread-aware birthmark with better semantic richness and perturbation resistance. Instead of empirically determining plagiarism from a direct birthmark similarity metric, we further build siamese neural networks to supervise birthmark construction, similarity measurement, and decision making. Integrating the proposed method, we develop a system called NeurMPD to perform Neural network-based Multi-threaded program Plagiarism Detection. Experimental results on a public software plagiarism sample set demonstrate that NeurMPD copes better with multi-threaded plagiarism detection than alternative approaches.

Highlights

Open-source software communities and social coding platforms, such as GitHub, Stack Overflow, and CodeShare, have enjoyed explosive growth in recent years.
We explore a novel perspective on dynamic birthmark construction for multi-threaded programs: we take advantage of the superior feature learning ability of deep neural networks (DNNs), transform multiple execution traces of each multi-threaded program under the same input into a plain feature matrix, and feed it to a deep learning framework to learn a latent representation that serves as a thread-aware birthmark.
Comprehensive experiments on a public software plagiarism sample set demonstrate that our plagiarism detection system, NeurMPD, achieves state-of-the-art results and outperforms alternative baselines.
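
The pipeline named in the highlights (several traces per program, one feature matrix, a shared encoder, a similarity score) can be illustrated with a minimal sketch. This is not NeurMPD's architecture, which this page does not specify. The event names, the mean-pooling encoder, the untrained random weights, and every function name below are assumptions chosen only to show the siamese idea of applying one set of weights to both programs.

```python
import math
import random


def traces_to_matrix(traces, vocab, max_len):
    """Encode each trace (a list of event names) as a row of integer ids.

    Id 0 is reserved for padding and for events outside the vocabulary.
    """
    index = {name: i + 1 for i, name in enumerate(vocab)}
    matrix = []
    for trace in traces:
        ids = [index.get(event, 0) for event in trace[:max_len]]
        matrix.append(ids + [0] * (max_len - len(ids)))
    return matrix


class SharedEncoder:
    """Hypothetical encoder: embed events, mean-pool, project, apply tanh.

    Weights are random and untrained; a real system would learn them
    from labeled plagiarism pairs.
    """

    def __init__(self, vocab_size, embed_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.embedding = [[rng.gauss(0.0, 0.1) for _ in range(embed_dim)]
                          for _ in range(vocab_size + 1)]
        self.weights = [[rng.gauss(0.0, 0.1) for _ in range(out_dim)]
                        for _ in range(embed_dim)]

    def encode(self, matrix):
        embed_dim = len(self.weights)
        out_dim = len(self.weights[0])
        # Pool over every non-padding event in every trace of the program.
        ids = [i for row in matrix for i in row if i != 0]
        pooled = [0.0] * embed_dim
        for i in ids:
            for d in range(embed_dim):
                pooled[d] += self.embedding[i][d]
        count = max(len(ids), 1)
        pooled = [value / count for value in pooled]
        return [math.tanh(sum(pooled[d] * self.weights[d][k]
                              for d in range(embed_dim)))
                for k in range(out_dim)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def siamese_score(encoder, matrix_a, matrix_b):
    """Apply the same encoder to both inputs, then compare the birthmarks."""
    return cosine_similarity(encoder.encode(matrix_a), encoder.encode(matrix_b))


# Two runs per program; the suspect shows the same events, interleaved differently.
vocab = ["open", "read", "lock", "unlock", "write", "close"]
original = [["open", "lock", "read", "unlock", "close"],
            ["open", "read", "lock", "unlock", "close"]]
suspect = [["open", "read", "lock", "unlock", "close"],
           ["open", "lock", "read", "unlock", "close"]]
unrelated = [["open", "write", "write", "close"],
             ["open", "write", "close"]]

encoder = SharedEncoder(vocab_size=len(vocab), embed_dim=8, out_dim=4)
score = siamese_score(encoder,
                      traces_to_matrix(original, vocab, 6),
                      traces_to_matrix(suspect, vocab, 6))
other_score = siamese_score(encoder,
                            traces_to_matrix(original, vocab, 6),
                            traces_to_matrix(unrelated, vocab, 6))
```

Mean pooling is used here because it ignores event order, so a reordering caused by thread scheduling leaves the birthmark unchanged. It also discards ordering information that a learned encoder could exploit, so treat it as a stand-in. Training the shared weights on labeled pairs and turning the score into a plagiarism decision are omitted.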

Summary

Introduction

Open-source software communities and social coding platforms, such as GitHub, Stack Overflow, and CodeShare, have enjoyed explosive growth in recent years. This growth has drastically reshaped the software programming ecosystem, allowing developers around the globe to conveniently reuse code snippets and libraries or adapt existing ready-to-use projects during software development [39], [44]. Such apparent benefits attract developers and researchers who legitimately study programming and software structure for extensions and comparisons, but they also tempt some individuals and companies to violate open-source licenses and illegally incorporate others' code into their own commercial products for profit. To put it into perspective, the recent software.

Methods

Results

Conclusion