Abstract

Because software development is complex, some developers plagiarize source code from other projects or open source software to shorten the development cycle. Many methods have been proposed to detect plagiarism among programs based on the program dependence graph, a graph representation of a program. However, to the best of our knowledge, existing work detects only similarity between programs and does not determine the copy direction between them. By employing the extreme learning machine (ELM), we construct a feature space that describes the features of every pair of programs with a possible plagiarism relationship. Such a feature space can be large and time-consuming to process, so we propose approaches that construct a small feature space by pruning isolated control statements and removable statements from each program, which reduces both training and classification time. We also analyze the data dependencies between an original program and its copy and, based on this analysis, propose a feedback framework to find a good feature space that achieves both accuracy and efficiency. We conducted a thorough experimental study of this technique on real C programs collected from the Internet. The experimental results show the high accuracy and efficiency of our ELM-based approaches.

Highlights

  • The Internet and open source software are developing rapidly, giving developers easier access to a wide range of open source code

  • Many methods have been proposed to detect plagiarism among programs based on the program dependence graph (PDG for short) [1], a graph representation of a program

  • We propose an extreme learning machine (ELM) based framework to learn this potential similarity and classify PDGs, because ELM is known to be efficient for classification while achieving high accuracy


Summary

Introduction

The Internet and open source software are developing rapidly, giving developers easier access to a wide range of open source code. Many methods have been proposed to detect plagiarism among programs based on the program dependence graph (PDG for short) [1], a graph representation of a program. These methods focus only on detecting similarity, not on the copy direction between two similar programs (or PDGs). We first use ELMs to classify programs into sets of programs with a possible plagiaristic relationship. Based on this classification, we detect copy direction by considering dependencies in the programs. We show that using ELM to classify small PDGs increases classification accuracy and decreases classification time, since a training set of smaller PDGs means a smaller feature space, which shortens both training and classification. This ELM-based approach greatly exceeds the detection ability of existing algorithms.
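To make the classification step more concrete, the following is a minimal sketch of a generic single-hidden-layer ELM classifier, not the authors' implementation. It assumes NumPy is available. The class name `ELM`, the sigmoid activation, and the toy two-dimensional feature vectors are illustrative assumptions; the feature space built from PDG pairs in the paper is not reproduced here. The sketch shows the core idea of ELM: hidden-layer weights are drawn at random and left fixed, and only the output weights are computed, in closed form, by least squares.

```python
import numpy as np


class ELM:
    """Generic extreme learning machine classifier (illustrative sketch)."""

    def __init__(self, n_hidden, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Sigmoid activation of the randomly projected inputs.
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # One-hot target matrix, one column per class.
        T = (y[:, None] == self.classes_[None, :]).astype(float)
        # Hidden weights and biases are random and never trained.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Output weights: least-squares solution via the pseudoinverse.
        self.beta = np.linalg.pinv(H) @ T
        return self

    def predict(self, X):
        H = self._hidden(np.asarray(X, dtype=float))
        return self.classes_[np.argmax(H @ self.beta, axis=1)]


# Hypothetical toy data: each row stands in for the feature vector of one
# program pair; label 1 = possible plagiarism relationship, 0 = unrelated.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_train = np.array([0, 1, 1, 0])

model = ELM(n_hidden=16, seed=0).fit(X_train, y_train)
print(model.predict(X_train))
```

Because training reduces to one pseudoinverse of the hidden-layer output matrix, its cost grows with the number of samples and the dimension of the feature space, which is consistent with the motivation above for keeping the feature space small.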

Preliminary and Background
An ELM-Based Framework for Determining Copy Direction
Achieving High Accuracy with Small Feature Space
Isolated Control Dependence Subgraphs
Determining Copy Direction
Experiments
Related Work
Findings
Conclusion and Future Work