Source Code Authorship Attribution Using Hybrid Approach of Program Dependence Graph and Deep Learning Model

Farhan Ullah, Sohail Jabbar, Mamoun Alazab, Junfeng Wang, Fadi Al-Turjman

doi:10.1109/access.2019.2943639

Abstract

Source Code Authorship Attribution (SCAA) is the task of identifying the real author of source code in a corpus. Although it poses a privacy threat to open-source programmers, it can be significantly helpful for developing forensic applications such as ghostwriting detection, copyright dispute settlement, and other code analysis tools. Efficient feature extraction is the key challenge in classifying the real authors of specific source code. In this paper, the Program Dependence Graph with Deep Learning (PDGDL) methodology is proposed to identify authors from source code written in different programming languages. First, the PDG is used to extract control and data dependencies from the source code. Second, a preprocessing technique converts the PDG features into small instances with frequency details. Third, the Term Frequency-Inverse Document Frequency (TFIDF) technique is used to weight the importance of each PDG feature in the source code. Fourth, the Synthetic Minority Over-sampling Technique (SMOTE) is applied to address the class imbalance problem. Finally, a deep learning algorithm is applied to extract coding-style features for each programmer and to attribute the real authors. The deep learning algorithm is further fine-tuned with a dropout layer, learning error rate, loss and activation functions, and dense layers for better accuracy. The proposed work is evaluated on data from 1000 programmers, collected from Google Code Jam (GCJ). The dataset covers three programming languages: C++, Java, and C#. The results outperform existing techniques in terms of classification accuracy, precision, recall, and F-measure.
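The SMOTE step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is the textbook idea of creating a synthetic minority sample by interpolating between an existing sample and one of its nearest minority-class neighbours. The function name `smote_sketch`, its parameters, and the toy feature vectors are assumptions made for illustration.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Return n_new synthetic samples for an under-represented author.

    minority: list of feature vectors (lists of floats) for one author.
    Each synthetic sample lies on the line segment between a randomly
    chosen sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical TFIDF-weighted PDG feature vectors for one author.
minority = [[0.1, 0.0, 0.5], [0.2, 0.1, 0.4], [0.0, 0.3, 0.6]]
new_samples = smote_sketch(minority, n_new=4)
```

Because every synthetic vector is a convex combination of two real ones, it stays inside the region already occupied by that author's samples, which is why the technique is used to balance classes without simply duplicating rows.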

Highlights

Source code authorship attribution is the de-anonymization of programmers from source code fragments using the coding-style features of known authors
Program Dependence Graph (PDG) features may be used to extract hidden patterns in the control flow logic and data variations of different programs. These PDG features are then used as input to the deep learning model to capture coding styles for identifying programmers
Local and global term weighting techniques are used to show the importance of each PDG feature
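As a rough illustration of local and global term weighting, the sketch below computes a textbook TF-IDF score over PDG feature tokens: term frequency is the local weight within one source file, and inverse document frequency is the global weight across the corpus. The exact weighting variant used in the paper is not stated here, so the formula, the function name `tfidf`, and the token strings are assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists, one list of PDG feature tokens per file.

    Returns one dict per file mapping each token to tf * idf, where
    tf = count / file length and idf = ln(number of files / files containing token).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per file

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical PDG feature tokens (control and data dependence edges).
docs = [
    ["ctrl:if->call", "data:x->y", "data:x->y"],
    ["ctrl:if->call", "ctrl:for->assign"],
    ["ctrl:if->call", "data:x->y"],
]
scores = tfidf(docs)
```

In this toy corpus, `ctrl:if->call` appears in every file and so receives a weight of zero, while `ctrl:for->assign` appears in only one file and receives a positive weight, reflecting that rarer dependence patterns are more useful for telling authors apart.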

Summary

INTRODUCTION

Source code authorship attribution is the de-anonymization of programmers from source code fragments using the coding-style features of known authors. This means that a programmer's coding style, or stylistic fingerprint, is preserved in the software compilation process. Source code authorship attribution mainly depends on the extracted features that an author produces in coding structure and variable naming. Generally, the large set of features extracted from source code is not entirely relevant to authorship identification. The proposed research tries to answer the following question: how can different types of source code be learned for authorship attribution, and how can authors be identified for different types of source code? The work addresses source code authorship attribution across programming languages using PDG analysis and a deep learning model. The remainder of the paper is organized as follows: Section 2 covers related work with a discussion of the state of the art, Section 3 presents the proposed methodology, Section 4 gives the experimental details, and Section 5 concludes with future directions.

RELATED WORK

PROPOSED METHODOLOGY

DEEP LEARNING MODEL

EXPERIMENTS

RESULTS ANALYSIS

Findings

CONCLUSION


Journal: IEEE Access	Publication Date: Jan 1, 2019
Citations: 32	License type: CC BY 4.0


