Abstract

The number of software vulnerabilities is increasing year by year. In the era of big data, data-processing software with many users attracts more attention from hackers, so it is essential to improve the efficiency of discovering vulnerabilities in such software. We observed that several limitations of existing vulnerability discovery techniques, namely fuzzing, symbolic execution, and taint analysis, are related to data-processing functions. In fuzzing, the target program performs two types of sanity checks: non-critical checks (NCC) and critical checks (CC). These checks are usually hard to bypass, which leads to low code coverage during fuzzing. In symbolic execution, the constraint solver still struggles with the constraints produced by complex algorithms. In taint analysis, over-tainting and under-tainting remain the key factors limiting the accuracy of the results. To address these problems, it is necessary to identify data-processing functions. Once these functions are identified, we can locate the sanity checks, ease the solving of complex constraints, and understand how taint propagates, all of which assist software vulnerability discovery and analysis. This paper proposes a method called DPFI (data-processing function identification) for identifying data-processing functions with deep neural networks. We collected 37,000 functions from GitHub and evaluated the method on this data set with several neural networks, among which the CNN performed best, with an $F_{1}$-score of 0.90. We then applied the trained model to CGC (Cyber Grand Challenge) data and real-world software. For CGC, we obtained 448 functions from 20 programs, of which 35 were identified as data-processing functions. For real-world software such as FFmpeg, 7zip, and jpeg, the precision reached 0.90 and the $F_{1}$-score was above 0.87 in every case.
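For readers less familiar with the reported metrics, the sketch below shows how an $F_{1}$-score relates to precision and recall. The abstract does not report recall, so the value computed here is only what the standard formula implies if precision is taken to be exactly 0.90 and the $F_{1}$-score exactly 0.87. The function names `f1_score` and `recall_from` are illustrative and do not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def recall_from(precision: float, f1: float) -> float:
    """Invert the F1 formula to get the recall implied by precision and F1."""
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("no valid recall for this precision/F1 pair")
    return f1 * precision / denom


# Assumed inputs: precision = 0.90 and F1 = 0.87, the thresholds quoted
# in the abstract for the real-world software experiments.
implied_recall = recall_from(0.90, 0.87)
print(f"implied recall: {implied_recall:.3f}")
print(f"round-trip F1:  {f1_score(0.90, implied_recall):.3f}")
```

Because $F_{1}$ grows with recall when precision is fixed, an $F_{1}$-score above 0.87 at a precision of exactly 0.90 would correspond to a recall above the value printed here.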

Highlights

  • In the era of big data, a variety of data is produced every second

  • In 2018, among all Windows products affected by vulnerabilities, the Office products accounted for 17% and the Adobe products accounted for 2%

  • We proposed a method for identifying data-processing functions accurately and quickly based on convolutional neural networks

Introduction

In the era of big data, a variety of data is produced every second. With the continuous improvement of computing power, new forms of data processing keep emerging, and people's dependence on data-processing software has gradually increased. We noticed that data-processing software with a large number of users, such as Adobe and Office products, is more likely to attract the close attention of hackers. In 2018, among all Windows products affected by vulnerabilities, the Office products accounted for 17% and the Adobe products accounted for 2%. The Office products and the Adobe products accounted for the highest share of vulnerabilities, exceeding 80% [1].

