Abstract

Software vulnerabilities are one of the root causes of computer security problems. Traditional static and dynamic analysis methods based on software source code suffer from several deficiencies, such as high false positive rates, high false negative rates, and insufficient capture of semantic information. Applying machine learning, natural language processing, and related technologies to software vulnerability prediction can mitigate these issues. This paper proposes a vulnerability prediction method based on multiple-level N-gram feature extraction and heterogeneous ensemble learning. First, the code is converted to an intermediate representation and a multiple-level N-gram feature generation model is constructed, from which two kinds of N-gram semantic features, at the word level and the character level, with different window sizes and granularities, are extracted to retain the semantic and structural information of the code. Second, TF–IDF is used to construct the vector space model that serves as the input of the prediction model. Because a single classifier is prone to overfitting and poor generalization, five classical machine learning algorithms are benchmarked: naive Bayes (NB), support vector machine (SVM), decision tree (DT), logistic regression (LR), and random forest (RF). The four with better performance (SVM, DT, LR, RF) are then combined as base classifiers in a stacking heterogeneous ensemble to build the vulnerability prediction model. Finally, the proposed method is verified on buffer overflow and resource management vulnerability datasets, where the lowest false positive rate and false negative rate reach 1.58% and 4.06%, respectively.
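
The feature extraction step described above can be illustrated with a minimal sketch in plain Python. It is not the paper's implementation: the function names (`word_ngrams`, `char_ngrams`, `multilevel_features`, `tfidf_vectors`), the regex tokenizer, the window sizes, and the smoothed IDF formula are all assumptions chosen for illustration, and the sketch runs on raw source text rather than on the intermediate representation the method uses.

```python
import math
import re
from collections import Counter


def word_ngrams(code, n):
    """Word-level n-grams over a simple identifier/number/symbol tokenization (assumed)."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(code, n):
    """Character-level n-grams after collapsing runs of whitespace."""
    text = " ".join(code.split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def multilevel_features(code, word_sizes=(1, 2), char_sizes=(3,)):
    """Count word- and char-level n-grams; the window sizes are illustrative assumptions."""
    feats = Counter()
    for n in word_sizes:
        feats.update(f"w{n}:{g}" for g in word_ngrams(code, n))
    for n in char_sizes:
        feats.update(f"c{n}:{g}" for g in char_ngrams(code, n))
    return feats


def tfidf_vectors(samples, **kwargs):
    """Build a TF-IDF vector space model over the combined n-gram vocabulary."""
    counts = [multilevel_features(s, **kwargs) for s in samples]
    vocab = sorted(set().union(*counts))
    n_docs = len(samples)
    # Document frequency: number of samples in which each feature occurs.
    df = Counter(term for c in counts for term in c)
    # Smoothed IDF (an assumed variant; the abstract does not specify one).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append([c[t] / total * idf[t] if total else 0.0 for t in vocab])
    return vocab, vectors


# Two toy code fragments, used only to exercise the functions.
samples = ["strcpy(buf, src);", "strncpy(buf, src, n);"]
vocab, vectors = tfidf_vectors(samples)
print(len(vocab), "features;", len(vectors), "vectors")
```

The resulting vectors could then be passed to the base classifiers of a stacking ensemble. Prefixing each feature with its level and window size (`w1:`, `w2:`, `c3:`) keeps word and character n-grams from colliding in the shared vocabulary.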
