A Method of Extracting Malware Features Based on Gini Impurity Increment and Improved TF-IDF

Yashu Liu, Shimiao Sun

doi:10.34028/iajit/20/3/14

Abstract

In recent years, the quantity and variety of malware have grown explosively, which poses many challenges for identification and detection. To improve the efficiency of identifying malicious code, a malicious code feature representation method based on feature dimension reduction is proposed. By fusing the Gini impurity increment with the Improved Term Frequency-Inverse Document Frequency (ITF-IDF) algorithm, the ΔGini-ITFIDF method is presented, which extracts more valuable assembly instruction features for family detection. ΔGini-ITFIDF standardizes the assembly instructions of the PE disassembly files, then measures two indicators of each malicious code assembly instruction feature, the expected error rate increment and the weight, and obtains more valuable features for identifying malicious code. The experimental results show that the classification accuracy of the ITF-IDF algorithm is significantly improved compared with the TF-IDF algorithm. At the same time, ΔGini-ITFIDF can effectively improve the classification performance.
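The two indicators named in the abstract can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not confirm: each sample is a list of standardized opcode mnemonics, the split is on presence or absence of an opcode, and the weight is the textbook TF-IDF (the authors' improved ITF-IDF weighting is not specified here, so it is not reproduced). The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_increment(docs, labels, term):
    """Impurity reduction from splitting samples on presence of `term`.

    Assumption: a presence/absence split; the paper may define the
    increment differently.
    """
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    weighted = (len(present) / n) * gini(present) + (len(absent) / n) * gini(absent)
    return gini(labels) - weighted


def mean_tf_idf(docs, term):
    """Mean textbook TF-IDF of `term` over all samples (not ITF-IDF)."""
    n_docs = len(docs)
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0
    idf = math.log(n_docs / df)
    mean_tf = sum(d.count(term) / len(d) for d in docs if d) / n_docs
    return mean_tf * idf


# Toy data: four samples from two hypothetical families, A and B.
docs = [
    ["mov", "push", "call"],
    ["mov", "push", "push"],
    ["mov", "xor", "jmp"],
    ["mov", "xor", "xor"],
]
labels = ["A", "A", "B", "B"]

for term in ["mov", "push", "xor"]:
    print(term, gini_increment(docs, labels, term), mean_tf_idf(docs, term))
```

In this toy data, `mov` appears in every sample, so both indicators are zero for it, whereas `push` separates the two families perfectly and scores highest on the impurity indicator. How the two indicators are fused into a single ranking is left open, since the abstract does not state it.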
