A Dynamic Analysis Data Preprocessing Technique for Malicious Code Detection with TF-IDF and Sliding Windows

Mihui Kim, Haesoo Kim

doi:10.3390/electronics13050963

Abstract

Malware detection based on dynamic analysis typically feeds time-series data, such as API call sequences, to deep learning models such as recurrent neural networks (RNNs) to identify malicious activity. However, the API calls made differ from program to program, so the resulting sequences vary in length. To use these data as deep learning input, preprocessing usually unifies the sequence length by adding dummy zeros (zero-padding). When the standard deviation of the sequence lengths is large, the amount of dummy data added grows, making it difficult for the model to capture the characteristics of the real data. This paper therefore proposes a preprocessing technique that uses term frequency–inverse document frequency (TF-IDF) and a sliding window algorithm. A long short-term memory (LSTM) model trained on data preprocessed in this way achieved an accuracy of 95.94%, a recall of 97.32%, a precision of 95.71%, and an F1-score of 96.5%, indicating that the proposed technique is effective.
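
To make the building blocks named in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It shows how zero-padding inflates short sequences, how a TF-IDF weight could be assigned to each API call, and how a sliding window yields fixed-length slices. The abstract does not say how the paper combines TF-IDF with the sliding window, so the API names, the smoothed IDF formula, the window size, and the stride are all illustrative assumptions.

```python
import math
from collections import Counter


def zero_pad(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(s) for s in sequences)
    return [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]


def padding_ratio(sequences):
    """Fraction of the padded matrix that consists of dummy values."""
    max_len = max(len(s) for s in sequences)
    total = max_len * len(sequences)
    real = sum(len(s) for s in sequences)
    return (total - real) / total


def tf_idf(sequences):
    """Per-sequence TF-IDF weights, treating each API call sequence as a document.

    Assumption: TF is the relative frequency within the sequence and IDF is
    the smoothed form ln((1 + N) / (1 + df)) + 1. The paper may use another
    variant.
    """
    n_docs = len(sequences)
    doc_freq = Counter()
    for seq in sequences:
        doc_freq.update(set(seq))  # count each API once per sequence
    weights = []
    for seq in sequences:
        counts = Counter(seq)
        weights.append({
            api: (count / len(seq))
            * (math.log((1 + n_docs) / (1 + doc_freq[api])) + 1)
            for api, count in counts.items()
        })
    return weights


def sliding_windows(values, window, stride=1):
    """Fixed-length slices of values; empty if values is shorter than window."""
    return [
        values[i:i + window]
        for i in range(0, len(values) - window + 1, stride)
    ]


# Hypothetical API call traces for two programs (names are made up).
trace_a = ["open", "read", "write", "close"]
trace_b = ["open", "connect", "send", "send", "send", "close"]
traces = [trace_a, trace_b]

print(padding_ratio(traces))           # share of dummy values after padding
weights = tf_idf(traces)
weighted_b = [weights[1][api] for api in trace_b]
print(sliding_windows(weighted_b, window=3, stride=1))
```

In this sketch every window has the same length, so sequences at least as long as the window need no dummy values; how shorter sequences should be handled is left open here.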

Full Text