With the proliferation of IoT devices, new types and variants of IoT malware are increasing rapidly and causing social problems. Phylogenetic trees of malware have been used in many studies for malware clustering and for better understanding malware evolution. However, for a large-scale malware set, conventional methods for constructing a phylogenetic tree are very time-consuming, or cannot finish in a realistic time at all. To solve this problem, we propose a high-speed, scalable phylogenetic tree construction algorithm together with a clustering algorithm that partitions the resulting tree. The proposed method involves the following steps: (1) Calculating the similarity of specimen pairs using the normalized compression distance. (2) Creating a phylogenetic tree containing all specimens; instead of calculating the similarity of every pair of specimens, our algorithm calculates only a small part of the similarity matrix. (3) Dividing the phylogenetic tree into clusters by applying the minimum description length criterion. In addition, we propose a new online processing algorithm that adds new malware specimens to the existing phylogenetic tree sequentially. Our goal is to reduce the computational cost of constructing the phylogenetic tree and to improve on the clustering accuracy of our previous research. We evaluated our method’s clustering accuracy and scalability with 65,494 IoT malware specimens. The results showed that our algorithm reduced the computation by 97.52% compared with the conventional method. Our clustering algorithm achieved accuracies of 95.5% and 99.3% for clustering by family name and by architecture name, respectively.
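
As an illustration of the similarity measure in step (1), the normalized compression distance between two byte strings x and y is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length. The following is a minimal sketch only: the abstract does not state which compressor is used, so zlib is an assumption here, and the function names `compressed_size` and `ncd` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (zlib is an assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the inputs share most of their content;
    values near 1 suggest they share little.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compressed size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)
```

In practice `x` and `y` would be the raw bytes of two malware specimens. Each call compresses three inputs, so evaluating every pair in a large set is costly, which is why computing only a small part of the similarity matrix matters for scalability.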