Malware classification is a specific and refined task within the broader malware detection problem. Effective classification aids in understanding attack techniques and developing robust defenses, ensuring application security and timely mitigation of software vulnerabilities. The dynamic nature of malware demands adaptive classification techniques that can handle the continuous emergence of new families. Traditionally, this is done by retraining models on all historical samples, which requires significant resources in terms of time and storage. An alternative approach is Class-Incremental Learning (CIL), which focuses on progressively learning new classes (malware families) while preserving knowledge from previous training steps. However, CIL assumes that each class appears only once in training and is not revisited, an assumption that does not hold for malware families, which often persist across multiple time intervals. This leads to shifts in the data distribution for the same family over time, a challenge that is not addressed by traditional CIL methods. We formulate this problem as Temporal-Incremental Malware Learning (TIML), which adapts to these shifts and effectively classifies new variants. To support this, we organize the MalNet dataset, consisting of over a million entries of Android malware data collected over a decade, in chronological order. We first adapt state-of-the-art CIL approaches to meet TIML's requirements, serving as baseline methods. Then, we propose a novel multimodal TIML approach that leverages multiple malware modalities for improved performance. Extensive evaluations show that our TIML approaches outperform traditional CIL methods and demonstrate the feasibility of periodically updating malware classifiers at a low cost. This process is efficient and requires minimal storage and computational resources, with only a slight dip in performance compared to full retraining with historical data.
Read full abstract