Abstract
Optimizing the efficiency of ML classifiers for Android malware detection is essential because of the continuous influx of new apps, many of which are redundant. Redundant apps increase time and space complexity, promote model overfitting, and degrade performance during classifier training and retraining. Existing dataset filtering mechanisms often cause slight performance degradation, and current retraining methods are unsustainable over long periods because they cannot handle growing data volumes. To address these challenges, we propose a novel opcode sequence-based redundant sample filtering mechanism that identifies and eliminates similar apps before retraining. This approach reduced the dataset size by 56%, resulting in a 37% reduction in false predictions for apps from 2018. The classifier required updates only for samples whose opcode sequences differed significantly from those in the training dataset, based on an Ochiai coefficient threshold of 0.4. Consequently, only 8% of the apps from 2018 to 2020 were needed for classifier retraining, which improved detection accuracy to 0.94 for apps from 2021. The mechanism also proved sustainable, achieving a detection accuracy of 0.95 for apps from 2024 with a false positive rate (FPR) of just 0.02. Furthermore, it identifies 91% of obfuscated malware apps, demonstrating robustness against advanced evasion techniques.
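To illustrate the idea of threshold-based filtering with the Ochiai coefficient, the following is a minimal sketch and not the authors' implementation. It assumes each app is represented as the set of opcode n-grams extracted from its opcode sequence, and that a candidate counts as sufficiently different when its highest similarity to any training sample is below the threshold. The n-gram representation, the function names `opcode_ngrams`, `ochiai`, and `select_novel`, and the direction of the comparison are all assumptions made for illustration; only the Ochiai coefficient and the 0.4 threshold come from the text above.

```python
from math import sqrt


def opcode_ngrams(opcodes, n=2):
    """Return the set of n-grams in an opcode sequence (assumed representation)."""
    return {tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)}


def ochiai(a, b):
    """Ochiai coefficient of two sets: |A ∩ B| / sqrt(|A| * |B|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def select_novel(candidates, training, threshold=0.4, n=2):
    """Keep candidates whose best match in the training set is below the threshold.

    Assumption: "significantly different" means the maximum Ochiai similarity
    to any training sample is less than `threshold`.
    """
    train_sets = [opcode_ngrams(seq, n) for seq in training]
    novel = []
    for seq in candidates:
        grams = opcode_ngrams(seq, n)
        best = max((ochiai(grams, t) for t in train_sets), default=0.0)
        if best < threshold:
            novel.append(seq)
    return novel


# Toy opcode sequences, invented purely for illustration.
training = [["move", "invoke", "return"]]
candidates = [
    ["move", "invoke", "return"],  # identical to a training sample
    ["const", "add", "goto"],      # shares no n-grams with training
]
novel = select_novel(candidates, training)
```

In this toy run, only the second candidate is retained for retraining, since the first duplicates an existing training sample. A pairwise scan like this is quadratic in the number of apps, so a real pipeline would likely need indexing or hashing to scale.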