Abstract

Machine learning-based (ML-based) methods show promising performance for Android malware detection. However, although the performance of these methods depends heavily on the datasets used, few studies have investigated how dataset-related factors affect them. To partially bridge this gap, we conduct an empirical study of the impact of dataset-related factors on ML-based Android malware detection methods. By examining the differences between datasets in real-world scenarios and those in experimental settings, we identify three dataset factors (i.e., class imbalance, quality, and timeliness) and assess their impact on these methods. We conduct experiments on more than 11K benign and 17K malicious applications. The results show that all three factors introduce significant biases into existing ML-based Android malware detection methods. Based on these results, we derive lessons on how to assess ML-based Android malware detection methods when dataset factors are taken into account.
