Debiasing Android Malware Datasets: How Can I Trust Your Results If Your Dataset Is Biased?

Tomás Concepción Miranda,Pierre Wilke,Jean-François Lalande,Valérie Viet Triem Tong,Pierre-Francois Gimenez

doi:10.1109/tifs.2022.3180184

Abstract

Android security has received a lot of attention over the last decade, especially malware investigation. Researchers attempt to highlight applications’ security-relevant characteristics to better understand malware and effectively distinguish malware from benign applications. The accuracy and the completeness of their proposals are evaluated experimentally on malware and goodware datasets. Thus, the quality of these datasets is of critical importance: if the datasets are outdated or not representative of the studied population, the conclusions may be flawed. We specify different types of experimental scenarios. Some of them require unlabeled but representative datasets of the entire population. Others require datasets labeled with valuable characteristics that may be difficult to compute, such as malware datasets. We discuss the irregularities of datasets used in experiments, questioning the validity of the performances reported in the literature. This article focuses on providing guidelines for designing debiased datasets. First, we propose guidelines for building representative datasets from unlabeled ones. Second, we propose and experiment a debiasing algorithm that, given a biased labeled dataset and a target representative dataset, builds a representative and labeled dataset. Finally, from the previous debiased datasets, we produce datasets for experiments on Android malware detection or classification with machine learning algorithms. Experiments show that debiased datasets perfom better when classifying with machine learning algorithms.

Full Text