Abstract

Potentially Harmful Apps (PHAs), like any other type of malware, are a problem in the modern Android ecosystem. Even though Google tries to maintain a clean app ecosystem, the Google Play Store is still one of the main vectors for spreading PHAs. In this paper, we propose a solution based on machine learning algorithms to detect PHAs inside application markets. Because application markets are one of the main entry vectors, a solution capable of detecting PHAs that have been submitted, or are in the process of being submitted, to those markets is needed. The proposed solution can be used as a filtering method to automatically block the publication of novel PHAs. It is based on static analysis of applications. Although several static analysis solutions have already been developed, the innovation of this system lies in its training and in the creation of its dataset. We have created a new dataset that uses as its criterion the lifespan of an application inside Google Play: the shorter the time an application stays active inside an application market, the higher the probability that it is a PHA. This criterion was adopted to avoid relying on antivirus engines, and inheriting their bias, when labelling malware. By using lifespan as the criterion, we obtain a new detection method that does not replicate any existing antivirus engine. Experimental results show that this solution obtains a 90% accuracy score, using a dataset of 91,203 applications published on the Google Play Store. Although this accuracy is lower than that of other machine learning models focused on detecting PHAs, it should be taken into account that this is a complementary and different method. The presented work can be combined with other static and dynamic machine learning models, since its training is drastically different, being based on lifespan measurements.
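The lifespan criterion described in the abstract can be illustrated with a minimal sketch. The abstract does not state the threshold or the exact labelling rule, so the 90-day cutoff, the `AppRecord` type, and the choice to label only removed applications are all assumptions made for illustration, not the authors' procedure.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical cutoff: the abstract gives no threshold, 90 days is an assumption.
LIFESPAN_THRESHOLD_DAYS = 90


@dataclass
class AppRecord:
    """Hypothetical record of one application observed in a market."""
    package: str
    published: date
    removed: Optional[date]  # None if the app was still listed when observed


def lifespan_days(app: AppRecord, observed: date) -> int:
    """Days the app was active, up to its removal or the observation date."""
    end = app.removed if app.removed is not None else observed
    return (end - app.published).days


def label_pha(app: AppRecord, observed: date,
              threshold: int = LIFESPAN_THRESHOLD_DAYS) -> int:
    """Return 1 (likely PHA) for apps removed before the threshold, else 0.

    Assumption: apps that are still listed are labelled 0, even if they
    were published recently and therefore have a short observed lifespan.
    """
    removed_early = (app.removed is not None
                     and lifespan_days(app, observed) < threshold)
    return int(removed_early)


observed_on = date(2022, 1, 1)
short_lived = AppRecord("com.example.short", date(2021, 1, 1), date(2021, 1, 20))
long_lived = AppRecord("com.example.long", date(2021, 1, 1), None)
labels = [label_pha(a, observed_on) for a in (short_lived, long_lived)]
```

Labels produced this way could then serve as training targets for a classifier over static-analysis features, without consulting any antivirus engine.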

Highlights

  • Malware detection techniques are constantly evolving due to the necessity of detecting the presence of malware

  • We present a novel method of detection based on lifespan measurements that can be used for detecting malware in application markets

  • The model trained with the XGB algorithm reaches 89% accuracy, while the Random Forest Classification (RFC) model achieves 90% accuracy with a false positive rate of 5.43%
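For reference, the two metrics quoted in the highlights follow the standard definitions: accuracy is the fraction of all predictions that are correct, and the false positive rate is the fraction of benign applications flagged as PHAs. The sketch below computes both from binary labels. The toy label vectors are made up for illustration and are not data from the paper.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int],
                     y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels where 1 means PHA."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + fp + tn + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Benign apps wrongly flagged, divided by all benign apps."""
    return fp / (fp + tn)


# Toy labels for illustration only (not results from the paper).
y_true = [1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, tn, fn)
fpr = false_positive_rate(fp, tn)
```

In a market-filtering setting the false positive rate matters alongside accuracy, because each false positive corresponds to a benign application that would be blocked from publication.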


Summary

Introduction

Malware detection techniques are constantly evolving because of the need to detect the presence of malware. Cybercriminals are constantly changing their techniques, so novel detection methods need to be developed. According to Statcounter, Android has a market share greater than 72% [1]. This popularity has driven growth in the Android malware ecosystem [2] [3]. All of this is related to the rise in smartphone users worldwide, more than 6 billion in 2021 [4]. Because the Google Play Store is the main distribution vector, novel techniques need to be developed to control who publishes and which applications are published. Some markets may enforce strict policies against adware, while others tolerate this type of PHA.

