Google Play and other application marketplaces host Android applications along with metadata. Among this metadata, the description and privacy policy explain an application's functionality and the permissions it requests, especially those related to sensitive information. Detecting inconsistencies between the permissions stated in an application's description and privacy policy and the permissions extracted from its source code (the APK file) helps users decide whether to install and use the application. In this research, we propose a new method based on a pre-trained language model to detect such inconsistencies. Related work focuses on models trained on large-scale datasets, especially for resource-rich languages such as English; low-resource languages such as Vietnamese, however, lack datasets for this task. To address this problem, we propose ViDPApp (Description and Privacy Policy of Applications in the Vietnamese domain), a high-quality dataset of more than 12,000 sentences manually annotated by humans, with an inter-annotator agreement (IAA) above 85%. In addition, we propose XLMR4MD, a new framework built on a pre-trained multilingual language model, which outperforms strong baseline models (LSTM, Bi-GRU-LSTM-CNN, WikiBERT, DistilBERT, mBERT, and PhoBERT) and achieves the best result, an F1 score of 84.04%, in detecting inconsistencies between Android application permissions and descriptions. The framework can be fine-tuned for 100 languages, which benefits low-resource languages such as Vietnamese. The dataset is available for research purposes.
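As a rough illustration of the approach described in the abstract, the sketch below shows how a multilingual pre-trained encoder such as XLM-R could be fine-tuned for sentence-level permission classification with the HuggingFace Transformers library. The checkpoint name, file names, column names, label count, and hyperparameters are assumptions for illustration only, not the paper's actual XLMR4MD configuration.

```python
# Hypothetical sketch: fine-tuning an XLM-R encoder to classify whether a sentence
# from an app description or privacy policy declares a given permission.
# File paths, column names, and hyperparameters are illustrative assumptions.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

MODEL_NAME = "xlm-roberta-base"  # multilingual encoder pre-trained on ~100 languages
NUM_LABELS = 2                   # e.g. "mentions permission" vs. "does not" (assumed)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_LABELS
)

# Assumed CSV files with a "sentence" column (Vietnamese text) and a "label" column.
dataset = load_dataset("csv", data_files={"train": "vidpapp_train.csv",
                                          "validation": "vidpapp_dev.csv"})

def tokenize(batch):
    # Truncate long privacy-policy sentences to a fixed maximum length.
    return tokenizer(batch["sentence"], truncation=True, max_length=256)

dataset = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="xlmr4md-sketch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```

Because XLM-R is pre-trained on roughly 100 languages, the same fine-tuning recipe can in principle be applied to other low-resource languages once a comparable annotated dataset is available.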