Corpora-based Password Guessing: An Efficient Approach for Small Training Sets

Xiaochun Gan, Meng Chen, Weili Han, Dong Li, Hu Chen, Zongyan Wu

doi:10.1109/icece54449.2021.9674437

Abstract

Password guessing plays an important role in studying the vulnerability of passwords and thereby improving security. Modern password guessing methods discover the password patterns of users in specific regions from a large number of leaked passwords. Most traditional methods, such as PCFG, Markov processes, and deep learning methods, rely only on the training set. Unlike in other application areas of machine learning, the training set for password guessing comes from leaked real password sets, such as Rockyou, CSDN, and VK. Traditional password guessing approaches are effective with large-scale training sets. However, the password sets leaked from users of less widely spoken languages or of specific organizations are very small, which makes it difficult for current methods that rely only on training sets to discover enough of the words used in passwords. To solve this problem, this paper proposes a corpus-based password guessing method. First, we analyze the common words and their categories in the leaked password sets of users in three different countries. On this basis, we propose an organization method for multi-language corpora and construct corpora of more than 3 million words. Second, we improve the traditional PCFG password segmentation method and describe password structure based on the corpora. Third, we estimate the probability of corpus words that do not appear in the training set using Laplace smoothing. Experiments show that our method produces a finer-grained structure than PCFG. When the size of the training set decreases, the cracking rate of PCFG drops significantly, whereas our method is only slightly affected and its cracking rate remains significantly higher than that of PCFG.
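The abstract does not give the exact smoothing formula, so the following is only a minimal sketch of standard additive (Laplace) smoothing, assuming words are counted in a segmented training set and the vocabulary is extended with the corpus. The function name `laplace_word_probs`, the parameter `alpha`, and the toy word lists are hypothetical and are not taken from the paper.

```python
from collections import Counter


def laplace_word_probs(training_words, corpus_vocab, alpha=1.0):
    """Additive (Laplace) smoothing over the union of training and corpus words.

    Assumption: each word w gets probability
    (count(w) + alpha) / (N + alpha * |V|),
    where N is the number of training tokens and V is the combined vocabulary.
    """
    counts = Counter(training_words)
    vocab = set(corpus_vocab) | set(counts)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


# Toy data (illustrative only): words segmented from a tiny training set,
# plus a corpus that contains words the training set never shows.
training_words = ["love", "love", "dragon"]
corpus_vocab = ["love", "dragon", "monkey", "shadow"]

probs = laplace_word_probs(training_words, corpus_vocab)
for word in sorted(probs, key=probs.get, reverse=True):
    print(f"{word}: {probs[word]:.3f}")
```

In this sketch, corpus words absent from the training set ("monkey", "shadow") receive a small non-zero probability instead of zero, so a guesser could still emit them when the training set is too small to contain them.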

Full Text