Exsense: Extract sensitive information from unstructured data

Yongyan Guo, Jiayong Liu, Wenwu Tang, Cheng Huang

doi:10.1016/j.cose.2020.102156

Abstract

Large-scale leaks of sensitive information have been reported frequently in recent years, and a leak can have serious consequences. Sensitive information leakage has therefore long been a question of great interest in the field of cybersecurity. However, most sensitive information resides in unstructured data, so extracting it from voluminous unstructured data has become one of the greatest challenges. To address this challenge, we propose ExSense, a method for extracting sensitive information from unstructured data that combines content-based and context-based extraction mechanisms. On the one hand, the method uses regular expression matching to extract sensitive information with predictable patterns. On the other hand, we build a model named BERT-BiLSTM-Attention that extracts sensitive information using natural language processing. This model uses the BERT algorithm for word embedding and extracts sensitive information with a BiLSTM and an attention mechanism, achieving an F1 score of 99.15%. Experimental results on real-world datasets show that ExSense achieves a higher detection rate than either method used individually (i.e., content analysis or context analysis alone). In addition, we analyze about a million texts from Pastebin, and the results show that ExSense can extract sensitive information from unstructured data effectively.
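
As an illustration of the content-based side described above, the following is a minimal sketch of regular expression matching for sensitive information with predictable patterns. The pattern set, the labels, and the function name `extract_by_pattern` are assumptions made for this example; the abstract does not list the rules ExSense actually uses.

```python
import re

# Hypothetical patterns for illustration only; these are not the
# rules used by ExSense, which the abstract does not specify.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ipv4": re.compile(
        r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}"
        r"(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
    ),
    "md5_hash": re.compile(r"\b[a-fA-F0-9]{32}\b"),
}


def extract_by_pattern(text):
    """Return {label: [matches]} for every pattern that matches the text."""
    found = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            found[label] = matches
    return found


sample = (
    "contact admin@example.com from 192.168.0.1, "
    "hash 5f4dcc3b5aa765d61d8327deb882cf99"
)
print(extract_by_pattern(sample))
```

Pattern matching of this kind only covers items with a fixed surface form. Sensitive content expressed in free text has no such form, which is the gap the context-based model is meant to fill.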

Full Text