Sensitive data identification for multi‐category and multi‐scenario data

Yuning Cui, Yonghui Huang, Yongbing Bai, Yuchen Wang, Chao Wang

doi:10.1002/ett.4983

Abstract

Sensitive data identification is a prerequisite for protecting critical user and business data. Traditional methods usually target only a single application scenario or a single type of data, which makes it difficult for them to meet the needs of enterprise‐level data protection. This paper introduces the end‐to‐end sensitive data identification system of Beike Inc. The system consists of a data identification and annotation platform, a dataset management platform, and a sensitive data identification model, which together provide different governance methods for batch data and streaming data. Specifically, we propose a sliding window‐based identification method for long text to improve the identification of streaming data. Evaluation results show that this method improves the identification of sensitive data in long text without degrading performance on short text. On the open source test dataset, the reported value reaches 94.15, which suggests that the method is applicable in diverse scenarios.
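
To make the sliding window idea concrete, the following is a minimal Python sketch, not the system described in this paper. It assumes a fixed window size and stride in characters, and it uses a hypothetical `toy_detector` (a regular expression that flags 11-digit runs) as a stand-in for the identification model. The names `sliding_windows`, `identify`, `size`, and `stride` are illustrative assumptions as well.

```python
import re
from typing import Callable, Iterator, List, Tuple

# (start, end, label), with offsets measured in the full text
Span = Tuple[int, int, str]


def toy_detector(chunk: str) -> List[Span]:
    """Hypothetical stand-in for a model: flags runs of exactly 11 digits."""
    pattern = r"(?<!\d)\d{11}(?!\d)"
    return [(m.start(), m.end(), "PHONE") for m in re.finditer(pattern, chunk)]


def sliding_windows(text: str, size: int, stride: int) -> Iterator[Tuple[int, str]]:
    """Yield (offset, chunk) pairs that cover the text with overlapping windows."""
    if size <= 0 or not 0 < stride <= size:
        raise ValueError("require size > 0 and 0 < stride <= size")
    start = 0
    while True:
        yield start, text[start:start + size]
        if start + size >= len(text):
            break
        start += stride


def identify(text: str,
             detector: Callable[[str], List[Span]],
             size: int = 64,
             stride: int = 32) -> List[Span]:
    """Run the detector on every window and merge the results by global offset."""
    found = set()
    for offset, chunk in sliding_windows(text, size, stride):
        for s, e, label in detector(chunk):
            # Skip spans that touch a cut window edge: they may be truncated.
            # An overlapping neighbour window sees the same span in full,
            # provided the overlap (size - stride) exceeds the span length.
            cut_left = s == 0 and offset > 0
            cut_right = e == len(chunk) and offset + len(chunk) < len(text)
            if cut_left or cut_right:
                continue
            found.add((offset + s, offset + e, label))
    return sorted(found)
```

A short input fits in a single window and is processed exactly as it would be without windowing, which mirrors the claim that the long text method does not cost anything on short text. For example, `identify("call 13800138000", toy_detector)` runs the detector once on the whole string, while a 131-character input is covered by four overlapping windows whose duplicate hits are merged through the shared offsets.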

Full Text