Automated Identification of Sensitive Financial Data Based on the Topic Analysis

Meng Li, Jiqiang Liu, Yeping Yang

doi:10.3390/fi16020055

Open Access

Journal: Future Internet	Publication Date: Feb 8, 2024
Citations: 1	License type: CC BY 4.0

Affiliation: Beijing Jiaotong University

Abstract

Data governance is an essential protection and management measure across the entire data life cycle. However, it still faces issues such as data security risks, data privacy breaches, and difficulties in data management and access control, all of which increase the risk of data breaches and abuse. Security classification and grading of data has therefore become an important task: it allows sensitive data to be identified accurately and managed with measures appropriate to each sensitivity level. This work starts from the problems in current data security classification and grading, such as inconsistent classification and grading standards, difficult data acquisition and sorting, and weak semantic information in data fields, to identify the limitations of current methods and the directions for improvement. The proposed method for automatically identifying sensitive financial data is based on topic analysis and combines Jieba word segmentation, word frequency statistics, the skip-gram model, K-means clustering, and other techniques. Keywords were selected with expert assistance to improve accuracy. The method was trained and tested on the descriptive text library and real business data of a Chinese financial institution to demonstrate its effectiveness and usefulness. The evaluation indicators showed that the method is effective for data security classification. The proposed method addresses the challenge of assigning sensitivity levels to texts with limited semantic information, overcomes the limitations on extending the model across different domains, and provides an optimized application model. These results also point toward real-time updating of the method.
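The abstract names the stages of the pipeline but not how they are connected. The sketch below shows one plausible arrangement, under stated assumptions: the field descriptions are already segmented (the step Jieba would perform on Chinese text), frequency-ranked tokens stand in for the expert-selected keywords, and small hand-written 2-D vectors stand in for trained skip-gram embeddings. All data, function names, and parameters here are hypothetical and are not taken from the paper.

```python
from collections import Counter
import math

def top_keywords(token_lists, k):
    """Word-frequency statistics: return the k most frequent tokens."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return [tok for tok, _ in counts.most_common(k)]

def field_vector(tokens, keywords, embeddings):
    """Average the vectors of the keywords that occur in one field description."""
    vecs = [embeddings[t] for t in tokens if t in keywords and t in embeddings]
    if not vecs:
        # No keyword present: fall back to a zero vector of the right size.
        dim = len(next(iter(embeddings.values())))
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def kmeans(points, k, iterations=20):
    """Plain K-means with deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical, pre-segmented field descriptions (assumption: Jieba output).
descriptions = [
    ["customer", "card", "number"],
    ["customer", "card", "password"],
    ["branch", "name"],
    ["branch", "address"],
]

# Hypothetical 2-D vectors standing in for skip-gram embeddings.
embeddings = {
    "customer": (0.9, 0.1),
    "card": (1.0, 0.0),
    "branch": (0.0, 1.0),
}

keywords = top_keywords(descriptions, 3)
vectors = [field_vector(d, keywords, embeddings) for d in descriptions]
labels = kmeans(vectors, k=2)
print(keywords, labels)
```

In this toy run the two customer-related fields fall into one cluster and the two branch-related fields into the other. Mapping each cluster to a sensitivity level would still require a labeling rule or expert judgment, which the sketch does not attempt.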

Full Text