Abstract
With the proliferation of big data technology, organizations find themselves awash in an overwhelming ocean of data. The sheer volume leaves them ill-equipped to respond to frequent data breaches, because they lack efficient data security management. Data classification, which categorizes information types and applies different protective measures at different classification levels, has become a hot topic as a cornerstone of data protection in recent years, especially in China. Both the text and the tables of promulgated data classification-related regulations (for simplicity, laws, regulations, policies, and standards are collectively referred to as “regulations”) contain a wealth of valuable information that can guide data classification work. To best assist data practitioners, in this paper we automatically “grasp” expert experience on how to classify data by analyzing such regulations. We design a framework, GENONTO, that automatically extracts data classification practices (DCPs), such as information types and their corresponding sensitivity levels, in order to construct an information type lexicon and to encode a generic ontology on top of 38 real-world regulations promulgated in China. GENONTO employs machine learning techniques and natural language processing (NLP) to parse unstructured text and tables. To our knowledge, GENONTO is the first work that extracts critical information such as the category and sensitivity of information types from regulations and organizes it into a structured ontology that characterizes the subsumptive relations between different information types. Our research helps provide a well-defined, integrated view across regulations and bridges the gap between what experts say and what data practitioners do.