Abstract
Personally Identifiable Information (PII) has gained much attention with the rapid development of technologies and the exploitation of information relating to individuals. Corporations and other organizations store a large amount of information that is primarily disseminated in the form of emails, which include personal information of users, employees, and customers. The security aspects of PII storage have been ignored, raising serious concerns about individual privacy. A significant concern arises about comprehending the responsibilities regarding the uses of PII. However, in real-world scenarios, email data is unstructured text, and detecting PII in such a large unstructured text corpus is quite challenging. This paper presents an intelligent clustering approach for automatically detecting PII in a large text corpus. The focus of the proposed study is to design a model that receives text content and detects possible PII attributes. To that end, the paper introduces a clustering-based PII model (C-PPIM) built on NLP and unsupervised learning. NLP is used to perform topic modeling, and Byte-mLSTM, an alternative sequence model, is implemented to address the clustering problems in PII detection. The proposed model is compared against existing hierarchical clustering in terms of silhouette and cohesion scores. The results indicate that the proposed system effectively highlights significant PII attributes and has considerable scope for real-time implementation, whereas existing techniques are too expensive to operate in real-time environments.
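As an illustration of the silhouette score used in the comparison above, the following is a minimal pure-Python sketch. It is not the paper's evaluation code: the function name `silhouette_score`, the Euclidean distance, and the toy 2-D points (standing in for hypothetical document embeddings) are all assumptions made for this example.

```python
from math import dist


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    # Group points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, so it does not affect the sum).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: two well-separated groups (hypothetical embeddings).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # matches the true grouping
bad = silhouette_score(points, [0, 1, 0, 1])   # mixes the two groups
```

A score close to 1 means points sit much nearer to their own cluster than to any other, so `good` should come out well above `bad` here; a clustering of text data would be scored the same way once documents are mapped to vectors.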
Highlights
The progressive digitization of functional domains of various processes in individual human and business contexts produces various data types
To understand the risks associated with privacy exposure, several efforts have been made in the literature to detect "Personally Identifiable Information" (PII) disclosure and leaks
Natural language processing (NLP) and Byte-mLSTM mechanisms are used to design an effective model for PII leak detection. The proposed system uses topic modeling to segment and group data storage categories that account for disclosure of potential or most vulnerable PII
Summary
The progressive digitization of functional domains of various processes in individual human and business contexts produces various data types. The complexities and challenges of designing effective privacy preservation methods depend solely on the data format, its size, and the data flow into the application. A popular term in the context of designing security models to preserve privacy is "Personally Identifiable Information" (PII). In real-world cases, organizations mostly maintain a large corpus that stores PII in textual form, such as emails, contracts, IPv4 and MAC addresses, and telephone numbers. Rule-based approaches are not well suited to identifying PII in a large unstructured text corpus because they mainly deal with structured data formats. With the advent of machine learning (ML) models and advances in natural language processing (NLP), the PII of individuals can be efficiently identified in a large unstructured text corpus, a task that the existing rule-based solutions discussed so far cannot address [11-13].
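To make the limitation of rule-based detection concrete, here is a minimal sketch of such a detector for the structured identifiers named above. It is an illustrative baseline, not the method proposed in the paper: the pattern table `PII_PATTERNS`, the helper `find_pii`, and the simplified phone format are all assumptions made for this example.

```python
import re

# Hypothetical, simplified patterns; real-world formats vary far more widely.
_OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "ipv4": re.compile(r"\b(?:" + _OCTET + r"\.){3}" + _OCTET + r"\b"),
    "mac": re.compile(r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b"),
    # Assumed layout: optional "+", country code, then 3-3-4 digit groups.
    "phone": re.compile(r"(?<!\w)\+?\d{1,3}[ -]\d{3}[ -]\d{3}[ -]\d{4}(?!\w)"),
}


def find_pii(text):
    """Return (kind, matched_text) pairs for every pattern hit in text."""
    hits = []
    for kind, pattern in PII_PATTERNS.items():
        hits.extend((kind, match.group()) for match in pattern.finditer(text))
    return hits


structured = (
    "Contact jane.doe@example.com or +1 555-010-9999 "
    "from host 192.168.0.12 (00:1A:2B:3C:4D:5E)."
)
free_text = "She lives next to the bakery on Elm Street."

structured_hits = find_pii(structured)
free_text_hits = find_pii(free_text)
```

The patterns pick up the email address, phone number, IPv4 address, and MAC address in `structured`, but return nothing for `free_text`, even though a description of where someone lives can identify a person. Free-form disclosures of this kind are what the learning-based approach is meant to cover.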
More From: International Journal of Advanced Computer Science and Applications