Abstract

Data confidentiality, integrity, and availability are the ultimate goals of all information security mechanisms. However, most of these mechanisms do not proactively protect sensitive data; rather, they operate under predefined policies and conditions to protect data in general. A few systems, such as anomaly-based intrusion detection systems (IDS), can work independently without much administrative intervention, but they pay no particular attention to the sensitivity of the data. Newer mechanisms called data leakage prevention (DLP) systems have been developed to mitigate the risk of sensitive data leakage. Current DLP systems mostly use data fingerprinting and exact and partial document matching to classify sensitive data. These approaches can be seriously limited because they are susceptible to data misidentification. In this paper, we investigate the use of N-gram statistical analysis for data classification. Our method uses N-gram frequencies to classify documents into distinct categories. We use simple taxicab geometry to compute the similarity between documents and existing categories. Moreover, we examine the effect of removing the most common words and connecting phrases on the overall classification. We aim to compensate for the limitations of current data classification approaches used in the field of data leakage prevention. We show that our method is capable of correctly classifying up to 90.5% of the tested documents.
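To make the idea concrete, the following is a minimal sketch of frequency-based N-gram classification with a taxicab (L1) distance. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say whether N-grams are taken over characters or words, what value of N is used, or how category profiles are built, so character trigrams, relative frequencies, and the function names `ngram_profile`, `taxicab_distance`, and `classify` are all hypothetical choices. The optional `stop_words` argument stands in for the removal of common words and connecting phrases.

```python
from collections import Counter


def ngram_profile(text, n=3, stop_words=None):
    """Return relative frequencies of character n-grams in `text`.

    Assumption: character n-grams over lowercased, whitespace-normalized
    text; words listed in `stop_words` are dropped before n-grams are built.
    """
    tokens = text.lower().split()
    if stop_words:
        tokens = [t for t in tokens if t not in stop_words]
    cleaned = " ".join(tokens)
    grams = [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}


def taxicab_distance(p, q):
    """Sum of absolute frequency differences over all n-grams in either profile."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def classify(document, category_profiles, n=3, stop_words=None):
    """Return the category whose profile is closest to the document's profile."""
    doc_profile = ngram_profile(document, n, stop_words)
    return min(
        category_profiles,
        key=lambda name: taxicab_distance(doc_profile, category_profiles[name]),
    )


# Toy category profiles built from made-up sample text (illustrative only).
categories = {
    "finance": ngram_profile("invoice payment bank account balance transfer"),
    "medical": ngram_profile("patient diagnosis treatment hospital clinical dose"),
}
label = classify("bank transfer payment", categories)
print(label)
```

A smaller distance means the document's N-gram distribution is closer to that category's distribution; because both profiles are normalized to sum to 1, the distance ranges from 0 (identical profiles) to 2 (no N-grams in common).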
