A Turkish language based data leakage prevention system

Yavuz Canbay, Hatice Yazici, Seref Sagiroglu

doi:10.1109/isdfs.2017.7916514

Publication Date: Apr 1, 2017

Citations: 10

Affiliation: Gazi University

Abstract

Data is the most valuable asset of an organization and needs to be secured. Organizations should pay particular attention to data leakage, because the leakage of sensitive data can cause greater damage than expected. Many factors cause data leakage, but recent research has shown that insider attacks account for the highest share of incidents and that most data leakage incidents occur over network-based channels such as e-mail, cloud services and instant messaging. Insiders may transform or modify sensitive data in order to leak it from a system without triggering any alert from security systems. This paper aims to detect modification attacks on sensitive words, to address the data leakage problem, and to propose a cascaded Data Leakage Prevention (DLP) system for the Turkish language. The system consists of a training phase and a detection phase. In the training phase, a list of sensitive words is generated from sets of sensitive documents. In the detection phase, the main purpose is to detect sensitive content that an attacker has modified in order to bypass security systems. The system was designed around the following types of attack: adding, deleting and changing characters in sensitive words, deleting the whitespace on both sides of sensitive words, and adding whitespace in the middle of sensitive words. The Boyer-Moore (BM) algorithm was used to search for exact sensitive strings exposed to whitespace attacks, and the Smith-Waterman (SW) sequence alignment algorithm was employed to detect modified-string attacks. The TF-IDF method was used to extract the sensitive words from sensitive documents, and Latent Semantic Indexing (LSI) was used to model document topics. Zemberek was used to extract and analyze Turkish text. The results show that the proposed DLP system offers a plausible solution for the Turkish language.
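
The two detection steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the scoring parameters (match, mismatch, gap), the 0.6 threshold and the sample word "gizli" (Turkish for "secret") are all assumptions chosen for illustration. Python's built-in substring search stands in for Boyer-Moore here, and the Zemberek, TF-IDF and LSI stages are omitted.

```python
def has_whitespace_attack(text, word):
    """Exact search for `word` after removing all whitespace from `text`.

    Covers whitespace inserted inside the word and whitespace deleted
    around it. Python's `in` operator is used in place of Boyer-Moore.
    """
    stripped = "".join(text.split())
    return word in stripped


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b.

    The scoring values are illustrative assumptions, not the paper's.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment
                prev[j - 1] + step, # match or substitution
                prev[j] + gap,      # character present only in a
                curr[j - 1] + gap,  # character present only in b
            )
            best = max(best, curr[j])
        prev = curr
    return best


def is_modified_match(text, word, threshold=0.6):
    """Flag `text` if some region aligns closely with `word`.

    The score is normalized by the score of a perfect match
    (2 per character with the default parameters).
    """
    perfect = 2 * len(word)
    return smith_waterman_score(text, word) / perfect >= threshold


sensitive = "gizli"
print(has_whitespace_attack("bu belge g i z li bilgidir", sensitive))
print(is_modified_match("rapor gixli kalsin", sensitive))
print(is_modified_match("kitap okudum", sensitive))
```

Removing every whitespace character before the exact search can also join unrelated neighboring words into a false match, so a practical system would likely need an additional check; the sketch leaves that out.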