Identifying File Similarity in Large Data Sets by Modulo File Length

Yongtao Zhou, Junjie Xie, Xiaoguang Chen, Yuhui Deng

doi:10.1007/978-3-319-11194-0_11

Abstract

Identifying file similarity is important for data management. Sampling files is a simple and effective way to identify file similarity. However, the traditional sampling algorithm (TSA) is very sensitive to file modification: for example, a single bit shift can cause similarity detection to fail. Many research efforts have been invested in solving or alleviating this problem. This paper proposes a Position-Aware Sampling (PAS) algorithm that identifies file similarity in large data sets by modulo file length. The method is very effective in dealing with file modification when performing similarity detection. Comprehensive experimental results demonstrate that PAS significantly outperforms a well-known similarity detection algorithm, simhash, in terms of precision and recall. Furthermore, the time overhead and the CPU and memory usage of PAS are much lower than those of simhash.
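
The sensitivity of fixed-position sampling described in the abstract can be illustrated with a minimal sketch. This is not the authors' TSA or PAS implementation: the function names, the stride of 256 bytes, the 8-byte sample length, and the use of MD5 are all assumptions chosen only for illustration. The sketch hashes chunks read at fixed offsets and shows how inserting a single byte at the front of a file changes every sample.

```python
import hashlib


def sample_fingerprint(data: bytes, stride: int = 256, sample_len: int = 8) -> list[str]:
    """Hash fixed-length chunks read at fixed offsets 0, stride, 2*stride, ...

    Hypothetical helper: the offsets depend only on absolute position,
    which is what makes this scheme fragile under insertions.
    """
    last_start = len(data) - sample_len
    return [
        hashlib.md5(data[offset:offset + sample_len]).hexdigest()
        for offset in range(0, last_start + 1, stride)
    ]


def similarity(a: list[str], b: list[str]) -> float:
    """Fraction of sample hashes that agree position by position."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b), 1)


# A 1024-byte "file" and a copy with one byte inserted at the front.
original = bytes(i % 251 for i in range(1024))
shifted = b"\xff" + original

fp_original = sample_fingerprint(original)
fp_shifted = sample_fingerprint(shifted)

same_file_score = similarity(fp_original, fp_original)
shifted_file_score = similarity(fp_original, fp_shifted)
```

Here the two files differ by a single inserted byte, yet every sampled chunk in the shifted copy starts one byte earlier in the original content, so none of the sample hashes match. This is the failure mode that the abstract says position-aware sampling is designed to address.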

Similar Papers

EPAS: A Sampling Based Similarity Identification Algorithm for the Cloud
Yongtao Zhou ... Laurence T Yang
IEEE Transactions on Cloud Computing | VOL. 6
01 Jul 2018

Research on Fuzzy Clustering Algorithms for Large Dimensional Data Sets Under Cloud Computing
Shuang-Cheng Jia ...
01 Jan 2020

A resistance outlier sampling algorithm for imbalanced data prediction
Xiaoying Pan ... Jiahao Huang
Intelligent Data Analysis | VOL. 26
18 Apr 2022

Sparsity Representation of Beat Signal in Weather Radar for Compressive Sampling
Rita Purnamasari ... Irma Zakia
01 Jul 2018
