Abstract

With the rapid advancement of digital technologies and the exponential growth of digital artifacts, automated filtering of cybercrime data from a variety of sources has become a pressing need for digital investigations. Many techniques, primarily based on the "Approximate Matching" approach, have been proposed in the literature to address this challenging task. In 2019, Chang et al. proposed one such algorithm, FbHash: A New Similarity Hashing Scheme for Digital Forensics, which was shown to produce the best correlation results among existing techniques and, unlike its peers, to resist active adversary attacks. However, no performance analysis of the tool was given. In this work, we show that the current design of FbHash is slower and more memory-intensive than its peers. We then propose a novel Bloom filter based efficient version, FbHash-E, which has a much lower memory footprint and is computationally faster than FbHash. While the speed of FbHash-E is comparable to other state-of-the-art tools, it remains resistant, like its predecessor, to attacks based on intentional, intelligent modifications designed to fool the tool, unlike its peers. These improvements make FbHash-E fit for practical use cases. We perform various modification tests to evaluate the security and correctness of FbHash-E. Our experimental results show that our scheme is secure against active attacks and detects similarity with 87% accuracy, only a 3% drop compared to FbHash. We demonstrate the sensitivity and robustness of our proposed scheme through a variety of containment and resemblance tests. We show that FbHash-E can correlate files containing up to 10% random noise with a 100% detection rate and can detect commonality as small as 1% between two documents with an appropriate similarity score. We also show that our proposed scheme performs best at identifying similarities between different versions of software or program files. Finally, we introduce a new test, the consistency test, and show that our tool produces consistent results across all files within a fixed category with very low standard deviation, unlike other tools whose standard deviation varies significantly under a fixed test. This indicates that our tool is more robust and stable against different modifications.
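As a rough illustration of the data structure underlying FbHash-E's efficiency claim, the sketch below shows a generic Bloom filter with constant-time insertion and probabilistic membership queries. It is not the authors' implementation; the bit-array size, number of hashes, and salted BLAKE2b hashing are illustrative assumptions.

```python
# Illustrative only: a minimal Bloom filter sketch, not the FbHash-E code.
# Assumptions: m-bit array in a bytearray, k positions derived by salting BLAKE2b.
import hashlib


class BloomFilter:
    def __init__(self, m_bits=8192, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions by salting the hash with the hash index.
        for i in range(self.k):
            digest = hashlib.blake2b(item, salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest[:8], "little") % self.m

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes):
        # May return false positives, never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


# Example: insert chunk fingerprints of a file and query membership.
bf = BloomFilter()
bf.add(b"chunk-1")
print(b"chunk-1" in bf, b"chunk-2" in bf)  # True, (very likely) False
```

Storing chunk fingerprints in such a fixed-size bit array, rather than as explicit per-chunk digests, is what allows a Bloom filter based design to trade a small false-positive probability for a much smaller memory footprint and faster comparisons.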
