Abstract

Most traditional digital forensic techniques identify irrelevant files in a corpus using keyword search, frequent-hash, frequent-path, and frequent-size methods. These methods rely on Message Digest and Secure Hash Algorithm-1 (SHA-1) hashes, which are prone to hash collisions. Threshold criteria based on frequent file sizes lead to imprecise threshold values, so an increased number of irrelevant files is evaluated. The blacklisted keywords used in forensic search are literal and non-lexical, which increases false-positive search results and fails to disambiguate unstructured text. Consequently, many extraneous files are considered for further investigation, exacerbating the time lag. Moreover, the non-availability of standardized, forensically labeled data results in O(2^n) time complexity during the file classification process. This research proposes a three-tier Keyword Metadata Pattern framework to overcome these concerns. Initially, a Secure Hash Algorithm-256 (SHA-256) hash is computed for the entire corpus, together with a custom regex and stop-words module, to overcome hash collisions and imprecise threshold values and to eliminate recurrent files. Blacklisted keywords are then constructed by identifying vectorized words in close proximity, overcoming the drawbacks of traditional keyword search and reducing false-positive results. Dynamic, forensically relevant patterns based on massive password datasets are designed to search for unique patterns that identify significant files and reduce the time lag. Based on the tier-2 results, files are automatically pre-classified in O(log n) complexity, and the system is trained with a machine learning model. Finally, experimental evaluation showed the proposed system to be highly effective, outperforming the existing two-tier model at finding relevant files through automated labeling and classification in O(n log n) complexity. The proposed model eliminated 223K irrelevant files and reduced the corpus by 4.1% in tier-1, identified 16.06% of sensitive files in tier-2, and classified files with 91% precision, 95% sensitivity, 91% accuracy, and 0.11% Hamming loss compared to the two-tier system.
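To make the tier-1 step concrete, below is a minimal sketch of SHA-256-based corpus deduplication, assuming a local directory of corpus files. The function names and the use of Python's standard hashlib module are illustrative choices, not the authors' implementation.

import hashlib
from pathlib import Path

def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def deduplicate_corpus(corpus_dir: str) -> dict[str, list[Path]]:
    """Group corpus files by SHA-256 digest so that recurrent copies
    can be dropped; only the first path per digest needs further
    examination."""
    groups: dict[str, list[Path]] = {}
    for path in Path(corpus_dir).rglob("*"):
        if path.is_file():
            groups.setdefault(sha256_of_file(path), []).append(path)
    return groups

# Usage: any digest mapped to more than one path marks recurrent files.
# duplicates = {h: ps for h, ps in deduplicate_corpus("corpus").items()
#               if len(ps) > 1}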
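The tier-2 idea of constructing the blacklist from vectorized words in close proximity could be realized roughly as in the following sketch, which trains a gensim Word2Vec model on pre-tokenized corpus text. This is one plausible embedding choice, not necessarily the one the authors used, and the seed keywords and parameter values are hypothetical.

from gensim.models import Word2Vec

def expand_blacklist(tokenized_docs: list[list[str]],
                     seed_keywords: list[str],
                     neighbours: int = 5) -> set[str]:
    """Grow a keyword blacklist with terms whose embeddings lie close
    to the seed terms, reducing misses from purely literal matching."""
    model = Word2Vec(sentences=tokenized_docs, vector_size=100,
                     window=5, min_count=2, workers=4)
    expanded = set(seed_keywords)
    for word in seed_keywords:
        if word in model.wv:  # skip seeds absent from the vocabulary
            expanded.update(w for w, _ in
                            model.wv.most_similar(word, topn=neighbours))
    return expanded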