Abstract

File fragment classification is an essential problem in digital forensics. Although several attempts have been made to solve this challenging problem, a general solution has not been found. In this work, we propose a hierarchical machine-learning-based approach with optimized support vector machines (SVM) as the base classifiers for file fragment classification. This approach consists of more general classifiers at the top level and more specialized, fine-grained classifiers at the lower levels of the hierarchy. We also propose a primitive taxonomy for file types that can be used to perform hierarchical classification. We evaluate our model on a dataset of 14 file types, with 1000 fragments of 512 bytes from each file type, derived from the govdocs1 corpus, a subset of the publicly available Digital Corpora. Our experiment shows results comparable to those in the present literature, with an average accuracy of 67.78% and an F1-measure of 65% using 10-fold cross-validation. We then improve on the hierarchy and find better results, with an increase in the F1-measure of 1%. Finally, we make our assessment and observations, then conclude the paper by discussing the scope of future research.

Highlights

  • It is essential for a forensic investigator to be able to look at an artifact, which can be a network packet or a piece of data, and readily recognize what kind of data it is.

  • We take a hierarchical classification approach, using support vector machines (SVM) as our base classifier. We find that this approach, unrefined, opens up a different way of looking at the file fragment classification problem.

  • Upon optimizing the SVM parameters using grid search, we found that we got the best results for the two parameters of the Radial Basis Function (RBF)

Summary

Introduction

It is essential for a forensic investigator to be able to look at an artifact, which can be a network packet or a piece of data, and readily recognize what kind of data it is. In the machine learning formulation of the problem, each file type is treated as a category (class), and features thought to characterize the file fragment are extracted. We propose a technique called hierarchical classification to classify file fragments without the help of the file signatures present in headers and footers. We apply this technique to 14 different file types, using support vector machines (SVM) [21] as our base classifiers. We find that this approach, unrefined, opens up a different way of looking at the file fragment classification problem. We compare our results with existing techniques proposed in the literature, conclude the paper, and describe future work.
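The two ideas in this paragraph, extracting features from a raw fragment and routing it through a hierarchy of classifiers, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The features (byte unigram frequencies, entropy, mean byte value) follow the feature headings listed in the outline, but their scaling is an assumption. The names `fragment_features`, `NearestCentroid`, and `TwoLevelClassifier` are hypothetical, the hierarchy is limited to two levels, and a simple nearest-centroid classifier stands in for the SVM so that the block needs only the standard library.

```python
import math
from collections import Counter, defaultdict


def fragment_features(fragment: bytes) -> list:
    """Byte unigram frequencies, Shannon entropy, and mean byte value (scaled to [0, 1])."""
    if not fragment:
        raise ValueError("fragment must be non-empty")
    n = len(fragment)
    counts = Counter(fragment)
    freqs = [counts.get(b, 0) / n for b in range(256)]
    entropy = -sum(p * math.log2(p) for p in freqs if p > 0)  # bits per byte, at most 8
    mean_byte = sum(fragment) / n
    return freqs + [entropy / 8.0, mean_byte / 255.0]


class NearestCentroid:
    """Stand-in base classifier (NOT an SVM); any object with fit/predict can replace it."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, v in enumerate(x):
                acc[i] += v
        self.centroids = {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))
        return [min(self.centroids, key=lambda lab: sq_dist(x, self.centroids[lab])) for x in X]


class TwoLevelClassifier:
    """Top level predicts a coarse group; a per-group classifier then predicts the file type."""

    def __init__(self, taxonomy, make_classifier=NearestCentroid):
        # taxonomy maps a group name to the file types it contains (assumed input format)
        self.group_of = {t: g for g, types in taxonomy.items() for t in types}
        self.make_classifier = make_classifier

    def fit(self, X, y):
        groups = [self.group_of[label] for label in y]
        self.top = self.make_classifier().fit(X, groups)
        by_group = defaultdict(lambda: ([], []))
        for x, label, g in zip(X, y, groups):
            by_group[g][0].append(x)
            by_group[g][1].append(label)
        self.leaf = {g: self.make_classifier().fit(xs, ys) for g, (xs, ys) in by_group.items()}
        return self

    def predict(self, X):
        # Route each sample to the specialized classifier of its predicted group.
        return [self.leaf[g].predict([x])[0] for x, g in zip(X, self.top.predict(X))]
```

The `make_classifier` argument is the point where an SVM could be plugged in, since `TwoLevelClassifier` only requires that the factory return an object with `fit` and `predict`. One consequence of this design is that an error at the top level cannot be corrected lower down, which is why the choice of taxonomy matters.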

File Fragment Classification
Hierarchical Classification
Hierarchy Definition
Feature Descriptions
Unigram Count Distribution
Entropy and Bigram Distribution
Mean Byte Value
Precision
Experiment Details
Evaluation Metrics
Comparison with Previous Works and Discussion
Conclusions and Future Works