File Fragment Classification with Focus on OLE and OOXML classes

K Skracic, P Pale, J Petrovic, K Milicic, F Rukavina

doi:10.23919/mipro48935.2020.9245428

Abstract

Classifying file fragments, that is, determining a file's type from the data fragments available, is a crucial step in digital forensics. Apart from manual forensic examination, the file fragment classification methods explored to date rely on machine learning techniques. These methods most commonly use features based on byte frequency distribution as inputs to artificial neural networks. This paper explores some new approaches to file fragment classification. Files in the older MS Office formats (doc, ppt, and xls) and in the newer MS Office formats (docx, pptx, and xlsx), which were previously shown to be difficult to tell apart, were grouped into two separate higher-level classes because of similarities in their file structure. The paper then explores different approaches to distinguishing between the subtypes within each of these two higher-level classes. The results suggest that the proposed approach can yield small increases in classification accuracy.
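To make the byte frequency distribution feature concrete, the following is a minimal sketch of how such a feature vector could be computed for a single fragment, together with the grouping of the six file types into the two higher-level classes. It is an illustration under stated assumptions, not the authors' implementation: the function name `byte_frequency_distribution`, the mapping `HIGHER_LEVEL_CLASS`, and the sample fragment are all hypothetical.

```python
from collections import Counter

# Hypothetical mapping of file types to the two higher-level classes
# described in the abstract; the label strings are illustrative.
HIGHER_LEVEL_CLASS = {
    "doc": "OLE", "ppt": "OLE", "xls": "OLE",
    "docx": "OOXML", "pptx": "OOXML", "xlsx": "OOXML",
}


def byte_frequency_distribution(fragment: bytes) -> list:
    """Return 256 relative frequencies, one per possible byte value."""
    if not fragment:
        # An empty fragment carries no information; return all zeros.
        return [0.0] * 256
    counts = Counter(fragment)  # iterating over bytes yields ints 0..255
    total = len(fragment)
    return [counts.get(value, 0) / total for value in range(256)]


# Made-up 8-byte fragment, used only to exercise the function.
fragment = bytes([0x50, 0x4B, 0x03, 0x04]) + b"\x00" * 4
bfd = byte_frequency_distribution(fragment)
```

A vector like `bfd` could then serve as the input to a classifier that first predicts the higher-level class and then the subtype within it. The classifier itself is outside the scope of this sketch.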

Full Text