A common task in computer forensics is to recover files that lack file system metadata. In the case of searching for file fragments in unallocated space, file carving is the most often used method, which is ideal for unfragmented data. However, such methods and the tools based on them are ineffective for recovering OOXML files with a high fragmentation level. These methods do not provide reliable determination of the correct order of fragments. Techniques for reconstructing documents based on the analysis of words and phrases are also ineffective in fragmented OOXML documents. The main reason is that OOXML files are ZIP archives and, as a result, store data on disk space in a compressed form. This paper proposes a syntactical method for reconstructing OOXML documents based on knowledge about the internal structure of this file type, regardless of their content. The details of the implementation of the reconstruction algorithm and the peculiarities of restoring certain types of local elements of the document were considered. The efficiency of the algorithm was tested on the Govdocs1 and NapierOne datasets. The proposed method was applied to 4096-byte data blocks, which correspond to the standard cluster size in different file systems. The experimental results confirmed the method's suitability for practical use with 82.97 % of recovered files, including 34.38 % reconstructed completely, 0.43 % excluding the last 21 bytes at most, and another 48.16 % excluding embeddings that require other approaches. In the latter case, obtaining a fully working document without displaying graphic images and other contents of different embeddings is possible. The presence in OOXML files of CRC-32 hashes of the uncompressed data stream of each local element allows us to confirm the correctness of information recovery and its integrity unambiguously. Simultaneously, the method's effectiveness depends mainly on data verification methods during the reconstruction of local elements that occupy at least three clusters in the file. Therefore, this method is supposed to be improved by developing new mechanisms for verifying XML elements.
Read full abstract