Dataset for file fragment classification of textual file formats

Fatemeh Mansouri Hanis, Mehdi Teimouri

doi:10.1186/s13104-019-4837-4

Abstract

Objectives: Classification of textual file formats is a topic of interest in network forensics. A few datasets of files with textual formats are publicly available. However, there is no public dataset of file fragments of textual file formats. As a result, a major research challenge in file fragment classification of textual file formats is comparing the performance of the developed methods on the same datasets.

Data description: In this study, we present a dataset that contains file fragments of five textual file formats: Binary file format for Word 97–Word 2003, Microsoft Word open XML format, portable document format, rich text file, and standard text document. The dataset contains file fragments in three different languages: English, Persian, and Chinese. For each pair of file format and language, 1500 file fragments are provided, so the dataset contains 22,500 file fragments in total.

Highlights

Many studies have been carried out in the field of file fragment classification of textual file formats [1–6]
However, there is no public dataset of file fragments of textual file formats
We present a dataset that contains file fragments of five textual file formats: Binary file format for Word 97–Word 2003 (DOC), Microsoft Word open XML format (DOCX), portable document format (PDF), rich text file (RTF), and standard text document (TXT)

Summary

Introduction

Many studies have been carried out in the field of file fragment classification of textual file formats [1–6]. A few datasets of files with different formats are publicly available [7]. However, there is no public dataset of file fragments of textual file formats. We present a dataset that contains file fragments of five textual file formats: Binary file format for Word 97–Word 2003 (DOC), Microsoft Word open XML format (DOCX), portable document format (PDF), rich text file (RTF), and standard text document (TXT). The dataset includes file fragments in three different languages: English (EN), Persian (FA), and Chinese (CH).
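
The dataset size follows directly from the structure described here and in the abstract: five formats, three languages, and 1500 fragments per format-language pair. The short Python sketch below enumerates the pairs and checks the arithmetic. The variable and function names are illustrative assumptions, not identifiers from the dataset itself, and the sketch says nothing about how the files are actually organized on disk.

```python
from itertools import product

# Format and language codes as given in the text.
FORMATS = ["DOC", "DOCX", "PDF", "RTF", "TXT"]
LANGUAGES = ["EN", "FA", "CH"]

# Fragments per (format, language) pair, as stated in the abstract.
FRAGMENTS_PER_PAIR = 1500


def expected_counts():
    """Map each (format, language) pair to its expected fragment count."""
    return {pair: FRAGMENTS_PER_PAIR for pair in product(FORMATS, LANGUAGES)}


counts = expected_counts()
total = sum(counts.values())
print(len(counts), total)  # prints: 15 22500
```

A table like `counts` could serve as a sanity check after downloading the data, by comparing it with the number of fragments actually found for each pair.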

Journal: BMC Research Notes
Publication Date: Dec 1, 2019
Citations: 7
License type: open-access

Similar Papers

Dataset for file fragment classification of image file formats
Reyhane Fakouri ... Mehdi Teimouri
BMC Research Notes | VOL. 12
27 Nov 2019

Dataset for file fragment classification of audio file formats
Atieh Khodadadi ... Mehdi Teimouri
BMC Research Notes | VOL. 12
01 Dec 2019

Dataset for file fragment classification of video file formats
Narges Sadeghi ... Mehdi Teimouri
BMC Research Notes | VOL. 13
15 Apr 2020

Anomaly Detection in File Fragment Classification of Image File Formats
Zahra Seyedghorban ... Mehdi Teimouri
28 Oct 2021
