Abstract

This research proposes an approach to building a malicious PDF detection system using the Random Forest algorithm. It focuses on the Evasive-PDFMal2022 dataset, which is updated and extended with additional data sources. The extended dataset includes malicious PDF files from CVE and Exploit-DB, non-malicious PDF files, and files from private collections and the Technically-oriented PDF Collection. Features were extracted using the PDFID tool, yielding 29 structural features that served as input to the Random Forest classifier. Experiments showed that the model trained on the extended dataset achieved 98% accuracy, equivalent to the model trained on Evasive-PDFMal2022 alone, albeit with a small decrease in recall for the benign class. In addition, this research involved the creation of a website for metadata extraction and malicious PDF detection. Recognition goes to the dataset contributors, the tool developers, and the dataset providers from NIST and Exploit-DB. Overall, this research increased the representation and diversity of the dataset, achieved good model training results, expanded detection coverage from 3 malicious PDF variants to 13, and produced a practical tool for metadata extraction and malicious PDF detection. Nonetheless, further development may be required to improve detection performance in more complex scenarios.
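To illustrate what a structural feature looks like, the following minimal sketch counts a few PDF keywords in a file's raw bytes, in the spirit of the PDFID tool. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not list the 29 features, so `KEYWORDS` is a hypothetical subset, and `count_keywords` is a hypothetical helper that ignores name obfuscation and other cases a real tool must handle.

```python
import re

# Hypothetical subset of PDFiD-style keywords; the paper's full
# 29-feature set is not given in the abstract.
KEYWORDS = [
    b"obj", b"endobj", b"stream", b"endstream", b"/Page",
    b"/JS", b"/JavaScript", b"/OpenAction", b"/AA",
    b"/Launch", b"/EmbeddedFile",
]


def _pattern(keyword):
    # Bare words such as "obj" must not match inside "endobj", so they
    # get a lookbehind. Names starting with "/" may directly follow
    # another token (e.g. "/Type/Page"), so they get none.
    prefix = rb"" if keyword.startswith(b"/") else rb"(?<![A-Za-z0-9])"
    # The lookahead keeps "/Page" from matching "/Pages".
    return re.compile(prefix + re.escape(keyword) + rb"(?![A-Za-z0-9])")


def count_keywords(data):
    """Return a dict mapping each keyword to its occurrence count."""
    return {
        keyword.decode("ascii"): len(_pattern(keyword).findall(data))
        for keyword in KEYWORDS
    }


if __name__ == "__main__":
    sample = (
        b"%PDF-1.4\n"
        b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R /Pages 3 0 R >>\nendobj\n"
        b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
    )
    for name, count in count_keywords(sample).items():
        print(name, count)
```

Each file would contribute one such row of counts, and the rows, paired with malicious or benign labels, would form the table on which a classifier such as Random Forest is trained.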
