Abstract

Traditional detection techniques for malicious PDF documents usually build a rule or feature library for specific vulnerabilities; they therefore suit only a single detection target and offer little resistance to anti-detection techniques. To address these shortcomings, we build a double-layer detection model for malicious PDF documents based on an entropy method with multiple features. First, we address the single-detection-target problem by fusing 222 features, including 130 basic features (such as objects, structure, content streams, and metadata) and 82 dangerous features (such as suspicious functions and encoding functions), which effectively resists obfuscation and encryption. Second, we generate the best feature set (153 features in total) by applying an entropy method based on RReliefF and MIC (EMBORAM) to PDF samples covering 37 typical document vulnerabilities, which effectively resists anti-detection methods such as data filling and imitation attacks. Finally, we build a double-layer processing framework that detects samples efficiently using an AdaBoost-optimized random forest algorithm and a robustness-optimized support vector machine algorithm. Compared with traditional static detection methods, the model performs better on various evaluation criteria. The average detection time per document is 1.3 ms, and the accuracy reaches 95.9%.
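
The abstract does not spell out how EMBORAM combines its two relevance measures, so the following is only a minimal sketch of the generic entropy weight method applied to feature selection, not the authors' implementation. It assumes that each feature already has two precomputed, non-negative relevance scores (standing in for RReliefF and MIC outputs); the feature names, the score values, and the helper names `entropy_weights`, `combined_scores`, and `select_top_k` are all hypothetical.

```python
import math


def entropy_weights(columns):
    """Entropy weight method: one weight per criterion (score column).

    columns: list of score lists, one list per criterion, each holding
    one non-negative score per feature (assumed precomputed elsewhere).
    """
    n = len(columns[0])
    divergences = []
    for col in columns:
        total = sum(col)
        probs = [v / total for v in col]
        # Normalized Shannon entropy in [0, 1]; 1 means the criterion
        # scores every feature identically and so carries no information.
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        divergences.append(max(0.0, 1.0 - entropy))
    norm = sum(divergences)
    if norm == 0:
        # All criteria are uninformative: fall back to equal weights.
        return [1.0 / len(columns)] * len(columns)
    return [d / norm for d in divergences]


def combined_scores(columns, weights):
    """Weighted sum of the column-normalized scores for each feature."""
    scores = [0.0] * len(columns[0])
    for col, w in zip(columns, weights):
        total = sum(col)
        for i, v in enumerate(col):
            scores[i] += w * v / total
    return scores


def select_top_k(names, scores, k):
    """Return the k feature names with the highest combined score."""
    ranked = sorted(zip(names, scores), key=lambda t: t[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical feature names and scores, for illustration only.
names = ["obj_count", "stream_len", "js_keyword", "meta_fields"]
relief_like = [0.90, 0.10, 0.50, 0.05]  # stand-in for RReliefF scores
mic_like = [0.80, 0.20, 0.60, 0.10]     # stand-in for MIC scores

weights = entropy_weights([relief_like, mic_like])
scores = combined_scores([relief_like, mic_like], weights)
selected = select_top_k(names, scores, k=2)
```

In this sketch, a criterion whose scores vary more across features receives a larger weight, and the final ranking keeps the features that both criteria rate highly. Reducing the feature pool to a fixed-size subset, as the abstract describes, would correspond to calling `select_top_k` with the desired size.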
