Abstract
As cyber threats continue to evolve, the accurate detection of malicious files has
become increasingly crucial. Traditional approaches often fall short due to limited
adaptability to diverse file types and a high incidence of false predictions. This study
addresses these gaps by systematically evaluating entropy-based features in conjunction
with machine learning (ML) models for malicious file detection. Using diverse file
types (documents, images, and compressed files), we analyzed each file's raw byte
stream, without any tokenization into chunks, to compute Shannon entropy (F1) and
Rényi entropy (α = 2, 4, 6) (F2). From these streams, we extracted statistical
measures (the 25th percentile, mean, and 75th percentile of F1 and F2) as key
features, which capture fine-grained entropy variations indicative of potential
malicious patterns. Files exceeding an entropy threshold of 7.95 were classified as
malware-infected, with the NapierOne dataset serving as the basis for analysis. To
minimize false negatives, we applied three popular ML models—Random Forest (RF),
Decision Tree (DT), and Naive Bayes (NB). Our findings reveal that image-based files
are particularly vulnerable to malware, while Shannon entropy combined with RF and
DT models effectively reduced false negative predictions. The RF model emerged as
the best-performing approach, achieving over 99% detection accuracy across all file
types. This research provides a novel integration of entropy-based features with ML
models, demonstrating their synergy for improving malware detection. By addressing
key limitations in previous approaches, this study lays the foundation for scalable,
accurate, and robust detection frameworks, paving the way for future advancements
in combating dynamic cybersecurity threats.
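
The byte-level entropy features described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the base-2 logarithm (entropy in bits per byte, maximum 8), and the choice to apply the 7.95 threshold to Shannon entropy are assumptions made here for illustration only.

```python
import math
from collections import Counter


def byte_probabilities(data: bytes) -> list[float]:
    """Empirical probability of each byte value that occurs in the stream."""
    counts = Counter(data)
    total = len(data)
    return [count / total for count in counts.values()]


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: H = -sum(p * log2(p))."""
    if not data:
        return 0.0
    return -sum(p * math.log2(p) for p in byte_probabilities(data))


def renyi_entropy(data: bytes, alpha: float) -> float:
    """Renyi entropy of order alpha in bits: log2(sum(p**alpha)) / (1 - alpha)."""
    if not data:
        return 0.0
    if alpha == 1:
        # The limit alpha -> 1 of the Renyi entropy is the Shannon entropy.
        return shannon_entropy(data)
    power_sum = sum(p ** alpha for p in byte_probabilities(data))
    return math.log2(power_sum) / (1 - alpha)


def entropy_features(data: bytes) -> dict[str, float]:
    """Shannon entropy (F1) and Renyi entropy for alpha = 2, 4, 6 (F2)."""
    features = {"shannon": shannon_entropy(data)}
    for alpha in (2, 4, 6):
        features[f"renyi_{alpha}"] = renyi_entropy(data, alpha)
    return features


# Assumption: the 7.95 threshold from the abstract is applied to Shannon entropy.
ENTROPY_THRESHOLD = 7.95


def exceeds_threshold(data: bytes, threshold: float = ENTROPY_THRESHOLD) -> bool:
    """Flag a byte stream whose Shannon entropy is above the threshold."""
    return shannon_entropy(data) > threshold
```

Under these assumptions, `entropy_features(open(path, "rb").read())` would return the four entropy values for one file (with `path` a hypothetical file path). A stream in which all 256 byte values are equally frequent reaches the maximum of 8 bits per byte, which is why a threshold of 7.95 singles out near-random content such as encrypted or compressed data.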