Detecting Malware with Information Complexity.

Nadia Alshahwan, David Clark, Héctor D Menéndez, George Danezis, Earl T Barr

doi:10.3390/e22050575


Open Access

https://doi.org/10.3390/e22050575


Journal: Entropy	Publication Date: May 20, 2020
Citations: 16	License type: CC BY 4.0

Affiliation: University College London, Middlesex University

Abstract

Malware concealment is the predominant strategy for malware propagation. Black hats create variants of malware based on polymorphism and metamorphism. Malware variants, by definition, share some information. Although the concealment strategy alters this information, patterns still remain in the software. Given a zoo of labelled malware and benign-ware, we ask whether a suspect program is more similar to our malware or to our benign-ware. Normalized Compression Distance (NCD) is a generic metric that measures the shared information content of two strings. This measure opens a new front in the malware arms race, one where the countermeasures promise to be more costly for malware writers, who must now obfuscate patterns as strings qua strings, without reference to execution, in their variants. Our approach classifies disk-resident malware with 97.4% accuracy and a false positive rate of 3%. We demonstrate that its accuracy can be improved by combining NCD with the compressibility rates of executables using decision forests, paving the way for future improvements. We demonstrate that malware reported within a narrow time frame of a few days is more homogeneous than malware reported over two years, but that our method still classifies the latter with 95.2% accuracy and a 5% false positive rate. Due to its use of compression, the time and computation cost of our method is nontrivial. We show that simple approximation techniques can improve its running time by up to 63%. We compare our results to the results of applying the 59 anti-malware programs used on the VirusTotal website to our malware. Our approach outperforms each one used alone and matches the performance of all of them used collectively.
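
As an illustration of the idea in the abstract, NCD is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed length of s. The sketch below is not the authors' implementation: the use of zlib as the compressor, the nearest-neighbour decision rule, and the names `compressed_size`, `ncd` and `closer_to_malware` are all assumptions made for the example.

```python
import zlib


def compressed_size(data: bytes) -> int:
    # C(s): length of s after compression. zlib is an assumed stand-in;
    # the paper discusses the choice of compressor separately.
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    # Normalized Compression Distance: near 0 when x and y share most of
    # their information, near 1 when they share almost none.
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def closer_to_malware(suspect: bytes, malware_zoo, benign_zoo) -> bool:
    # Hypothetical decision rule: compare the suspect's nearest labelled
    # malware sample with its nearest labelled benign sample.
    nearest_malware = min(ncd(suspect, m) for m in malware_zoo)
    nearest_benign = min(ncd(suspect, b) for b in benign_zoo)
    return nearest_malware < nearest_benign
```

With a real corpus, `malware_zoo` and `benign_zoo` would hold the raw bytes of labelled executables. Note that zlib's 32 KiB window limits how much shared content it can detect in large files, which is one reason the compressor matters.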

Highlights

The receiver operating characteristic illustrates the trade-off between True Positive (TP) and False Positive (FP) rates for all possible decision boundaries.
We investigated the performance of compressibility rates compared to Normalized Compression Distance (NCD) in classifying malware.
We showed that our approach competed very well with existing malware detection programs and was as good as any of them when applied to our data.
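
The receiver operating characteristic mentioned in the first highlight can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the function name `roc_points`, the convention that a higher score means "more malware-like", and the example scores are assumptions.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    scores: classifier outputs, higher meaning more malware-like (assumed).
    labels: 1 for malware, 0 for benign-ware; both classes must be present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for threshold in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is flagged as malware.
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and not l)
        points.append((fp / negatives, tp / positives))
    return points


# Four hypothetical samples: two malware (label 1), two benign (label 0).
print(roc_points([0.9, 0.8, 0.4, 0.2], [1, 0, 1, 0]))
```

Each point corresponds to one decision boundary; moving the threshold down raises the true positive rate at the cost of a higher false positive rate.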

Summary

Introduction

We found that when the method is applied to malware collected within a short time scale of a few days, it detects malware with 97.4% accuracy, which is rather astonishing from a standing start. To attain this 97.4% level, we augmented our technique with compressibility rates and combined these with NCD values using a decision forest classifier. This paper makes the following contributions: It conducts the first statistically rigorous, experimental evaluation of the ability of NCD to detect malware using only binary executables. It demonstrates that NCD, as used in our approach, is competitive with commercial anti-malware tools: it outperforms any single one and matches the performance of all of them together. Our work builds on earlier observational descriptive statistics to build and rigorously evaluate an NCD-based malware classifier.
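
To make the combination of compressibility rates and NCD values more concrete, here is a hedged sketch of how one feature row might be assembled before it is handed to a decision forest. The definition of compressibility rate as compressed size divided by original size, the zlib compressor, and the names `compressibility_rate` and `feature_row` are assumptions for illustration; the paper's exact features may differ.

```python
import zlib


def compressibility_rate(data: bytes) -> float:
    # Assumed definition: compressed size / original size.
    # Values near 0 mean highly compressible; values near (or slightly
    # above) 1 mean the data is close to incompressible.
    if not data:
        raise ValueError("cannot compute a rate for empty input")
    return len(zlib.compress(data, 9)) / len(data)


def feature_row(executable: bytes, ncd_values):
    # Hypothetical feature layout: the compressibility rate first,
    # followed by precomputed NCD values against reference samples.
    return [compressibility_rate(executable)] + list(ncd_values)
```

Rows built this way, one per executable, could then be passed together with their labels to any decision forest implementation.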

Background

Classifying Malware Using NCD

Choice of Compressor

Classifier

Lower Bound on NCD

Evading NCD

Evaluation

Corpus

Classifier Parametrisation

NCD Classifier

Compressibility Rate Classifier

Size of Malware Reporting Window

NCD Cost Reduction

Comparison to Antivirus Software

Feature Importance Analysis

Related Work

Detection

Classification

Other Approaches

Conclusions



Similar Papers

Assessing the Impact of Student Peer Review in Writing Instruction by Using the Normalized Compression Distance
Sayuri Yoshizawa ... Takao Terano
IEEE Transactions on Professional Communication | VOL. 55
01 Mar 2012

Gene ontology prediction using compression based distances and alignment scores on both amino acid sequence and secondary structure
Asli Filiz ... Zehra Cataltepe
01 Oct 2008

Brain functional connectivity at rest as similarity of neuronal activities
F Tecchio ... D Vitulano
Electroencephalography and Clinical Neurophysiology/Evoked Potentials Section | VOL. 127
10 Feb 2016

Contextual information retrieval based on algorithmic information theory and statistical outlier detection
Rafael Martinez ... David Camacho
01 May 2008
