Malware Detection using Opcode Trigram Sequence with SVM

Amr I Elkhawas, Nashwa Abdelbaki

doi:10.23919/softcom.2018.8555738

Abstract

Malicious software, also known as “malware”, is software that uses legitimate instructions or code to perform malicious actions. Malware poses a major threat to computer security and to information security in general. Over the years, malware has evolved to the point that a single specimen can have hundreds or even thousands of variants, which use polymorphic and metamorphic transformations to change their signature during propagation. Common signature-based detection methods are no longer robust enough to detect these variants because the code is altered. Static analysis is required to obtain these signatures, and anti-virus companies are required to propagate signature updates to their software. A faster detection method is needed to compensate for the exponentially increasing number of malware variants. Machine learning is a trending approach for malware detection. It removes the need to rely on signature-based detection and is also faster, so software companies do not need to propagate signatures as often. Machine learning algorithms using opcode sequences can recognise patterns in malicious code that are not present in common signatures and can classify samples more efficiently. Therefore, a machine learning approach should be adopted for faster and more efficient malware detection. Most research on malware detection using machine learning has used static attributes such as network connections, processes spawned, hashes, etc., which are not robust to changes. In this paper we introduce a novel approach that uses opcode trigrams and PE file attributes as features for malware detection. We take a text mining approach to make our detection method more robust to polymorphism and metamorphism. The instruction sequence for critical code at the assembly level is essentially the same across malware families, so we use opcode trigram sequences as the main feature for our machine learning algorithm.
We use a Support Vector Machine (SVM) as our classification algorithm. An SVM is a discriminative classifier that gives a definite decision on whether the predicted outcome belongs to the learned class or not. This approach enabled us to achieve higher detection rates with fewer features.
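
As an illustration of the opcode trigram features described in the abstract, the following is a minimal sketch rather than the authors' implementation. It assumes the opcodes have already been extracted from a disassembled binary into a list of mnemonics; the function names `opcode_trigrams` and `to_vector`, the sample sequence, and the relative-frequency normalisation are all hypothetical choices made for this example.

```python
from collections import Counter


def opcode_trigrams(opcodes):
    """Count every run of three consecutive opcodes (a sliding window of size 3)."""
    return Counter(zip(opcodes, opcodes[1:], opcodes[2:]))


def to_vector(counts, vocabulary):
    """Map trigram counts to relative frequencies in a fixed vocabulary order.

    Normalising by the total count is an assumption of this sketch,
    not something stated in the abstract.
    """
    total = sum(counts.values()) or 1  # avoid division by zero for short inputs
    return [counts[trigram] / total for trigram in vocabulary]


# Hypothetical opcode sequence standing in for one disassembled sample.
sample = ["push", "mov", "sub", "push", "mov", "sub", "call"]

counts = opcode_trigrams(sample)

# In practice the vocabulary would be built from the whole training set;
# here it is taken from the single sample for brevity.
vocabulary = sorted(counts)
vector = to_vector(counts, vocabulary)

for trigram, value in zip(vocabulary, vector):
    print(" ".join(trigram), round(value, 2))
```

Vectors of this kind, one per sample and labelled benign or malicious, could then be passed to an SVM implementation (for example `sklearn.svm.SVC` in scikit-learn, which is an assumption here and not a library named in the abstract).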

Full Text