Computer malware has greatly affected the security of computer networks as well as personal computer users. Signature-based detection cannot recognize obfuscated malware, because obfuscation techniques conceal the patterns that signatures rely on. Machine learning is therefore being explored and deployed in malware detection to withstand this threat. Many features are available for malware classification, e.g., text strings. Opcodes are an attractive alternative because their data size is relatively small compared with text strings. In this research, the significant opcodes of executable malware files, which refer to the content that remains prevalent from one malware generation to the next, are extracted as the training dataset. Several machine learning classifiers are built and compared in terms of classification accuracy and speed, and they are also compared with text string-based detection and signature-based detection. The findings show that machine learning detection performs more than twice as well as signature-based detection, and that a classifier trained on significant opcode features detects obfuscated malware over 10 times faster than one trained on text string features while still achieving up to 98% accuracy.
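As a rough illustration of the opcode-feature idea described above, the following minimal sketch selects opcodes that are prevalent across a set of malware samples and turns one sample into a frequency vector. It is not the authors' pipeline: the function names, the `min_share` prevalence threshold, and the assumption that each sample is already disassembled into a list of opcode mnemonics are all hypothetical choices made for this example.

```python
from collections import Counter


def opcode_frequencies(opcodes):
    """Return the relative frequency of each opcode in one sample."""
    counts = Counter(opcodes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {op: n / total for op, n in counts.items()}


def significant_opcodes(malware_samples, min_share=0.8):
    """Return opcodes present in at least `min_share` of the samples.

    `min_share` is an assumed stand-in for "prevalent across malware
    generations"; the source does not state a selection criterion.
    """
    presence = Counter()
    for sample in malware_samples:
        presence.update(set(sample))  # count each opcode once per sample
    threshold = min_share * len(malware_samples)
    return sorted(op for op, n in presence.items() if n >= threshold)


def feature_vector(opcodes, vocabulary):
    """Map one sample to frequencies over the selected opcode vocabulary."""
    freqs = opcode_frequencies(opcodes)
    return [freqs.get(op, 0.0) for op in vocabulary]


# Toy samples: each is a hypothetical, already-disassembled opcode sequence.
samples = [
    ["mov", "xor", "jmp", "xor"],
    ["xor", "jmp", "call"],
    ["jmp", "xor", "push"],
]
vocabulary = significant_opcodes(samples)
vector = feature_vector(samples[0], vocabulary)
```

Vectors of this kind could then be passed to any standard classifier. Restricting the vocabulary to a few prevalent opcodes keeps each vector short, which is consistent with the smaller data size that the abstract attributes to opcode features.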