Abstract

Malicious software (malware) becomes a serious threat to system security the moment any electronic gadget or 'Thing' is connected to the World Wide Web (WWW). Malware is stealthy software that is used to collect sensitive information; it gains access to private systems and can disrupt device operation. It therefore acts against the user's requirements and is a threat to all operating systems (OS), but more so to Windows and Android, as these are the most widely used. Malware developers try to invade a system by means of viruses, adware, spyware, ransomware, botware, Trojans, etc. They also apply various anti-forensic techniques so that the malware cannot be detected or investigated, in effect playing 'peekaboo' with malware investigators. As a result, investigating such attacks becomes more complex, and it often fails because of immature forensic methodology or a lack of appropriate tools. This chapter is a first step towards analysing malware. The process started with collecting a malware dataset and understanding it. Machine learning (ML) has two basic building blocks: feature extraction and classification. In supervised learning, the choice of features plays a significant role, so understanding the features and their effect on classification was a major task. Two separate experimental processes were explored. The first extracted n-grams from the binary files using the kfNgram tool, and the second used a shell script to parse the assembly files for method calls to external API libraries. Several supervised machine learning classifiers, such as Decision Trees, SVM, and Naive Bayes, were used to classify the malware family based on the extracted features. We propose a method to classify malware into nine families, as defined in the Kaggle dataset. It analyses the n-grams of a malware file to generate the feature vector. The value of 'n' in the n-gram is selectable; at present, it is four.
The objective was to extract highly probable n-grams from the binary files after pre-processing, i.e., after calculating the information gain (IG) of each n-gram. At present, the 500 top-ranked n-grams are selected. SVM and Decision Trees were observed to reach an accuracy of about 98%. Nevertheless, there is room for improvement, because the sequential selection of n-grams can admit irrelevant ones. This method is considered a starting point for malware classification.
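
The feature-selection step described above can be illustrated with a minimal sketch. It is not the chapter's implementation: the chapter used the kfNgram tool, whereas this sketch extracts byte n-grams directly in Python, assumes that IG means information gain computed on a binary present/absent feature, and uses hypothetical function names and toy byte strings in place of the Kaggle files.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> set:
    """Return the set of distinct byte n-grams that occur in one file."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def entropy(labels) -> float:
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(present, labels) -> float:
    """Information gain of a binary feature (n-gram present or absent)."""
    with_gram = [y for p, y in zip(present, labels) if p]
    without_gram = [y for p, y in zip(present, labels) if not p]
    total = len(labels)
    conditional = (len(with_gram) / total) * entropy(with_gram) \
        + (len(without_gram) / total) * entropy(without_gram)
    return entropy(labels) - conditional


def select_top_ngrams(files, labels, n: int = 4, k: int = 500) -> list:
    """Rank every n-gram seen in the corpus by IG and keep the top k."""
    gram_sets = [byte_ngrams(f, n) for f in files]
    vocabulary = set().union(*gram_sets)
    scored = [
        (information_gain([g in s for s in gram_sets], labels), g)
        for g in vocabulary
    ]
    # Highest IG first; ties are broken by byte order so the result is stable.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [g for _, g in scored[:k]]


def to_feature_vector(data: bytes, selected, n: int = 4) -> list:
    """Encode one file as 0/1 flags over the selected n-grams."""
    grams = byte_ngrams(data, n)
    return [1 if g in grams else 0 for g in selected]


# Toy corpus (hypothetical): two files per family, labelled "A" and "B".
files = [
    b"\x00\x01\x02\x03\x04",
    b"\x00\x01\x02\x03\x05",
    b"\xff\xfe\xfd\xfc\xfb",
    b"\xff\xfe\xfd\xfc\xfa",
]
labels = ["A", "A", "B", "B"]
selected = select_top_ngrams(files, labels, n=4, k=2)
vector = to_feature_vector(b"\x00\x01\x02\x03\x09", selected)
```

The resulting 0/1 vectors could then be passed to any supervised classifier, such as the Decision Tree or SVM models mentioned above. Scoring each n-gram independently keeps the sketch simple, but it does not account for redundancy between n-grams.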
