Abstract
Identification of malware is a critical problem in computer security. Many signature-identification, behavior-recognition, and reputation-based tools are available for host-based detection. However, so many files are present on systems today that checking all of them is time-consuming, and better methods are needed to suggest which files are of highest priority to check in partial scans. This work developed and tested local contextual clues to malware in the metadata of file systems on an international corpus of 248 million files on 3,961 drives. 398,949 hash values of malware were found in this corpus using five methods, and 3,681,211 hash values of non-malware were chosen for comparison using three methods. Malware identification rates were compared for the fifteen resulting combinations and were cross-correlated for different types of drives and file types. Results showed that different malware identification methods flag significantly different sets of files. The strength of particular local clues in file metadata (directory and file names, sizes, times, and hash values) was then assessed, and results were compared for the fifteen combinations. Some classic clues (e.g., rare file extensions and deletion status) were confirmed and others (e.g., double extensions and occurrence in the operating system) were not. Using these data, a program was implemented to estimate the likelihood that a given file was malware based solely on its metadata context. On three random subsets of our corpus, our methods gave 51 times better precision (fraction of malware in files identified as malware) with 70% better recall (fraction of malware detected) than the approach of inspecting executables alone. The methods also ran significantly faster than signature checking, and can be used before other kinds of malware analysis.
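The two evaluation measures defined above can be made concrete with a minimal sketch. It assumes files are identified by hash value and that both the flagged set and the known-malware set are available as Python sets; the function name `precision_recall` and the toy hash values are hypothetical illustrations, not the program or data described in this work.

```python
def precision_recall(flagged, malware):
    """Return (precision, recall) for a set of flagged files.

    precision = fraction of flagged files that are malware
    recall    = fraction of malware files that were flagged
    """
    flagged = set(flagged)
    malware = set(malware)
    true_positives = len(flagged & malware)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(malware) if malware else 0.0
    return precision, recall


# Hypothetical toy data: four known-malware hashes, four flagged hashes,
# three of which overlap.
known_malware = {"h1", "h2", "h3", "h4"}
flagged_files = {"h1", "h2", "h3", "h9"}

precision, recall = precision_recall(flagged_files, known_malware)
print(precision, recall)  # 0.75 0.75
```

Comparing two detection approaches then amounts to computing these two numbers for each approach on the same labeled subset and taking the ratio.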