Deobfuscating JavaScript Code Using Character-Based Tokenization

Alexandru-Gabriel Sîrbu,

doi:10.24193/subbi.2023.2.01

Abstract

The JavaScript code deployed goes through the process of minification, in which variables are renamed using single-character names and spaces are removed in order for the files to have a smaller size, thus loading faster. Because of this, the code becomes unintelligible, making it harder to be analyzed manually. Since JavaScript experts can under- stand it, machine learning approaches to deobfuscate the minified file are possible. Thus, we propose a technique that finds a fitting name for each obfuscated variable, which is both intuitive and meaningful based on the usage of that variable, based on a Sequence-to-Sequence model, which generates the name character by character to cover all the possible variable names. The proposed approach achieves an average exact name generation accuracy of 70.53%, outperforming the state-of-the-art by 12%. Keywords and phrases: JavaScript deobfuscation, variable name prediction, Deep Learning, Recurrent Neural Network, Abstract Syntax Tree.

Full Text