Abstract

This paper addresses the detection and classification of obfuscated JavaScript code using colored Abstract Syntax Trees (ASTs). Colors are assigned to AST vertices and edges according to the vertex types, which are determined by the lexical and syntactic structure of the program and by the language standard. The research proceeded in several stages. First, a dataset of non-obfuscated JavaScript programs was collected from public repositories. Second, obfuscated samples were generated with eight open-source obfuscators. Classifiers were then built using gradient boosting on decision trees (GBDT). We built two types of classifier. The first identifies which obfuscator produced a given sample. The second detects samples produced by an obfuscator whose output was not seen during training. The quality of the resulting models is on par with published results. The proposed feature engineering method requires no prior analysis of the obfuscators or of the obfuscating transformations. In the final part of the paper we analyze the quality of the models and discuss statistical properties of the obfuscated and non-obfuscated samples and of the corresponding colored ASTs. Analysis of the generated obfuscated samples shows that the proposed method has limitations. In particular, it has difficulty recognizing minifiers and other obfuscators that change the lexical structure of a program more than its syntax.
To improve detection of such transformations, one can build combined classifiers that use both AST coloring and additional information about lexemes and punctuation, for example the entropy of identifiers and strings, the proportion of upper- and lower-case characters, and the frequency of certain characters.
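
The lexical features mentioned at the end of the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the helper names `shannon_entropy` and `lexical_features`, the rough identifier regex, and the choice of the semicolon as the "certain character" are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character; 0.0 for empty input."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def lexical_features(source: str) -> dict:
    """Compute a few lexical statistics of a source string (hypothetical helper)."""
    # Rough identifier pattern (an assumption): it also matches keywords and
    # words inside string literals, which is acceptable for a sketch.
    identifiers = re.findall(r"[A-Za-z_$][A-Za-z0-9_$]*", source)
    letters = [c for c in source if c.isalpha()]
    upper = sum(1 for c in letters if c.isupper())
    return {
        "identifier_entropy": shannon_entropy("".join(identifiers)),
        "upper_ratio": upper / len(letters) if letters else 0.0,
        "semicolon_freq": source.count(";") / len(source) if source else 0.0,
    }
```

In a combined classifier, values like these would be appended to the AST-based feature vector. A real pipeline would more likely take identifiers and string literals from a JavaScript tokenizer than from a regular expression.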

Highlights

  • Obfuscation is widely used in information security, both to protect intellectual property when legitimate proprietary software is distributed and to conceal computer attacks and hinder the detection and reverse engineering of malware; however, to date there is no single, generally accepted and sufficiently rigorous …

  • Webshell detection based on random forest-gradient boosting decision tree algorithm // 3rd intern. conf. on data science in cyberspace: DSC 2018 (Guangzhou, China, June 18-21, 2018): Proc

  • This paper addresses the detection and classification of obfuscated JavaScript code using colored Abstract Syntax Trees (ASTs)


Summary

Research overview

When studying malicious code or analyzing obfuscated JavaScript code, researchers usually turn to the syntactic structure of programs. ASTs are used in various research projects devoted to machine-learning-based malware detection, obfuscation detection, and the analysis of obfuscated code. Since the number of distinct vertex colors in an AST is $m$, $m$ distinct functions $f_1, \ldots$ are defined. The function $f_{m+m^2+1}: A \to \mathbb{R}$ counts the number of colors used to color all vertices of the AST. The function $f_{m+m^2+2}: A \to \mathbb{R}$ counts the number of colors used to color all edges of the AST. To build a detector of obfuscated code, one must find a mapping $u: \mathbb{R}^{m+m^2+2} \to [0, 1]$ such that the mean value of the loss function tends to its minimum. The function $q: \mathbb{R}^{m+m^2+2} \to \{0, 1\}$ is defined for the feature vectors of programs from the generated set of obfuscated and non-obfuscated code samples. […] $C_k\} \to \{0, 1\}$ is defined for the feature vectors of programs from the generated set of obfuscated and non-obfuscated code samples. To build the mapping $v$, the authors propose gradient boosting on decision trees, with the cross-entropy function $k$ as the loss function.
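
A feature vector of length m + m^2 + 2 can be sketched in Python as follows. The summary does not say what the first m + m^2 functions compute. The sketch therefore assumes that the first m count the vertices of each color and the next m^2 count the edges of each (parent color, child color) pair. The palette `COLORS`, the helpers `walk` and `feature_vector`, and the hand-built tree are hypothetical and do not come from the paper.

```python
from collections import Counter

# Hypothetical palette: each AST node type is one vertex color (m = 4 here).
COLORS = ["Program", "VariableDeclaration", "Identifier", "Literal"]


def walk(node, parent=None):
    """Yield (parent_type, node_type) for every node of a nested-dict tree."""
    yield (parent["type"] if parent else None, node["type"])
    for child in node.get("children", []):
        yield from walk(child, node)


def feature_vector(tree, colors=COLORS):
    """Build a vector of length m + m^2 + 2 from a colored tree."""
    vertex_counts = Counter()
    edge_counts = Counter()
    for parent_type, node_type in walk(tree):
        vertex_counts[node_type] += 1
        if parent_type is not None:
            # Assumption: an edge color is the (parent color, child color) pair.
            edge_counts[(parent_type, node_type)] += 1
    features = [vertex_counts[c] for c in colors]                      # m entries
    features += [edge_counts[(p, c)] for p in colors for c in colors]  # m^2 entries
    features.append(len(vertex_counts))  # number of distinct vertex colors used
    features.append(len(edge_counts))    # number of distinct edge colors used
    return features


# Hand-built tree loosely mirroring `var x = 1;` (not real parser output).
tree = {
    "type": "Program",
    "children": [
        {
            "type": "VariableDeclaration",
            "children": [{"type": "Identifier"}, {"type": "Literal"}],
        }
    ],
}
vec = feature_vector(tree)
```

Vectors of this shape would then be passed to a gradient boosting classifier trained with a cross-entropy loss. In practice the tree would come from a JavaScript parser, not from a hand-written dictionary.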

Data collection and preprocessing
15. Gnirts
19. UglifyJS