A web application is a dynamic, interactive program that provides end users with information and services such as utility payments, online communication, e-learning, social networking, shopping, online banking, and income tax filing. Web applications have become a major target for attackers because of their accessibility, availability, and ubiquity. Web application vulnerabilities are hazardous for several reasons; among others, attackers can damage an organization's image and standing. Implementation flaws in a web application allow an attacker to inject user input that violates the intended syntactic structure of a query, or to inject malicious code. Among the various types of injection flaws, SQL injection (SQLI) is more prominent than XML injection, but both are common application-layer web attacks that allow the attacker to bypass security mechanisms, and the two are therefore ranked as the most common vulnerabilities. Hence, a methodology for detecting and evaluating both SQLI and XML injection vulnerabilities in web applications is the subject of this research. This work addresses the above-mentioned flaws and proposes an ensemble method to classify Structured Query Language injection vulnerabilities. We selected a benchmark dataset of 33,758 rows containing various types of SQL and XML injection attacks. The raw data is preprocessed to remove artifacts, and feature engineering is then performed using Natural Language Processing techniques to clean the data and extract six types of features: TF-IDF, Word2Vec, Skip-gram, Count Vectorizer, GloVe, and Continuous Bag of Words. Imbalanced data is handled using sampling techniques, and the best features are selected using four techniques: significance testing, PCA, Variance Threshold, and Sbest. The prepared data is provided to an ensemble model with two stages. Stage-2 accepts a URL from the user and detects the presence of vulnerabilities in its domains and subdomains.
Stage-1 comprises nine types of machine learning models: Multinomial, Gaussian, and Bernoulli Naive Bayes, Logistic Regression, Decision Tree, Random Forest, AdaBoost, and SVC with polynomial, RBF, and linear kernels. These models are trained on additional vectors, such as Google News and GloVe, to classify a new SQL or XML query for the presence or absence of a vulnerability. The proposed ensemble approach obtained an accuracy of 99%.
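As a rough illustration of the pipeline described above (tokenize a query, extract TF-IDF features, let several classifiers vote), the following minimal sketch may help. It is not the authors' implementation. The abstract does not state how the ensemble combines its models, so hard majority voting is an assumption here, and three small hand-written learners stand in for the nine models named in the abstract. All class and function names (`TinyTfidf`, `TinyNaiveBayes`, `hard_vote`, and so on) and the six training queries are hypothetical, made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(query):
    """Lowercase a query and split it into words, numbers, and single symbols."""
    return re.findall(r"[a-z_]+|\d+|[^\sa-z\d_]", query.lower())


def unit(vec):
    """Scale a sparse dict vector to unit L2 length (all-zero vectors stay zero)."""
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


class TinyTfidf:
    """TF-IDF with smoothed IDF, ln((1 + n) / (1 + df)) + 1, then L2 normalisation."""
    def fit(self, docs):
        n = len(docs)
        df = Counter(t for d in docs for t in set(tokenize(d)))
        self.idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
        return self

    def transform(self, doc):
        counts = Counter(t for t in tokenize(doc) if t in self.idf)
        return unit({t: c * self.idf[t] for t, c in counts.items()})


class TinyNaiveBayes:
    """Multinomial Naive Bayes on raw token counts with add-one smoothing."""
    def fit(self, docs, labels):
        self.prior = Counter(labels)  # class counts act as an unnormalised prior
        self.counts = {c: Counter() for c in self.prior}
        for d, y in zip(docs, labels):
            self.counts[y].update(tokenize(d))
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, doc):
        toks = [t for t in tokenize(doc) if t in self.vocab]  # skip unseen tokens
        def score(c):
            total = sum(self.counts[c].values()) + len(self.vocab)
            return math.log(self.prior[c]) + sum(
                math.log((self.counts[c][t] + 1) / total) for t in toks)
        return max(self.counts, key=score)


class CosineModel:
    """Base class: predict the label of the most cosine-similar reference vector."""
    def __init__(self, vec):
        self.vec = vec

    def predict(self, doc):
        q = self.vec.transform(doc)
        def sim(ref):  # dot product of unit vectors equals cosine similarity
            return sum(v * ref[0].get(t, 0.0) for t, v in q.items())
        return max(self.refs, key=sim)[1]


class NearestNeighbour(CosineModel):
    """References are the individual training vectors (1-NN)."""
    def fit(self, docs, labels):
        self.refs = [(self.vec.transform(d), y) for d, y in zip(docs, labels)]
        return self


class NearestCentroid(CosineModel):
    """References are one unit-length summed TF-IDF vector per class."""
    def fit(self, docs, labels):
        sums = {}
        for d, y in zip(docs, labels):
            sums.setdefault(y, Counter()).update(self.vec.transform(d))
        self.refs = [(unit(s), c) for c, s in sums.items()]
        return self


def hard_vote(models, doc):
    """Return the label predicted by the most base learners (assumed rule)."""
    return Counter(m.predict(doc) for m in models).most_common(1)[0][0]


# Made-up toy data, not the benchmark dataset: 1 = injection attempt, 0 = benign.
TRAIN = [
    ("' OR '1'='1' --", 1),
    ("admin' UNION SELECT password FROM users --", 1),
    ('<!DOCTYPE foo [<!ENTITY xxe SYSTEM "file:///etc/passwd">]>', 1),
    ("john.smith", 0),
    ("blue running shoes", 0),
    ("<name>Alice</name>", 0),
]
docs, labels = zip(*TRAIN)
vec = TinyTfidf().fit(docs)
models = [m.fit(docs, labels)
          for m in (TinyNaiveBayes(), NearestNeighbour(vec), NearestCentroid(vec))]
for query in ("' OR 'a'='a' --", "red shoes"):
    print(repr(query), "->", hard_vote(models, query))
```

With only six training rows the sketch says nothing about real-world accuracy; it only shows how the stages fit together. In practice the hand-written learners would likely be replaced by library implementations of the models listed in the abstract, preceded by the sampling and feature-selection steps it mentions.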