Comparison of Gain Ratio and Chi-Square Feature Selection Methods in Improving SVM Performance on IDS
An intrusion detection system (IDS) is a security technology designed to monitor a computer network or system, identify suspicious activity, and detect potential attacks or security breaches. Accuracy is critical in an IDS, because the response to any alert or activity the system generates must be precise and measurable. However, achieving high accuracy is difficult: the complexity of network environments and the diversity of attacks pose significant challenges in developing an IDS. Algorithms and optimization techniques therefore need to be considered to improve IDS accuracy. Support vector machine (SVM) is a data mining method that achieves high accuracy in classifying network data packet patterns. An optimal classification process, including one based on SVM, requires a feature selection stage. Feature selection is an essential step in the data preprocessing phase, and optimizing the input data can improve the performance of the SVM algorithm. This study therefore compares two feature selection algorithms, Information Gain Ratio and Chi-Square, and then classifies IDS data using the SVM algorithm. The outcome underlines the importance of selecting the right features when developing an effective IDS.
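As a rough illustration of the Chi-Square side of this comparison, the sketch below scores categorical features by the Pearson chi-square statistic against the class label and ranks them. It is a minimal standard-library example, not the study's implementation; the function name `chi_square` and the toy `protocol`, `flag`, and `label` records are hypothetical.

```python
from collections import Counter


def chi_square(feature_values, labels):
    """Pearson chi-square statistic between a categorical feature and the class."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))  # joint counts per (value, class)
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for c, c_count in label_totals.items():
            # Expected count if the feature and the class were independent.
            expected = f_count * c_count / n
            stat += (observed[(f, c)] - expected) ** 2 / expected
    return stat


# Hypothetical toy records, invented for illustration only.
protocol = ["tcp", "tcp", "udp", "udp", "tcp", "udp", "tcp", "udp"]
flag = ["S0", "SF", "S0", "SF", "S0", "SF", "S0", "SF"]
label = ["attack", "normal", "attack", "normal",
         "attack", "normal", "attack", "normal"]

scores = {
    "protocol": chi_square(protocol, label),
    "flag": chi_square(flag, label),
}
# Features with a larger statistic depend more strongly on the class.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

In this toy data `flag` determines the label exactly, so it ranks above `protocol`. A selection step would keep the top-ranked features and pass only those columns to the classifier.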
- Research Article
32
- 10.14355/ijcsa.2014.0304.02
- Jan 1, 2014
- International Journal of Computer Science and Application
With the growth of the Internet, there has been a tremendous increase in the number of attacks, and Intrusion Detection Systems (IDSs) have therefore become a mainstream part of information security. The purpose of an IDS is to help computer systems deal with attacks. An anomaly detection system builds a database of normal behaviour and raises an alert when deviations from that behaviour occur during an intrusion. Based on the source of data, IDSs are classified into host-based and network-based IDSs. In a network-based IDS, the individual packets flowing through the network are analyzed, whereas in a host-based IDS the activities on a single computer or host are analyzed. Feature selection in an IDS helps to reduce the classification time. In this paper, an IDS for detecting attacks effectively is proposed and implemented. For this purpose, a new feature selection algorithm, called the Optimal Feature Selection algorithm and based on Information Gain Ratio, is proposed and implemented. This feature selection algorithm selects an optimal number of features from the KDD Cup dataset. In addition, two classification techniques, Support Vector Machine and Rule Based Classification, are used for effective classification of the data set. The system is very efficient in detecting DoS attacks and effectively reduces the false alarm rate. The proposed feature selection and classification algorithms enhance the performance of the IDS in detecting attacks.
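For readers unfamiliar with the measure, the sketch below computes the standard information gain ratio (information gain divided by split information) for a categorical feature. It is a generic standard-library illustration and does not reproduce the paper's Optimal Feature Selection algorithm; the helper names and toy records are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of the feature divided by its split information."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature_values, labels):
        groups.setdefault(f, []).append(y)
    # Conditional entropy H(class | feature), weighted by group size.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    # A constant feature has zero split information and carries no signal.
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy records, invented for illustration only.
protocol = ["tcp", "tcp", "udp", "udp", "tcp", "udp", "tcp", "udp"]
flag = ["S0", "SF", "S0", "SF", "S0", "SF", "S0", "SF"]
label = ["attack", "normal", "attack", "normal",
         "attack", "normal", "attack", "normal"]

gr_flag = gain_ratio(flag, label)
gr_protocol = gain_ratio(protocol, label)
gr_constant = gain_ratio(["x"] * len(label), label)
print(gr_flag, gr_protocol, gr_constant)
```

Dividing by the split information penalizes features with many distinct values, which plain information gain tends to favor.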
- Conference Article
7
- 10.1109/ibssc47189.2019.8973103
- Jul 1, 2019
Malicious activities can harm the security of a system and must be prevented. Network traffic data can be monitored and analyzed using an intrusion detection system. Different data mining classification techniques are used to detect network attacks. Dimensionality reduction plays a key role in intrusion detection systems, since detecting anomalies is time-consuming. Recently, a lot of work has been done on feature selection, but most authors have modified the KDD99 test dataset. Modifying the training dataset is valid, but modifying the test dataset goes against machine learning ethics. This work covers some recently proposed feature selection algorithms, such as Information Gain, Gain Ratio and Correlation-based feature selection, with the objective of determining a reduced feature set. The performance is evaluated using combinations of any two feature selection techniques. This study proposes a new heuristic-based feature selection algorithm that uses a naive Bayes classifier to detect the important reduced feature set. The results are evaluated on the C4.5 decision tree classifier and compared with existing work. The evaluation shows that the proposed reduced feature set performs effectively and efficiently.
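The point about leaving the test set untouched can be sketched as follows: the feature subset is chosen from the training split alone, and the test split is only projected onto the chosen columns. This is an illustrative assumption about how such a pipeline might look, not the paper's heuristic; the scoring function `class_mean_gap` is a simple stand-in for Information Gain, Gain Ratio or CFS, and all names and numbers are hypothetical.

```python
def class_mean_gap(column, labels):
    """Stand-in relevance score: gap between the feature's mean in each class."""
    pos = [v for v, y in zip(column, labels) if y == 1]
    neg = [v for v, y in zip(column, labels) if y == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))


def fit_selector(train_rows, train_labels, k):
    """Choose the k best column indices using the training split only."""
    n_features = len(train_rows[0])
    scores = [
        class_mean_gap([row[j] for row in train_rows], train_labels)
        for j in range(n_features)
    ]
    top = sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)


def apply_selector(rows, selected):
    """Keep only the selected columns; rows are neither dropped nor relabeled."""
    return [[row[j] for j in selected] for row in rows]


# Hypothetical toy data: three numeric features, binary labels.
train_X = [[0.1, 5.0, 1.0], [0.2, 4.0, 0.9], [0.9, 5.0, 0.1], [0.8, 4.0, 0.2]]
train_y = [0, 0, 1, 1]
test_X = [[0.15, 4.5, 0.95], [0.85, 4.5, 0.15]]

selected = fit_selector(train_X, train_y, k=2)   # test data is never consulted
train_reduced = apply_selector(train_X, selected)
test_reduced = apply_selector(test_X, selected)
print(selected, test_reduced)
```

Because `fit_selector` never sees `test_X`, the reported test performance is not influenced by the selection step.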
- Book Chapter
- 10.1007/978-981-19-8338-2_7
- Jan 1, 2023
As a result of the rapid growth of the Internet and the increase in attacks, intrusion detection systems (IDS) have become an important part of information security. The purpose of an IDS is to help computer systems protect themselves from threats. When an intrusion occurs, an anomaly detection system consults its database of regular behaviour and alerts the user to deviations from normal behaviour. Host-based and network-based intrusion detection systems are distinguished by their data source. Individual packets that travel through the network are scanned in a network-based IDS, while the processes of a single computer or host are scanned in a host-based IDS. Feature selection in an IDS helps to reduce classification time, and an effective network intrusion detection system (NIDS) is designed and implemented in this paper. To achieve this goal, a proposed feature selection (PFS) method with bacterial forage optimization (BFO), based on information gain ratio, is presented and implemented. The proposed algorithm selects the optimal number of features from the NSL-KDD dataset. A support vector machine (SVM) is used to classify the data. As a result, the system is particularly successful in detecting DoS attacks and reducing the number of false alarms. With the proposed feature selection (PFS) with BFO and SVM classification techniques, attacks can be identified by the IDS. In terms of detection accuracy and time, the proposed feature selection algorithm (PFS) with BFO improves on some advanced feature selection algorithms.
- Research Article
3
- 10.52584/qrj.2001.07
- Jun 30, 2022
- Quaid-e-Awam University Research Journal of Engineering, Science & Technology
As the communication industry uses recent advances in network technology to link remote areas of the world, attackers and intruders have stepped up their attacks on networking infrastructure. System administrators can deploy intrusion detection tools and systems to thwart such efforts. In recent years, the use of machine learning (ML) techniques in intrusion detection systems (IDSs) has increased. One of the most popular ML techniques for intrusion detection is the Support Vector Machine (SVM), owing to its excellent generalization and its capacity to escape the curse of dimensionality. Recent studies have shown, however, that the number of dimensions still affects how well SVM-based intrusion detection systems work. The fact that SVM weights all data features equally has also caused some concern, since real intrusion detection datasets include many redundant or superfluous features; ideally, feature weights would be considered while training an SVM. The Knowledge Discovery in Databases (KDD) intrusion detection dataset offers labeled data for scientists and researchers; choosing the essential features or patterns from the input dataset makes the problem simpler and faster to solve and yields higher threat detection accuracy. Our work demonstrates the value of recognizing the essential input patterns in order to design a more efficient Intrusion Detection System (IDS). Removing irrelevant or unimportant inputs makes the problem of detecting a threat simpler, faster, and more accurate. Proper feature selection and ranking are therefore essential in intrusion detection; it is the only way to detect intrusions accurately and efficiently. We remove one feature at a time and run experiments on a Support Vector Machine (SVM) to grade the significance of each feature for the KDD dataset. It has been observed that SVM-based IDSs using fewer features can perform better and more efficiently.
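The remove-one-feature-at-a-time procedure can be sketched as below. To stay dependency-free, the example uses a nearest-centroid classifier as a stand-in for the SVM, so it shows only the ranking loop and not the paper's experiments; every name and number here is hypothetical.

```python
def nearest_centroid_accuracy(train_X, train_y, test_X, test_y):
    """Stand-in classifier (not an SVM): assign each row to the closest class centroid."""
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x):
        # Squared Euclidean distance; ties go to the smallest label.
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    hits = sum(predict(x) == y for x, y in zip(test_X, test_y))
    return hits / len(test_y)


def drop_column(rows, j):
    """Return a copy of the rows with feature j removed."""
    return [row[:j] + row[j + 1:] for row in rows]


def rank_by_removal(train_X, train_y, test_X, test_y):
    """Accuracy drop caused by removing each feature in turn."""
    baseline = nearest_centroid_accuracy(train_X, train_y, test_X, test_y)
    drops = {}
    for j in range(len(train_X[0])):
        acc = nearest_centroid_accuracy(
            drop_column(train_X, j), train_y, drop_column(test_X, j), test_y
        )
        drops[j] = baseline - acc
    return baseline, drops


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
train_X = [[0.0, 1.0], [0.2, 3.0], [1.0, 1.0], [0.8, 3.0]]
train_y = [0, 0, 1, 1]
test_X = [[0.1, 2.5], [0.3, 1.5], [0.9, 2.5], [0.7, 1.5]]
test_y = [0, 0, 1, 1]

baseline, drops = rank_by_removal(train_X, train_y, test_X, test_y)
ranking = sorted(drops, key=drops.get, reverse=True)
print(baseline, drops, ranking)
```

A feature whose removal causes a large accuracy drop is graded as significant; one whose removal changes nothing is a candidate for elimination.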
- Research Article
183
- 10.1016/j.jretconser.2015.07.003
- Jul 16, 2015
- Journal of Retailing and Consumer Services
A hybrid data mining model of feature selection algorithms and ensemble learning classifiers for credit scoring
- Conference Article
30
- 10.1109/icirca51532.2021.9544717
- Sep 2, 2021
An Intrusion Detection System (IDS) is one of the most important security tools for the many security issues prevailing in today's cyber world. An IDS is designed to scan system applications and network traffic to detect suspicious activities and to issue an alert when one is discovered. Many machine learning techniques are available for intrusion detection. The main objective of this project is to apply machine learning algorithms to the data set and to compare and evaluate their performance. The proposed application uses the SVM (Support Vector Machine) and ANN (Artificial Neural Network) algorithms to detect intrusions. Each algorithm is used to determine whether the requested data is authorized or contains anomalies. If the IDS finds malicious information while scanning the requested data, it drops that request. Correlation-based and Chi-Squared-based feature selection algorithms are used to reduce the dataset by eliminating useless data. The models are trained and tested on the preprocessed dataset, which increases the prediction accuracy. The NSL KDD dataset is used for the experiments. Finally, an accuracy of about 48% is achieved by the SVM algorithm and 97% by the ANN algorithm. Hence, the ANN model works better than the SVM on this dataset.
- Research Article
109
- 10.1016/j.cose.2009.01.001
- Jan 17, 2009
- Computers & Security
Building lightweight intrusion detection system using wrapper-based feature selection mechanisms
- Research Article
6
- 10.1007/s13369-016-2112-8
- Mar 28, 2016
- Arabian Journal for Science and Engineering
The dynamic nature of MANET makes it susceptible to several security breaches. A system that observes these kinds of unwanted activities is known as an intrusion detection system (IDS). An IDS is responsible for alerting the network whenever a threat is observed. This paper broadens the scope of IDS by also considering intrusion response. The proposed work is organized into several phases: feature selection, trust degree computation, classification and decision making. Intelligent agents are employed to handle all of these phases. The features of KDD Cup ’99 are reduced from 41 to 17 to minimize the training time and to improve the accuracy of the system. Feature selection is achieved by information gain ratio. The trust degree is computed from a combination of the packet delivery ratio, behavior and available energy of a node. The trust degree parameters are vital elements in the classification and decision-making phases. An extreme learning machine (ELM) is employed as the classifier to categorize nodes as trustworthy, partially trustworthy or malicious. The performance of the system is evaluated in different scenarios, such as with/without feature selection and with/without trust degree computation, with respect to detection accuracy, misclassification rate and detection time. The classification accuracy of SVM, MLP, ELM and ELM with trust is also compared.
- Conference Article
4
- 10.1109/icnwc57852.2023.10127442
- Apr 5, 2023
Traffic classification is an automated technique that divides computer network traffic into several categories depending on factors such as protocol or port number. In a complex environment, traffic classification is an important tool for network and system security. An intrusion detection system is a monitoring system that looks for abnormal activity and sends out notifications. To safeguard a system from network-based attacks, Network Intrusion Detection Systems (NIDS) play a crucial role in monitoring and analyzing network traffic. The many types of intrusion detection systems (IDS) include active and passive IDS, network intrusion detection systems (NIDS), host intrusion detection systems (HIDS), knowledge-based (signature-based) IDS, and behavior-based (anomaly-based) IDS. A passive IDS is designed only to monitor and analyze network traffic behaviour and notify an operator of potential vulnerabilities and attacks, whereas an active IDS is also known as an Intrusion Detection and Prevention System. A network-based intrusion detection system (NIDS) identifies malicious traffic on a network. A host-based IDS monitors system activity and looks for indications of abnormal behaviour. For networks with unidentified traffic, an intrusion detection system designed using flow and payload statistical characteristics and a clustering approach needs additional clusters. Present intrusion detection systems, however, are affected by false alarm rate, poor detection rate, imbalanced datasets and response time, which lead to misclassification of intrusions in various scenarios. Hence, there is a need for an automated intrusion detection system that works well in different scenarios. The proposed system uses supervised and unsupervised intrusion detection and classification methods to increase the classification accuracy.
To categorize the intrusions, dimensionality reduction strategies are used in conjunction with the classification procedure of logistic regression. Performance of intrusion detection system using PCA as dimensionality reduction algorithm has been evaluated with different classifiers such as Logistic Regression (LR), K-Nearest Neighbors (K-NN), Random Forest (RF), Support Vector Machine (Kernel SVM), Decision Tree (DT) using CIC IDS 2022 dataset. An automated way to detect intrusions has been proposed with cluster formation using adaptive weight butterfly optimization algorithm.
- Conference Article
14
- 10.1109/spin.2018.8474283
- Feb 1, 2018
Increasing attacks in the internet domain have led to the need for intrusion detection systems (IDS), and many researchers have been trying to improve the performance of IDS. Network-based IDS tries to detect intrusion using network-based features. The KDD 1999 dataset, which has 41 features, has been widely used as a benchmark intrusion detection dataset. This work tries to extract a subset of the 41 features without degrading the performance of the IDS. For dimensionality reduction, this work uses the Information Gain (IG), Gain Ratio (GR) and Correlation-based feature selection algorithms. This work also proposes a heuristic-based dimensionality reduction approach to further improve the performance of the aforementioned feature selection algorithms.
- Research Article
90
- 10.1109/access.2020.2994931
- Jan 1, 2020
- IEEE Access
Machine learning techniques are becoming mainstream in intrusion detection systems as they allow real-time response and have the ability to learn and adapt. By using a comprehensive dataset with multiple attack types, a well-trained model can be created to improve anomaly detection performance. However, high-dimensional data present a significant challenge for machine learning techniques. Processing similar features that provide redundant information increases the computational time, which is a critical problem especially for users with constrained resources (battery, energy). In this paper, we propose two models for intrusion detection and classification, the Trust-based Intrusion Detection and Classification System (TIDCS) and the Trust-based Intrusion Detection and Classification System-Accelerated (TIDCS-A), for secure networks. TIDCS reduces the number of features in the input data based on a new algorithm for feature selection. Initially, the features are grouped randomly, to increase the probability that they participate in the generation of different groups, and sorted based on their accuracy scores. Only the high-ranked features are then selected to obtain a classification for any packet received from the nodes in the network, which is saved as part of the node's past performance. TIDCS proposes a periodic system cleansing in which trust relationships between participant nodes are evaluated and renewed periodically. TIDCS-A proposes a dynamic algorithm to compute the exact time for node cleansing states and restricts the exposure window of the nodes. The final classification decision for both models is estimated by incorporating the node's past behavior with the machine learning algorithm. Any detected attack reduces the trustworthiness of the nodes involved, leading to a dynamic system cleansing.
An evaluation of TIDCS and TIDCS-A using the NSL-KDD and UNSW datasets shows that both models can detect malicious behaviors with higher accuracy and detection rates and lower false alarm rates than state-of-the-art techniques. For instance, for the UNSW dataset, the detection accuracy is 91% for TIDCS, 83.47% using online AODE, 88% for CADF, 90% for EDM, 90% for TANN and 69.6% for NB. Consequently, TIDCS performs better than state-of-the-art techniques in terms of detection accuracy, while providing good detection and false alarm rates.
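The three measures named in this evaluation (accuracy, detection rate, false alarm rate) follow directly from the confusion matrix of a binary detector. The sketch below shows the usual definitions; the function name `ids_metrics` and the toy predictions are hypothetical and are unrelated to the figures reported above.

```python
def ids_metrics(y_true, y_pred, attack=1):
    """Accuracy, detection rate and false alarm rate for binary IDS output."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == attack and p == attack for t, p in pairs)  # attacks caught
    fn = sum(t == attack and p != attack for t, p in pairs)  # attacks missed
    fp = sum(t != attack and p == attack for t, p in pairs)  # false alarms
    tn = sum(t != attack and p != attack for t, p in pairs)  # normal passed
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Detection rate is the recall on the attack class.
        "detection_rate": tp / (tp + fn) if tp + fn else 0.0,
        # False alarm rate is the share of normal traffic flagged as attack.
        "false_alarm_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Hypothetical labels: 1 = attack, 0 = normal.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = ids_metrics(y_true, y_pred)
print(metrics)
```

Reporting all three together matters on imbalanced traffic, where a detector can reach high accuracy while still missing most attacks.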
- Research Article
1
- 10.14569/ijacsa.2024.0150682
- Jan 1, 2024
- International Journal of Advanced Computer Science and Applications
Applications of the Internet of Things (IoT) are becoming increasingly popular. Network security and privacy are major concerns in the IoT, as many IoT devices are connected to the network via the Internet, making IoT networks more vulnerable to various cyber-attacks. An Intrusion Detection System (IDS) is a solution for dealing with security and privacy issues by protecting IoT networks from different types of attacks. In this paper, we provide a taxonomy of IDS in IoT. Different Machine Learning (ML) classifiers, feature selection models, and datasets with high detection accuracy are presented. Our analysis indicates a heightened emphasis on ML-based IDS, with Support Vector Machines (SVMs) at 33% and RFs at 31% being the most widely used classifiers. Despite the diversity of datasets used for IDS, NSL-KDD is the most common, appearing in 49% of studies. In the realm of feature selection, the K-means and SMO algorithms emerge with an impressive 99.33%, marking the highest percentage in previous research on feature selection for ML-based IDS. Moreover, we address the future pathways and challenges of IDS detection.
- Research Article
22
- 10.4015/s1016237221500204
- Mar 9, 2021
- Biomedical Engineering: Applications, Basis and Communications
Breast cancer is a common cancer in women. Accurate and early detection of breast cancer can play a vital role in treatment. This paper presents and evaluates a thermogram-based Computer-Aided Detection (CAD) system for the detection of breast cancer. In this CAD system, the Random Subset Feature Selection (RSFS) algorithm, a hybrid of the minimum Redundancy Maximum Relevance (mRMR) algorithm with RSFS, and a hybrid of the Genetic Algorithm (GA) with RSFS are used for feature selection. In addition, the Support Vector Machine (SVM) and k-Nearest Neighbors (kNN) algorithms are used as classifiers. The proposed CAD system is verified using MATLAB 2017 and a dataset composed of breast images from 78 patients. The implementation results show that using the RSFS algorithm for feature selection with the kNN and SVM classifiers yields accuracies of 85.36% and 75%, and sensitivities of 94.11% and 79.31%, respectively. Using the hybrid GA and RSFS algorithm for feature selection with the kNN and SVM classifiers yields accuracies of 83.87% and 69.56%, and sensitivities of 96% and 81.81%, respectively, and using the hybrid mRMR and RSFS algorithm with the kNN and SVM classifiers yields accuracies of 77.41% and 73.07%, and sensitivities of 98% and 72.72%, respectively.
- Research Article
2
- 10.4274/tjo.galenos.2022.36724
- Jun 1, 2023
- Turkish Journal of Ophthalmology
To analyze the effect of macular choroidal thickness (MCT) and peripapillary choroidal thickness (PPCT) on the classification of obese and healthy children by comparing the performance of the random forest (RF), support vector machine (SVM), and multilayer perceptrons (MLP) algorithms. Fifty-nine obese children and 35 healthy children aged 6 to 15 years were studied in this prospective comparative study using optical coherence tomography. MCT and PPCT were measured at distances of 500 μm, 1,000 μm, and 1,500 μm from the fovea and optic disc. Three different feature selection algorithms were used to determine the most prominent features of all extracted features. The classification efficiency of the extracted features was analyzed using the RF, SVM, and MLP algorithms, demonstrating their efficacy for distinguishing obese from healthy children. The precision and reliability of measurements were assessed using kappa analysis. The correlation feature selection algorithm produced the most successful classification results among the different feature selection methods. The most prominent features for distinguishing the obese and healthy groups from each other were PPCT temporal 500 μm, PPCT temporal 1,500 μm, PPCT nasal 1,500 μm, PPCT inferior 1,500 μm, and subfoveal MCT. The classification rates for the RF, SVM, and MLP algorithms were 98.6%, 96.8%, and 89%, respectively. Obesity has an effect on the choroidal thicknesses of children, particularly in the subfoveal region and the outer semi-circle at 1,500 μm from the optic disc head. Both the RF and SVM algorithms are effective and accurate at classifying obese and healthy children.
- Research Article
473
- 10.1016/j.jnca.2011.01.002
- Jan 14, 2011
- Journal of Network and Computer Applications
Mutual information-based feature selection for intrusion detection systems