The impact of using large training data set KDD99 on classification accuracy

Abstract

This study investigates how using a large data set affects supervised machine learning classifiers in the domain of Intrusion Detection Systems (IDS). To investigate this effect, 12 machine learning algorithms were applied: (1) AdaBoost, (2) Bayesian Nets, (3) Decision Tables, (4) Decision Trees (J48), (5) Logistic Regression, (6) Multi-Layer Perceptron, (7) Naive Bayes, (8) OneRule, (9) Random Forests, (10) Radial Basis Function Neural Networks, (11) Support Vector Machines (two different training algorithms), and (12) ZeroR. A well-known IDS benchmark data set, KDD99, was used to train and test the classifiers. The full KDD99 training data set contains 4.9 million instances, while the full test data set contains 311,000 instances. In contrast to similar previous studies, which used 0.08%–10% of the data for training and 1.2%–100% for testing, this study uses the full training data set and the full test data set. The Weka Machine Learning Toolbox was used for modeling and simulation. The performance of the classifiers was evaluated using standard binary performance metrics: Detection Rate, True Positive Rate, True Negative Rate, False Positive Rate, False Negative Rate, Precision, and F1-Rate. To show the effects of data set size, the classifiers were also evaluated using the following hardware metrics: Training Time, Working Memory, and Model Size. The test results show improvements in the classifiers on the standard performance metrics compared to previous studies.
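The binary metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the study's Weka workflow: the function name `binary_metrics` and the confusion-matrix counts are hypothetical, and the textbook definitions of each rate are assumed. Detection Rate is left out because the abstract does not say how it differs from True Positive Rate.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary rates from confusion-matrix counts.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives, and false negatives (attack = positive class
    is an assumption made for this illustration).
    """
    def ratio(num, den):
        # Guard against empty classes instead of raising ZeroDivisionError.
        return num / den if den else 0.0

    tpr = ratio(tp, tp + fn)          # True Positive Rate (recall)
    tnr = ratio(tn, tn + fp)          # True Negative Rate (specificity)
    fpr = ratio(fp, fp + tn)          # False Positive Rate = 1 - TNR
    fnr = ratio(fn, fn + tp)          # False Negative Rate = 1 - TPR
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * tpr, precision + tpr)
    return {
        "tpr": tpr, "tnr": tnr, "fpr": fpr, "fnr": fnr,
        "precision": precision, "f1": f1,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
example = binary_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

Because FPR and FNR are the complements of TNR and TPR, reporting all four is redundant in a strict sense, but listing them side by side makes the trade-off between missed attacks and false alarms easier to read.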

Similar Papers
  • Research Article
  • Cite Count Icon 39
  • 10.1109/access.2022.3182818
Application Layer DDoS Attack Detection Using Cuckoo Search Algorithm-Trained Radial Basis Function
  • Jan 1, 2022
  • IEEE Access
  • Hakem Beitollahi + 2 more

In an application-layer distributed denial of service (App-DDoS) attack, zombie computers bring down the victim server with valid requests. Intrusion detection systems (IDS) cannot identify these requests since they have legal forms of standard TCP connections. Researchers have suggested several techniques for detecting App-DDoS traffic. There is, however, no clear distinction between legitimate and attack traffic. In this paper, we go a step further and propose a Machine Learning (ML) solution by combining the Radial Basis Function (RBF) neural network with the cuckoo search algorithm to detect App-DDoS traffic. We begin by collecting training data and cleaning them, then normalizing the data and finding an optimal subset of features using the Genetic Algorithm (GA). Next, an RBF neural network is trained on the optimal subset of features, with the cuckoo search algorithm as the optimizer. Finally, we compare our proposed technique to the well-known k-nearest neighbor (k-NN), Bootstrap Aggregation (Bagging), Support Vector Machine (SVM), Multi-layer Perceptron (MLP), and Recurrent Neural Network (RNN) methods. Our technique outperforms previous standard and well-known ML techniques as it has the lowest error rate according to error metrics. Moreover, according to standard performance metrics, the results of the experiments demonstrate that our proposed technique detects App-DDoS traffic more accurately than previous techniques.

  • Peer Review Report
  • 10.7554/elife.85145.sa1
Decision letter: Statistical inference reveals the role of length, GC content, and local sequence in V(D)J nucleotide trimming
  • Jan 31, 2023
  • Thierry Mora

Local sequence context, length, and GC nucleotide content in both directions of the trimming site, together, are highly predictive of V(D)J trimming probabilities for both TR and IG adaptive immune receptor loci.

  • Research Article
  • Cite Count Icon 10
  • 10.1109/tase.2020.3035291
Parameter Identification for Bernoulli Serial Production Line Model
  • Nov 25, 2020
  • IEEE Transactions on Automation Science and Engineering
  • Yuting Sun + 3 more

Model-based analysis of production systems is one of the main areas in manufacturing research. The foundation of the successful application of these theoretical studies is the availability of valid and high-fidelity mathematical models that are capable of capturing the behavior of job flow in production systems. The modeling process of a production system, however, may require a significant amount of nonstandardized work that can only be done properly by someone with solid training in the area and extensive experience through real case studies. This poses a critical challenge in the effective implementation of these valuable theoretical results in the Industry 4.0 era. To overcome this, we propose a new production systems modeling paradigm inspired by system identification: calculate production system model parameters that best match the standard system performance metrics measured on the factory floor. Specifically, in this article, we consider production lines characterized by the Bernoulli serial line model and develop algorithms that identify model parameters to fit the system throughput and work-in-process. Analytical algorithms are derived to solve this problem in a two-machine line case and then extended to multi-machine lines. The accuracy and computational efficiency of the algorithms are demonstrated through extensive numerical experiments. Note to Practitioners: A high-fidelity mathematical model is of critical importance to the implementation of any model-based production system analysis method. Currently, the construction of such models is carried out in an ad hoc manner. The quality of the resulting models may heavily depend on the training, experience, intuition, and personal preference of the modeler. The proposed model parameter identification method focuses on standard key performance indices commonly measured on the factory floor.
The advantage is twofold. First, these standard performance metrics are consistently defined regardless of industry, thus avoiding any data-ambiguity issue that may occur when using complex machine/equipment status data. Second, measuring these performance metrics in real time is typically convenient and cost effective, even for manufacturing plants without high-end IT infrastructure, thus making the technology accessible to not only large but also small- and mid-sized manufacturers. Using the algorithms developed in this article, a practitioner can quickly construct a serial production line model and then utilize it to access the rich library of production analysis, design, and control methods available in the literature.

  • Research Article
  • Cite Count Icon 32
  • 10.1007/s00034-018-0880-y
An Efficient QRS Complex Detection Using Optimally Designed Digital Differentiator
  • Jun 21, 2018
  • Circuits, Systems, and Signal Processing
  • Chandan Nayak + 3 more

Heart rate variability (HRV) analysis is considered a preliminary diagnosis method to check the cardiac health of the human heart. The reliability of the HRV analysis system depends solely on the accuracy of the QRS complex detector. Hence, in this paper, an optimally designed digital differentiator (DD) for precise detection of the QRS complex is proposed. The proposed DD is designed by using an efficient evolutionary optimization technique called the gases Brownian motion optimization (GBMO) algorithm and is used in the preprocessing stage of the QRS detector. In the GBMO algorithm, a balanced trade-off is maintained between the exploration and exploitation phases to find the global optimum solution. The electrocardiogram signal is preprocessed by using the proposed DD to generate the feature signals corresponding to the R-peaks only. The detection technique utilizes the principle of the Hilbert transform and zero-crossing detection. The proposed approach is verified against all the first channel records of the MIT/BIH arrhythmia database by considering the standard QRS detection performance metrics and produces a sensitivity (Se) of 99.92%, positive predictivity (+P) of 99.92%, detection error rate (DER) of 0.1562%, QRS detection rate of 99.92%, accuracy (Acc) of 99.84%, and F score of 0.9992%. With respect to the standard performance metrics, the proposed QRS detector outperforms all the recently reported QRS detection techniques.

  • Research Article
  • Cite Count Icon 7
  • 10.26355/eurrev_202211_30373
Role of machine learning algorithms in predicting the treatment outcome of uterine fibroids using high-intensity focused ultrasound ablation with an immediate nonperfused volume ratio of at least 90.
  • Nov 1, 2022
  • European review for medical and pharmacological sciences
  • E Akpinar + 5 more

This study aimed to investigate the role of machine learning (ML) classifiers to determine the most informative multiparametric (mp) magnetic resonance imaging (MRI) features in predicting the treatment outcome of high-intensity focused ultrasound (HIFU) ablation with an immediate nonperfused volume (NPV) ratio of at least 90%. Seventy-three women who underwent HIFU treatment were divided into groups A (n=47) and B (n=26), comprising patients with an NPV ratio of at least 90% and <90%, respectively. An ensemble feature ranking model was introduced based on the score values assigned to the features by five different ML classifiers to determine the most informative mpMRI features. The relationship between the mpMRI features and the immediate NPV ratio of 90% was evaluated using Pearson's correlation coefficients. The diagnostic ability of the ML classifiers was evaluated using standard performance metrics, including the area under the receiver operating characteristic curve, accuracy, sensitivity, and specificity in eight folds cross-validation. For all the 12 most informative features, the area under receiver operating characteristic curve (AUROC), accuracy, specificity, and sensitivity ranged from 0.5 to 0.97, 0.34 to 0.97, 0.56 to 1.0, and 0.87 to 1.0, respectively. The gradient boosting (GBM) classifier demonstrated the best predictive performance with an AUROC of 0.95 and accuracy of 0.92, followed by the random forest, AdaBoost, logistic regression, and support vector classifiers, which yielded an AUROC of 0.92, 0.92, 0.83, and 0.78 and accuracy of 0.96, 0.88, 0.84, and 0.84, respectively. GBM had the best classifier performance with the best performing features from each mpMRI group, Ktrans ratio of the fibroid to the myometrium, the ratio of area under the curve of the fibroid to the myometrium, subcutaneous fat thickness, the ratio of apparent diffusion coefficient value of fibroid to the myometrium, and T2-signal intensity of the fibroid. 
The preliminary findings of this study suggest that the most informative and best performing features from each mpMRI group should be considered for predicting the treatment outcome of HIFU ablation to achieve an immediate NPV ratio of 90%.

  • Conference Article
  • Cite Count Icon 1
  • 10.2118/189811-ms
Evaluating Human-Machine Interaction for Automated Drilling Systems
  • Mar 13, 2018
  • A Farhangfar + 2 more

The efficient utilization of automation systems necessitates a clear understanding of the interaction of the human operator, the automation system, and any automated routines being run. If automated routines perform actions not desirable to the human operator, time is lost as the routine is interrupted and human control re-engaged. In addition, automatic handoff back to the human operator, due both to human intervention and to exit conditions or anomalies, must also be managed. Activity data from rigs across North America are analyzed to understand automation process utilization and interrupt timing. Real-time and historical data are tagged, either automatically, semi-automatically using machine learning, or manually, to create a minute-by-minute timeline of rig operations. Operations are then classified both by operation (steering, reaming, making hole, etc.) and by well plan to understand how operational demands change automation system utilization. This results in a new set of metrics that can be used to precisely quantify the performance of both the human and the automated drilling systems. Performance of the automation system is found to be a strong function of hole deviation, with the system outperforming during simple operations and in the vertical hole, but with reduced performance in the curve and horizontal sections due to frequent interruption of certain tasks. It is found that standard performance metrics, such as slip to slip or weight to weight, are affected by standard practices, and if these metrics are used to grade system performance, these practices must be accounted for. This paper presents a detailed investigation of the interaction of the driller with an automated drilling system and lays out the utilization of the automation system as a function of rig operations and well path. It is specially noted that standard performance metrics must consider standard practices, which may differ between operations.

  • Research Article
  • 10.47772/ijriss.2026.100300368
Cultural Bias in Machine Learning Systems: A Philosophical and Empirical Study of Algorithmic Knowledge Production
  • Jan 1, 2026
  • International Journal of Research and Innovation in Social Science
  • Nabulongo Ali + 3 more

Machine learning systems are increasingly functioning as epistemic infrastructures in high-stakes domains such as criminal justice, healthcare, finance, and employment. Despite this, their outputs are frequently treated as objective and neutral forms of knowledge. This study advances a synthesis of empirical and philosophical inquiry into cultural bias in machine learning, arguing that algorithms operate as sociotechnical agents embedded within historically situated structures of power and representation. Using the COMPAS Recidivism dataset (N = 7,214), a quantitative experimental design was employed to examine predictive disparities across protected attributes, specifically race and sex. Logistic Regression and Random Forest models were implemented within a controlled preprocessing pipeline and evaluated using standard performance metrics (accuracy, precision, recall, and F1-score), alongside subgroup fairness measures including false positive rates (FPR), false negative rates (FNR), and disparate impact ratios. To ensure robustness, subgroup disparities were further assessed using statistical significance testing. While overall model performance was moderate in aggregate metrics, subgroup analysis revealed consistent and structured disparities: African-American defendants exhibited elevated false positive rates, whereas females and underrepresented racial groups experienced disproportionately high false negative rates. These patterns persisted across model architectures, indicating that bias is structurally embedded in the data rather than solely a function of model design. However, extreme subgroup values should be interpreted with caution due to potential sample size imbalances within certain demographic categories. 
The findings challenge the assumption of epistemic neutrality in algorithmic systems, demonstrating that machine learning models participate in the cultural production of knowledge by reproducing historically grounded classifications and power asymmetries. The study argues that algorithmic outputs should be evaluated not only in terms of predictive performance but also through fairness-aware and context-sensitive frameworks that account for their broader ethical and epistemological implications.
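The subgroup measures mentioned in this abstract (per-group false positive rate, false negative rate, and a disparate impact ratio) can be sketched as follows. This is an illustration only, not the authors' pipeline: the function `group_rates`, the group labels, and the toy records are hypothetical, and the disparate impact ratio is assumed here to be the ratio of positive-prediction rates between two groups, which is one common definition among several.

```python
from collections import defaultdict


def group_rates(records):
    """Per-group FPR, FNR, and positive-prediction rate.

    records: iterable of (group, y_true, y_pred) tuples with 0/1 labels.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, y_true, y_pred in records:
        # "t"/"f" marks a correct/incorrect prediction, "p"/"n" its sign.
        key = ("t" if y_true == y_pred else "f") + ("p" if y_pred == 1 else "n")
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        negatives = c["fp"] + c["tn"]   # actual negatives in this group
        positives = c["tp"] + c["fn"]   # actual positives in this group
        total = negatives + positives
        rates[group] = {
            "fpr": c["fp"] / negatives if negatives else 0.0,
            "fnr": c["fn"] / positives if positives else 0.0,
            "positive_rate": (c["tp"] + c["fp"]) / total,
        }
    return rates


# Hypothetical toy data: (group, true label, predicted label).
toy = [
    ("A", 0, 1), ("A", 0, 0), ("A", 1, 1), ("A", 1, 0),
    ("B", 0, 0), ("B", 0, 0), ("B", 0, 0), ("B", 1, 1),
]
rates = group_rates(toy)
# Ratio of positive-prediction rates, group B relative to group A.
impact_ratio = rates["B"]["positive_rate"] / rates["A"]["positive_rate"]
print(rates, impact_ratio)
```

With groups this small, a single record shifts each rate substantially, which is the same sample-size caution the abstract raises about extreme subgroup values.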

  • Research Article
  • 10.12688/f1000research.172498.1
The Improved Hybrid STD– Radial Basis Function Neural Network Approach for Time Series Forecasting Application to Tesla Stock Price Prediction
  • Feb 18, 2026
  • F1000Research
  • Hiba H Abdullah + 2 more

The forecast of time series in financial applications is difficult to perform as time series forecasting is nonlinear in nature, seasonal, and has structural variability. Stock price series tend to follow a lot of nonlinear dynamics, which undermines the power of single-model approaches. Hybrid decomposition-based models have attracted increasing interest in order to gain accuracy by separating heterogeneous features from one another. In this work, we present a hybrid forecasting methodology that incorporates STD decomposition with RBFNN (Radial Basis Function Neural Network). The time series is decomposed, where trend, seasonal, and dispersion components are separately modeled using RBFNN with Gaussian basis functions. The predicted feature sets are then recombined to construct a forecast, to be evaluated with weekly Tesla stock price data and standard accuracy performance metrics. The STD–RBFNN model gives very low forecasting errors under different variables and a high coefficient of determination. It shows superiority compared with an alternative hybrid neural network model, especially in modeling nonlinear variation under similar experimental conditions. This results in substantially greater forecasting accuracy because trend, seasonal, and dispersion components separate before neural modeling. The proposed STD and RBFNN pipeline is a good and highly flexible method to forecast complex nonlinear and seasonal financial time series.

  • Book Chapter
  • 10.9734/bpi/nvst/v7/5077f
Machine Learning Algorithms for Heart Disease Prediction: A Comparative Analysis
  • Oct 26, 2021
  • Isreal Ufumaka

Machine learning has become popular today, as many of its algorithms are now commonly used in data science projects across various industries, especially in the health care sector. It is imperative for researchers and medical professionals to be assisted by machine learning methods in the early detection of diseases such as heart disease, which is a major killer of humans in our world today. Threats to health and life can be prevented with a correct prediction of heart disease, and an incorrect prediction can prove fatal. Machine learning algorithms are excellent at learning from data, and since healthcare providers generate huge amounts of data on a daily basis, these algorithms can thrive in this field. In this research study, a comparative analytical approach was taken to determine which algorithm performs better under the given conditions. Various experiments were carried out using 5-fold and 10-fold cross-validation to ensure that the models created can generalize well enough. This study makes use of data from the University of California, Irvine (UCI) machine learning database containing 303 instances with 14 attributes. The collected data is scaled using the Min-Max normalization technique. Different popular models are built on the scaled data using supervised machine learning classification algorithms such as Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Logistic Regression (LR), Naïve Bayes (NB), Random Forest (RF), and the Gradient Boosting ensemble method. These algorithms are also evaluated using standard performance metrics such as precision, recall, and F1-score. From the experiments carried out, it can be concluded that SVM performs better, as it outperforms the other algorithms.
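The Min-Max normalization step mentioned in this abstract can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the chapter's code: the function name `min_max_scale` and the sample values are hypothetical, and mapping a constant column to 0.0 is one arbitrary convention for the degenerate case.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0
        # (an assumed convention, chosen to avoid dividing by zero).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attribute values, for example patient ages.
scaled = min_max_scale([29, 45, 61, 77])
print(scaled)
```

In a cross-validation setting, the minimum and maximum would normally be taken from the training folds only and then applied to the held-out fold, so that the evaluation data does not leak into the scaling.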

  • Conference Article
  • Cite Count Icon 29
  • 10.1109/ubmk.2019.8906995
Credit Card Fraud Detection with Machine Learning Methods
  • Sep 1, 2019
  • Gokhan Goy + 2 more

As credit card usage grows, the number of credit card transactions increases dramatically. It is difficult to identify fraudulent transactions among the vast number of credit card transactions. Although credit card fraud accounts for only a small number of transactions, it causes serious financial losses for individuals and organizations. Even though a large number of studies have been conducted to solve this problem, there is no generally accepted solution. In this paper, a publicly available data set is used. The class imbalance problem of the data set was addressed by using hybrid sampling methods together. On this data set, comparative performance evaluations have been conducted. Unlike other studies, the Area Under the Curve (AUC) metric, which expresses success on such data sets, has also been used in addition to standard performance metrics. Since it is also important to detect fraudulent credit card transactions quickly, the running time of the different methods is also presented as another performance metric.

  • Research Article
  • 10.12688/f1000research.172498.2
The Improved Hybrid STD- Radial Basis Function Neural Network Approach for Time Series Forecasting Application to Tesla Stock Price Prediction.
  • Jan 1, 2026
  • F1000Research
  • Hiba H Abdullah + 2 more

The forecast of time series in financial applications is difficult to perform as time series forecasting is nonlinear in nature, seasonal, and has structural variability. Stock price series tend to follow a lot of nonlinear dynamics, which undermines the power of single-model approaches. Hybrid decomposition-based models have attracted increasing interest in order to gain accuracy by separating heterogeneous features from one another. In this work, we present a hybrid forecasting methodology that incorporates STD decomposition with RBFNN (Radial Basis Function Neural Network). The time series is decomposed, where trend, seasonal, and dispersion components are separately modeled using RBFNN with Gaussian basis functions. The predicted feature sets are then recombined to construct a forecast, to be evaluated with weekly Tesla stock price data and standard accuracy performance metrics. The experimental analysis of weekly Tesla stock price data presents that the STD-RBFNN structure results in lower forecast errors, compared to the comparison hybrid model discussed in this paper. The improvement seems to be realized by decomposing the original series into components before learning non-linearly, and then by reconstructing the final forecast from component-wise predictions. But this empirical work remains narrow to a single-asset case type and the benchmark set here. The proposed framework is therefore considered to be a prospective hybrid forecasting design that needs validation across additional assets and forecasting models to achieve wider generalizability.

  • Conference Article
  • 10.1115/es2025-156776
Evaluating Vertical Building Integrated Photovoltaics With Standard Performance Metrics
  • Jul 8, 2025
  • Hannah Arnow + 3 more

Net-zero legislations are being implemented around the world to reduce buildings’ carbon emissions. Various new building technologies are developed to meet these new requirements. Building integrated photovoltaics (BIPV) is one of these technologies, being used to harvest solar energy on-site. However, they lack standard and user-friendly metrics to evaluate their overall performance. This paper investigates several standard metrics and proposes new ones for assessing the performance of vertical BIPV. The seasonal and annual energy production ratio are newly proposed metrics to compare the energy output of vertical BIPV to a south-facing PV system at optimal tilt and same geographical location. The annual specific yield and performance ratio, two standard metrics in solar industry, are also being presented to evaluate the system’s capability with respect to standard testing conditions. Finally, a payback period scaling factor is investigated as a method for rapid assessment of payback period for vertical BIPV systems. These metrics are reported for six distinct regions spanning across the US. With such metrics, vertical BIPV performance potential and the cost of implementing them at various locations in the US is more easily understood, which may increase the demand and acceptance of this type of technology.

  • Research Article
  • Cite Count Icon 46
  • 10.1016/j.eswa.2024.124922
Unsupervised anomaly detection in time-series: An extensive evaluation and analysis of state-of-the-art methods
  • Jul 30, 2024
  • Expert Systems With Applications
  • Nesryne Mejri + 5 more

Unsupervised anomaly detection in time-series has been extensively investigated in the literature. Notwithstanding the relevance of this topic in numerous application fields, a comprehensive and extensive evaluation of recent state-of-the-art techniques taking into account real-world constraints is still needed. Some efforts have been made to compare existing unsupervised time-series anomaly detection methods rigorously. However, only standard performance metrics, namely precision, recall, and F1-score are usually considered. Essential aspects for assessing their practical relevance are therefore neglected. This paper proposes an in-depth evaluation study of recent unsupervised anomaly detection techniques in time-series. Instead of relying solely on standard performance metrics, additional yet informative metrics and protocols are taken into account. In particular, (i) more elaborate performance metrics specifically tailored for time-series are used; (ii) the model size and the model stability are studied; (iii) an analysis of the tested approaches with respect to the anomaly type is provided; and (iv) a clear and unique protocol is followed for all experiments. Overall, this extensive analysis aims to assess the maturity of state-of-the-art time-series anomaly detection, give insights regarding their applicability under real-world setups and provide to the community a more complete evaluation protocol.

  • Research Article
  • Cite Count Icon 63
  • 10.3390/app10155261
Combining Internal- and External-Training-Loads to Predict Non-Contact Injuries in Soccer
  • Jul 30, 2020
  • Applied Sciences
  • Emmanuel Vallance + 4 more

The large amount of features recorded from GPS and inertial sensors (external load) and well-being questionnaires (internal load) can be used together in a multi-dimensional non-linear machine learning based model for a better prediction of non-contact injuries. In this study we put forward the main hypothesis that the use of such models would be able to inform better about injury risks by considering the evolution of both internal and external loads over two horizons (one week and one month). Predictive models were trained with data collected by both GPS and subjective questionnaires and injury data from 40 elite male soccer players over one season. Various classification machine-learning algorithms that performed best on external and internal loads features were compared using standard performance metrics such as accuracy, precision, recall and the area under the receiver operator characteristic curve. In particular, tree-based algorithms based on non-linear models with an important interpretation aspect were privileged as they can help to understand internal and external load features impact on injury risk. For 1-week injury prediction, internal load features data were more accurate than external load features while for 1-month injury prediction, the best performances of classifiers were reached by combining internal and external load features.

  • Research Article
  • Cite Count Icon 6
  • 10.1016/j.procs.2024.02.135
Machine Learning for failure prediction: A cost-oriented model selection
  • Jan 1, 2024
  • Procedia Computer Science
  • Alessia Maria Rosaria Tortora + 4 more

