A MODIFIED CORRELATION BASED REGULARIZATION TECHNIQUE FOR REGRESSION ESTIMATION AND FEATURE SELECTION
Variable selection is important for making sense of (ultra) high-dimensional data. Penalized least squares methods such as the LASSO, elastic-net and the correlation based elastic-net (L1CP) are popular for carrying out variable selection and estimation simultaneously. This study proposes a modified version of the L1CP motivated by reasons similar to those given by Zou and Hastie (2005), where the naïve elastic net was rescaled to give the elastic net. The scaling transformation is derived such that the double shrinkage caused by applying two penalties is undone, thereby reducing bias. The derived scaling transformations are found to depend on the correlations among the predictors. A robust worst-case quadratic solver is used to obtain estimates. An evaluation of the proposed method, referred to as CL1CP, alongside the L1CP, LASSO and elastic-net through simulation studies illustrates the advantages of the CL1CP over the other alternatives considered, especially in correct selection of sparse models. In terms of variable selection, estimation and prediction accuracy, the proposed CL1CP performs favourably compared to the L1CP, LASSO and elastic-net, especially for “grouped-variables” selection. Results from applications to two real life datasets corroborate the findings from simulation studies.
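The rescaling idea credited to Zou and Hastie (2005) can be illustrated in the special case of an orthonormal design, where the naïve elastic net has a closed form. The sketch below is only that textbook case: the function names and the example coefficients are hypothetical, and it does not reproduce the CL1CP scale factors, which depend on predictor correlations and are not given in the abstract.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def naive_elastic_net_orthonormal(beta_ols, lam1, lam2):
    """Naive elastic-net estimate under an orthonormal design.

    The L1 penalty soft-thresholds each OLS coefficient and the L2
    penalty shrinks it again by 1 / (1 + lam2): the double shrinkage.
    """
    return [soft_threshold(b, lam1 / 2.0) / (1.0 + lam2) for b in beta_ols]


def elastic_net_orthonormal(beta_ols, lam1, lam2):
    """Rescale the naive estimate by (1 + lam2) to undo the second shrinkage."""
    naive = naive_elastic_net_orthonormal(beta_ols, lam1, lam2)
    return [(1.0 + lam2) * b for b in naive]


# Hypothetical OLS coefficients, for illustration only.
beta_ols = [3.0, -1.5, 0.2]
naive = naive_elastic_net_orthonormal(beta_ols, lam1=1.0, lam2=1.0)
rescaled = elastic_net_orthonormal(beta_ols, lam1=1.0, lam2=1.0)
```

In this special case the rescaled estimate coincides with the soft-thresholded OLS coefficients, which is why the correction removes the extra bias without changing which variables are selected.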
- Research Article
- 10.35877/454ri.jinav1836
- Jun 21, 2023
- JINAV: Journal of Information and Visualization
This research was conducted to examine the accuracy of predicting stock price index fluctuations for LQ45 stocks using the C4.5 algorithm with Correlation-Based Feature Selection (CFS) and Information Gain (IG) techniques. The study combined the C4.5 algorithm with a feature selection technique that merges CFS and Information Gain in the hope of obtaining accurate results. The analysis of the LQ45 index proceeded through several stages: data collection, manual pre-processing, validation methods, feature processing, decision tree model results, and classification accuracy performance. The test results revealed that the C4.5 algorithm with CFS and Information Gain can be applied well to LQ45 stocks. The accuracy obtained from the original data (without feature selection) was 77.857%, while combining CFS with Information Gain raised the accuracy to 78.333%. Thus, the C4.5 process with CFS alone cannot improve the accuracy level, whereas combining it with Information Gain yields higher accuracy.
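As a rough illustration of the Information Gain criterion named above, the following sketch scores a categorical feature by how much it reduces the entropy of the class labels. The function names and the toy "up"/"down" data are hypothetical and are not taken from the LQ45 study.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy after splitting on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data: one feature separates the classes perfectly, the other not at all.
labels = ["up", "up", "down", "down"]
ig_informative = information_gain(["a", "a", "b", "b"], labels)
ig_useless = information_gain(["a", "b", "a", "b"], labels)
```

Features would then be ranked by this score and the lowest-scoring ones dropped before training the tree.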
- Conference Article
15
- 10.1109/pdgc.2018.8745830
- Dec 1, 2018
Diagnosis of diseases at an early stage is a crucial task in the medical field. A hybrid machine learning framework is presented for the diagnosis of breast cancer and diabetes using efficient feature selection and classification techniques. This research identifies significant risk factors in both chronic disease datasets by applying different feature selection techniques and a hybridization of ReliefF Feature Ranking with the Principal Component Analysis (PCA) method. To evaluate the effectiveness of the presented feature selection method, the k-nearest neighbor method is used for classification. With the proposed feature selection technique, the hybridization enhances the accuracy of the classifier for both chronic disease datasets. The presented hybrid framework performs best in comparison to five other techniques - Correlation Based Feature Selection (CBS), Fast Correlation Based Feature Selection (FCBF), Mutual Information Based Feature Selection (MIFS), MODTree Filtering Approach and ReliefF Feature Selection. Moreover, the proposed ReliefF-PCA method eliminates 25% and 33.3% of irrelevant features for the diabetes and breast cancer datasets respectively.
- Research Article
- 10.33003/fjs-2025-0901-2774
- Jan 31, 2025
- FUDMA JOURNAL OF SCIENCES
Regularized regression techniques such as the least absolute shrinkage and selection operator (LASSO), elastic-net, and the type 1 and type 2 correlation adjusted elastic-net (CAEN1 and CAEN2 respectively) are used for simultaneously carrying out variable selection and estimation of coefficients in machine learning. Modified estimators based on the CAEN1 and CAEN2 are proposed in this study by rescaling the estimates to undo the double shrinkage incurred due to the application of two penalties. The scale factors are derived by decomposing the correlation matrix of the predictors. The derived scale factors, which depend on the magnitude of correlations among the predictors, ensure that the elastic-net is included as a special case. Estimation is carried out using a robust worst-case quadratic solver algorithm. Simulations show that the proposed estimators referred to as corrected correlation adjusted elastic-net (CCAEN1 and CCAEN2) perform competitively with the CAEN1, CAEN2, LASSO, and elastic-net in terms of variable selection, estimation and prediction accuracy with CCAEN1 yielding the best results when the number of predictors is more than the number of observations and CCAEN2 producing the best performance when there is grouping effect, where highly correlated predictors tend to be included in or excluded from the model together. Applications to two real-life datasets further demonstrate the advantage of the proposed methods for machine learning.
- Research Article
- 10.1155/2024/9382390
- Jan 1, 2024
- The Scientific World Journal
Cancer is one of the leading causes of death across the globe. There is a need for early diagnosis to improve the chance of successful treatment and reduce the mortality associated with cancer. Due to the availability of highly specialized cancer datasets, molecular classification of cancer by gene expression, machine learning, and deep learning, a part of artificial intelligence (AI) techniques is used in detecting the disease. The application of several classification and feature selection methods on microarray gene expression datasets helps learn models that are able to predict a given disease. However, the tremendous dimensionality of the microarray cancer dataset is the greatest challenge in interpreting the data. In this work, the optimal feature subsets are selected by combining the correlation‐based feature selection (CFS) technique with five distinct meta‐heuristic search methods: evolutionary search (ES), particle swarm optimization search (PSOS), genetic search (GS), harmony search (HS), and multiobject evolutionary search (MOES). Furthermore, a CFS‐MOES (correlation‐based feature selection—multiobject evolutionary search) ensemble model is proposed based on a majority voting mechanism to improve the classification performance. Six microarray cancer datasets are considered, and seven traditional classifiers are evaluated on those datasets. Three classifiers, namely, K‐nearest neighbour (KNN), multilayer perceptron (MLP), and random forest (RF), were chosen as the base classifiers based on their F‐measure score. The features chosen by our proposed CFS‐MOES method significantly improve the accuracy of the proposed model. 
Moreover, the proposed model has also been compared with the other ensemble models generated using CFS‐ES (correlation‐based feature selection —evolutionary search), CFS‐PSOS (correlation‐based feature selection—particle swarm optimization search), CFS‐GS (correlation‐based feature selection—genetic search), and CFS‐HS (correlation‐based feature selection—harmony search) feature selection methods, ensuring better classification accuracy with a reduced feature subset. This model is also evaluated using significant parameters such as precision, recall, F‐measure, accuracy, Matthews correlation coefficient (MCC), and mean absolute error (MAE). According to the experimental results, our proposed model has a remarkable accuracy of 98.83% for breast cancer and 98.79% for cervical cancer.
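The majority voting mechanism and the Matthews correlation coefficient mentioned above can be sketched as follows. This is a minimal illustration with hypothetical predictions from three base classifiers, not the authors' CFS-MOES pipeline.

```python
from collections import Counter
from math import sqrt


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per sample.

    With an odd number of binary voters there are no ties; otherwise
    Counter.most_common keeps the label encountered first.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]


def mcc(y_true, y_pred, positive=1):
    """Matthews correlation coefficient for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical predictions from three base classifiers on four samples.
knn = [1, 0, 1, 1]
mlp = [1, 1, 0, 1]
rf = [0, 0, 1, 1]
ensemble = majority_vote([knn, mlp, rf])
score = mcc([1, 0, 1, 1], ensemble)
```

Each base classifier errs on a different sample here, so the vote corrects all three mistakes; real ensembles benefit only to the extent that their members' errors are not shared.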
- Research Article
23
- 10.1080/01431161.2019.1594435
- Mar 28, 2019
- International Journal of Remote Sensing
Geographic object-based image analysis (GEOBIA) has demonstrated strong capability compared with pixel-based algorithms for urban characterization studies. In this study, new solutions to feature selection (FS) and image segmentation optimization are investigated in the GEOBIA domain. First, the combination of Taguchi-based optimization technique and F-score segmentation quality measures was adopted to optimize the parameters of multiresolution segmentation (MRS) and determine the optimum multiscale combinations of MRS parameters. Second, artificial bee colony (ABC) FS was integrated to select the most relevant features. Third, random forest (RF) classification algorithm was utilized to extract multiscale urban land use/land cover (LULC) classes from geographically wide images obtained from two WorldView-3 image datasets. The proposed method was developed in the first study area and later applied to the second study area for validation. Results of image segmentation optimization indicated that scales 40 and 80 were the best for classification. The result of FS through ABC outperformed those of other FS techniques, including support vector machine with recursive feature elimination (SVM-RFE), variable selection using RF, Boruta, genetic algorithm, correlation-based FS, and chi-square, with an overall accuracy (OA) of 88.46%. Among the 100 examined features, only 25 were significant. The RF classification results showed a kappa coefficient (κ) of 0.84 in the first study area. The transferability and scalability of the best-performing features based on ABC FS were evaluated in the second study area, which covered a geographically wide scene of 162. The results for the second study area obtained an OA of 86.78% and a κ of 0.82. The proposed integrated method is an efficient and promising technique for high-quality LULC mapping of geographically wide areas.
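For readers unfamiliar with the two reported metrics, overall accuracy (OA) and the kappa coefficient can both be computed from a confusion matrix. The sketch below uses a hypothetical 2x2 matrix, not the study's LULC results.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def cohen_kappa(cm):
    """Cohen's kappa: agreement beyond what the marginals predict by chance."""
    size = len(cm)
    total = sum(sum(row) for row in cm)
    observed = overall_accuracy(cm)
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(cm[i][j] for i in range(size)) for j in range(size)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical confusion matrix: rows are reference classes, columns predictions.
cm = [[45, 5],
      [10, 40]]
oa = overall_accuracy(cm)
kappa = cohen_kappa(cm)
```

Kappa is lower than OA because it discounts the agreement expected by chance from the class marginals.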
- Research Article
48
- 10.1016/j.eswa.2023.119806
- Mar 17, 2023
- Expert Systems with Applications
Context: The application of Software Fault Prediction (SFP) in the software development life cycle to predict the faulty class at an early stage has piqued the interest of various scholars. In the SFP domain, research analysis showed that very little work has addressed the class imbalance and feature redundancy problems jointly to enhance the performance and prediction accuracy of SFP models. The literature survey also revealed a dearth of studies offering a comprehensive comparative analysis of different sampling and feature selection strategies together. Objective: This research builds an extensive assessment of distinct combinations of different feature selection and sampling approaches, to effectively overcome the problems of class overlap, class imbalance, and feature redundancy. The objective is to determine the best combination that will produce results with a higher degree of accuracy and an effective SFP model. Method: The study applied 8 different sampling techniques along with 10 feature selection algorithms against 56 open-source projects. The comparative analysis is performed against 5346 variants of input datasets by applying 8 different classifiers to predict the faulty class. In addition, the research paper presents an intensive assessment of the performance of these techniques individually against all the input projects. Accuracy and Area Under the receiver operating characteristic (ROC) Curve (AUC) are used as performance metrics to compare the models developed using the classification algorithms. Result: For each project in the proposed work, we evaluated a total of 792 combinations that were produced using 10 feature selection methods, 1 all metrics dataset, 8 sampling methods, 1 original, unsampled dataset, and 8 classifiers. 
The empirical result indicates that, for 21 of 54 projects (38.89%), the combination of Synthetic Minority Over Sampling Technique Edited (SMOTEE) with correlation-based feature selection (FS2) achieved the highest AUC value. Additionally, according to experimental results, the highest AUC values were attained by 24.07% of projects using the SMOTEE, FS2, and RF combination. Conclusion: The results of the statistical analysis test reveal that 93.42% of the combinational pairs of different sampling and feature selection approaches demonstrated a significant variance in performance. The empirical result indicates that the performance of the SFP model is adversely impacted by class imbalance and irrelevance. The outcome indicates that, for more than 75% of projects, the performance of trained models improved, with AUC values ranging from 0.805 to 0.99, after applying sampling and feature selection strategies, in comparison with models built without them.
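The count of 792 combinations per project and the 38.89% share quoted above follow from simple arithmetic, sketched here. The identifiers (`FS1`, `sampler1`, `clf1`, and so on) are placeholders for illustration, not the techniques' actual names.

```python
from itertools import product

# 10 feature selection methods plus the dataset with all metrics kept.
feature_sets = [f"FS{i}" for i in range(1, 11)] + ["all_metrics"]
# 8 sampling methods plus the original, unsampled dataset.
samplers = [f"sampler{i}" for i in range(1, 9)] + ["unsampled"]
# 8 classifiers.
classifiers = [f"clf{i}" for i in range(1, 9)]

combinations = list(product(feature_sets, samplers, classifiers))
n_combinations = len(combinations)  # 11 * 9 * 8

# Share of projects where one combination had the highest AUC.
share_percent = round(100 * 21 / 54, 2)
```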
- Research Article
1
- 10.15575/join.v9i1.1307
- Apr 23, 2024
- Jurnal Online Informatika
Software defect prediction (SDP) is used to identify defects in software modules that can be a challenge in software development. This research focuses on the problems that occur in Particle Swarm Optimization (PSO), such as noisy attributes, high-dimensional data, and premature convergence. It therefore aims to improve PSO performance by using feature selection methods with hybrid techniques to overcome these problems. The feature selection techniques used are Filter and Wrapper. The methods used are Chi-Square (CS), Correlation-Based Feature Selection (CFS), and Forward Selection (FS), because feature selection methods have been proven to overcome data dimensionality problems and eliminate noisy attributes. Researchers often use feature selection for this purpose because these methods play an important role in reducing data dimensions and eliminating uncorrelated attributes that can cause noise. The Naive Bayes algorithm is used to support the process of determining the most optimal class. Performance evaluation uses AUC with an alpha value of 0.050. This hybrid feature selection technique brings significant improvement to PSO performance with a much lower AUC value of 0.00342. Comparison of the significance of AUC with other combinations shows the value of FS PSO of 0.02535, CFS FS PSO of 0.00180, and CS FS PSO of 0.01186. The method in this study contributes to improving PSO in the SDP domain by significantly increasing the AUC value. Therefore, this study highlights the potential of feature selection with hybrid techniques to improve PSO performance in SDP.
- Research Article
4
- 10.1142/s2972370123500010
- Jan 1, 2023
- Computing Open
Many efforts have already been carried out for education mining in the past. Many techniques and models are already developed to predict and identify students’ performance, learning behavior, status, and education level. However, there is no exact solution for the student’s result prediction since it is affected by the level of the student, the field of study, the location of the data collection, different sizes and nature of data, etc. Different research shows that there can be up to 10% difference in the accuracy of results with and without a feature selection process. Thus, the proposed model designs a better model for student result prediction using feature selection and deep learning techniques. The proposed dissertation task compares and analyzes Correlation-based Feature Selection (CFS), Chi-Square (χ²), Genetic Algorithm (GA), Information Gain (IG), Maximum Relevance Minimum Redundancy (mRMR), ReliefF, and Recursive Feature Elimination (RFE) feature selection techniques with the Classification and Regression Tree (CART), Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Convolution Neural Network (CNN) and Long Short-Term Memory (LSTM) machine learning algorithms. In the proposed model, a feature selection process CFS and prediction using CNN is recommended. The recommended model (CFS–CNN) is tested with a primary dataset collected from bachelor-level students. The recommended model provides improved performance compared to old techniques. The major contribution of the proposed dissertation is to design a better model for the prediction of students’ results using demographic data and past examination results.
- Research Article
1
- 10.33395/sinkron.v9i1.13293
- Jan 10, 2024
- Sinkron
Entrepreneurs are critical to a country's economic progress and job creation. A generation ago, few people felt that schools had much to offer in business. Students are now expected to become entrepreneurs as an outcome of their course. The goal of this study is to build a model that predicts students' future employment, particularly in the field of entrepreneurship, using big data analysis and data mining. Various educational institutions can use data mining methodologies to identify hidden patterns in data contained in databases. The feature selection technique was utilised in this study to select and assess the significance of each element. The model was built using the final parameters determined by the feature selection technique (Correlation Based Feature Selection). Using 10-fold cross validation for the training and testing dataset distribution, the Naïve Bayes classifier was used to forecast the students' future of work. The dataset for the study was gathered from student performance reports at Universitas Negeri Medan's engineering department. The effectiveness of using feature selection algorithms was compared to the effectiveness of not using them, and the results are discussed. According to the findings of this study, the accuracy of Naïve Bayes with Correlation Based Feature Selection is 87.4%, which is higher than that of the model without any feature selection. It was also discovered that the overall accuracy of the Correlation Based Feature Selection and Naïve Bayes Classifier models appears to be higher than that of the other treatments.
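Correlation Based Feature Selection is commonly described through a subset "merit" score that rewards features correlated with the target and penalises features correlated with each other. The sketch below uses that standard merit formula with absolute Pearson correlation as the correlation measure, which is an assumption here; the study does not state which measure or search strategy it used, and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cfs_merit(features, target):
    """CFS merit: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(features)
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Toy data: f1 tracks the target, f2 duplicates f1, f3 is unrelated.
target = [1, 2, 3, 4]
f1 = [1, 2, 3, 4]
f2 = [2, 4, 6, 8]
f3 = [1, -1, -1, 1]
merit_single = cfs_merit([f1], target)
merit_redundant = cfs_merit([f1, f2], target)
merit_noisy = cfs_merit([f1, f3], target)
```

Adding the redundant copy leaves the merit unchanged and adding the unrelated feature lowers it, so a search guided by this score would keep only `f1`.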
- Book Chapter
2
- 10.1007/978-3-642-22910-7_6
- Jan 1, 2011
PC and TPDA algorithms are robust and well known prototype algorithms, incorporating constraint-based approaches for causal discovery. However, neither algorithm scales up to high dimensional data, that is, more than a few hundred features. This chapter presents hybrid correlation and causal feature selection for ensemble classifiers to deal with this problem. Redundant features are removed by correlation-based feature selection and then irrelevant features are eliminated by causal feature selection. The number of eliminated features, accuracy, the area under the receiver operating characteristic curve (AUC) and false negative rate (FNR) of the proposed algorithms are compared with correlation-based feature selection (FCBF and CFS) and causal based feature selection algorithms (PC, TPDA, GS, IAMB).
- Book Chapter
2
- 10.5772/intechopen.100506
- Apr 6, 2022
In high-dimensional data, penalized regression is often used for variable selection and parameter estimation. However, these methods typically require time-consuming cross-validation methods to select tuning parameters and retain more false positives under high dimensionality. This chapter discusses sparse boosting based machine learning methods in the following high-dimensional problems. First, a sparse boosting method to select important biomarkers is studied for the right censored survival data with high-dimensional biomarkers. Then, a two-step sparse boosting method to carry out the variable selection and the model-based prediction is studied for the high-dimensional longitudinal observations measured repeatedly over time. Finally, a multi-step sparse boosting method to identify patient subgroups that exhibit different treatment effects is studied for the high-dimensional dense longitudinal observations. This chapter intends to solve the problem of how to improve the accuracy and calculation speed of variable selection and parameter estimation in high-dimensional data. It aims to expand the application scope of sparse boosting and develop new methods of high-dimensional survival analysis, longitudinal data analysis, and subgroup analysis, which has great application prospects.
- Research Article
2
- 10.3233/ida-215825
- Mar 14, 2022
- Intelligent Data Analysis
Software maintainability is a significant consideration when choosing particular software. It helps in estimating the effort required after delivering the software to the customer. However, issues like imbalanced distribution of datasets, and redundant and irrelevant features, degrade the performance of maintainability prediction models. Therefore, the current study applies the ImpS algorithm to handle imbalanced data and extensively investigates several Feature Selection (FS) techniques including Symmetrical Uncertainty (SU), RandomForest filter, and Correlation-based FS using one open-source, three proprietary and two commercial datasets. Eight different machine learning algorithms are utilized for developing prediction models. The performance of the models is evaluated using Accuracy, G-Mean, Balance, and Area under the ROC Curve (AUC). Two statistical tests, the Friedman Test and the Wilcoxon Signed Ranks Test, are conducted for assessing different FS techniques. The results substantiate that FS techniques significantly improve the performance of various prediction models, with an overall improvement of 18.58%, 129.73%, 80.00%, and 45.76% in the median values of Accuracy, G-Mean, Balance, and AUC, respectively, for all the datasets taken together. The Friedman test supports the superiority of the SU FS technique. The Wilcoxon Signed Ranks test shows that the SU FS technique is significantly superior to the CFS technique for three out of six datasets.
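Symmetrical Uncertainty, the best-performing filter in this abstract, is usually defined as mutual information normalised by the sum of the two entropies. The following sketch implements that standard definition for discrete variables on hypothetical toy data; it is not the study's tooling.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), ranging from 0 to 1."""
    hx, hy = entropy(x), entropy(y)
    joint = entropy(list(zip(x, y)))
    mutual_information = hx + hy - joint
    return 2 * mutual_information / (hx + hy) if hx + hy > 0 else 0.0


# Toy data: the first feature determines the label, the second is independent.
label = [0, 0, 1, 1]
su_informative = symmetrical_uncertainty(["a", "a", "b", "b"], label)
su_independent = symmetrical_uncertainty(["a", "b", "a", "b"], label)
```

The normalisation compensates for the tendency of raw information gain to favour features with many distinct values.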
- Conference Article
5
- 10.1109/icosnikom56551.2022.10034873
- Oct 19, 2022
Timely graduation is a problem often experienced by study programs at higher education institutions, and several factors can be the cause. This study applies data mining feature selection techniques to analyze attributes from student academic data that are likely to affect students' on-time graduation. The feature selection techniques used are Correlation-based Feature Selection, Information Gain Based Feature Selection, and Learner Based Feature Selection. The accuracy of each feature selection method is measured using the Naïve Bayes classification algorithm. The classification tests using Naïve Bayes with Correlation-based Feature Selection and with Information Gain Based Feature Selection achieve almost the same accuracy as Naïve Bayes without feature selection. With Learner Based Feature Selection, however, reducing the number of features by eliminating those with little relevance can increase accuracy, namely to 70.06% from 66.53%.
- Research Article
7
- 10.1038/s41598-024-58241-1
- Apr 3, 2024
- Scientific Reports
Predictive modelling of cancer outcomes using radiomics faces dimensionality problems and data limitations, as radiomics features often number in the hundreds, and multi-institutional data sharing is often unfeasible. Federated learning (FL) and feature selection (FS) techniques combined can help overcome these issues, as one provides the means of training models without exchanging sensitive data, while the other identifies the most informative features, reduces overfitting, and improves model interpretability. Our proposed FS pipeline based on FL principles targets data-driven radiomics FS in a multivariate survival study of non-small cell lung cancer patients. The pipeline was run across datasets from three institutions without patient-level data exchange. It includes two FS techniques, Correlation-based Feature Selection and LASSO regularization, and Cox Proportional-Hazard regression with Overall Survival as endpoint. Trained and validated on 828 patients overall, our pipeline yielded a radiomic signature comprising "intensity-based energy" and "mean discretised intensity". Validation resulted in a mean Harrell C-index of 0.59, showcasing fair efficacy in risk stratification. In conclusion, we suggest a distributed radiomics approach that incorporates preliminary feature selection to systematically decrease the feature set based on data-driven considerations. This aims to address dimensionality challenges beyond those associated with data constraints and interpretability concerns.
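The Harrell C-index reported above measures how often a model ranks patients' risks in the same order as their observed survival times. A minimal sketch under simplifying assumptions follows: tied survival times are ignored, tied risk scores count as half, and the four-patient dataset is hypothetical.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs whose risk ordering matches survival order.

    A pair (i, j) is comparable when patient i has an observed event
    (events[i] is truthy) strictly before patient j's recorded time.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical follow-up times, event indicators (0 = censored), and risks.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk_scores = [0.9, 0.4, 0.5, 0.1]
c_index = harrell_c_index(times, events, risk_scores)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported 0.59 in context.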
- Research Article
7
- 10.1186/s12859-021-04232-2
- Jul 6, 2021
- BMC Bioinformatics
Background: Microbiome studies have uncovered associations between microbes and human, animal, and plant health outcomes. This has led to an interest in developing microbial interventions for treatment of disease and optimization of crop yields which requires identification of microbiome features that impact the outcome in the population of interest. That task is challenging because of the high dimensionality of microbiome data and the confounding that results from the complex and dynamic interactions among host, environment, and microbiome. In the presence of such confounding, variable selection and estimation procedures may have unsatisfactory performance in identifying microbial features with an effect on the outcome. Results: In this manuscript, we aim to estimate population-level effects of individual microbiome features while controlling for confounding by a categorical variable. Due to the high dimensionality and confounding-induced correlation between features, we propose feature screening, selection, and estimation conditional on each stratum of the confounder followed by a standardization approach to estimation of population-level effects of individual features. Comprehensive simulation studies demonstrate the advantages of our approach in recovering relevant features. Utilizing a potential-outcomes framework, we outline assumptions required to ascribe causal, rather than associational, interpretations to the identified microbiome effects. We conducted an agricultural study of the rhizosphere microbiome of sorghum in which nitrogen fertilizer application is a confounding variable. 
In this study, the proposed approach identified microbial taxa that are consistent with biological understanding of potential plant-microbe interactions. Conclusions: Standardization enables more accurate identification of individual microbiome features with an effect on the outcome of interest compared to other variable selection and estimation procedures when there is confounding by a categorical variable.
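One common reading of "standardization" with a categorical confounder is to estimate an effect within each stratum and then average those estimates, weighted by each stratum's share of the population. The sketch below shows only that weighted average; the stratum names, effect sizes, and counts are hypothetical, and the paper's actual estimator may differ in detail.

```python
def standardized_effect(stratum_effects, stratum_counts):
    """Average stratum-specific effects, weighted by stratum frequency."""
    total = sum(stratum_counts[s] for s in stratum_effects)
    return sum(stratum_effects[s] * stratum_counts[s] / total
               for s in stratum_effects)


# Hypothetical per-stratum effect estimates for one microbiome feature,
# with strata defined by a categorical confounder such as fertilizer level.
stratum_effects = {"low_nitrogen": 0.2, "high_nitrogen": 0.6}
stratum_counts = {"low_nitrogen": 30, "high_nitrogen": 10}
population_effect = standardized_effect(stratum_effects, stratum_counts)
```

Because each effect is estimated inside a stratum, differences in the confounder's level cannot drive the estimate, and the weighting restores a population-level interpretation.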