Integrating fuzzy C-means clustering and Random Forest for multivariate performance prediction in vocational education
This study combines Fuzzy C-Means clustering and Random Forest regression to analyze multivariate factors influencing student job performance in vocational education using a simulated dataset. The approach identified three latent clusters and achieved a classification accuracy of 38.9% with an MSE of 266.65, demonstrating methodological feasibility despite limited predictive power due to data constraints.
This technical note presents the integration of unsupervised and supervised machine learning methods—Fuzzy C-Means (FCM) clustering and Random Forest regression—for analyzing multivariate determinants of student job performance in vocational education. Using a simulated dataset (N = 300) with seven variables, FCM identified three latent clusters with moderate partition clarity (Partition Coefficient = 0.333; Partition Entropy = 1.099). Random Forest achieved a Mean Squared Error (MSE) of 266.65 and classification accuracy of 38.9% in predicting categorical performance. While predictive power was limited due to simulated and imbalanced data, this framework demonstrates methodological feasibility and highlights key predictors such as confidence, motivation, and supervisor evaluation, serving primarily as a methodological demonstration rather than empirical validation.
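For reference when reading the reported validity indices: under the standard definitions, the Partition Coefficient lies between 1/c and 1, and the Partition Entropy lies between 0 and ln c. The following is a minimal sketch, not the study's code, that evaluates both indices for a hypothetical membership matrix with c = 3 clusters and N = 300 samples. The function names and the two test matrices are illustrative assumptions. With every membership equal to 1/3 the indices come out as 0.333 and 1.099, which match the values quoted above, so the reported partition sits at the maximally fuzzy end of both ranges.

```python
import math


def partition_coefficient(u):
    """Mean of squared memberships; u is a list of per-sample membership rows."""
    n = len(u)
    return sum(m * m for row in u for m in row) / n


def partition_entropy(u):
    """Mean Shannon entropy (natural log) of the membership rows."""
    n = len(u)
    return -sum(m * math.log(m) for row in u for m in row if m > 0) / n


c = 3
# Hypothetical extremes: fully uniform memberships and fully crisp memberships.
uniform = [[1.0 / c] * c for _ in range(300)]
crisp = [[1.0, 0.0, 0.0] for _ in range(300)]

pc_uniform = partition_coefficient(uniform)
pe_uniform = partition_entropy(uniform)
print(round(pc_uniform, 3), round(pe_uniform, 3))  # 0.333 1.099
print(partition_coefficient(crisp), abs(partition_entropy(crisp)))  # 1.0 0.0
```

Values of PC near 1 and PE near 0 would indicate well-separated clusters; values near 1/c and ln c indicate that samples are shared almost equally among all clusters.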
- Research Article
2
- 10.15598/aeee.v18i1.3328
- Mar 30, 2020
- Advances in Electrical and Electronic Engineering
The paper presents a histogram-based initialization of the Fuzzy C-Means (FCM) clustering algorithm for remote sensing image analysis. A well-known drawback of FCM clustering is its sensitivity to the choice of initial cluster centers. To overcome this drawback, the proposed algorithm first determines the optimal initial cluster centers by maximizing a histogram-based weight function. Using these initial cluster centers, the given image is segmented by fuzzy clustering. The major contribution of the proposed method is the automatic initialization of the cluster centers, which enhances clustering performance. The method is also free of experimentally set parameters. Experiments are performed on remote sensing images, and the cluster validity indices Davies-Bouldin, Partition Index, Xie-Beni, Partition Coefficient, and Partition Entropy are computed and compared with prominent methods such as FCM, K-Means, and automatic histogram-based FCM. The experimental outcomes show that the proposed method is competent for remote sensing image segmentation.
- Research Article
2
- 10.21015/vtse.v11i4.1657
- Dec 2, 2023
- VFAST Transactions on Software Engineering
Diabetes claims the lives of thousands each year, and many individuals remain unaware of their condition until it reaches a critical stage. This study presents a data mining-based approach aimed at enhancing the early detection and prediction of diabetes, using the Pima Indian Diabetes dataset. Although fuzzy C-means adapts to various data types, the outcome of the clustering process depends on the initial placement of cluster centers. Precision in data clustering is also crucial: it can supply the random forest with either extensive, well-grouped data or limited data that constrains its efficacy. Our principal objective was to enhance the accuracy of fuzzy C-means clustering and the random forest. To boost the model's performance, we combined PCA, fuzzy C-means, and the Random Forest approach. Various algorithmic combinations were employed, and the results show that our model surpasses the original outcomes on the Pima Indian Diabetes dataset in terms of accuracy. The diabetes prediction model achieved an accuracy of 97.40% using PCA, logistic regression, and K-Means. Using PCA with fuzzy C-means and random forests, an even higher accuracy of 98.96% was attained. Empirical evidence confirms that PCA improved the accuracy of both the fuzzy C-means clustering approach and the random forest classifier, which differs from previous findings.
- Research Article
64
- 10.1016/j.asoc.2020.106200
- Mar 3, 2020
- Applied Soft Computing
Local segmentation of images using an improved fuzzy C-means clustering algorithm based on self-adaptive dictionary learning
- Research Article
- 10.1371/journal.pone.0318491
- Mar 11, 2025
- PloS one
To enhance the accuracy and response speed of the risk early warning system, this study develops a novel early warning system that combines the Fuzzy C-Means (FCM) clustering algorithm and the Random Forest (RF) model. Firstly, based on operational risk theory, market risk, research and development risk, financial risk, and human resource risk are selected as the primary indicators for enterprise risk assessment. Secondly, the Criteria Importance Through Intercriteria Correlation (CRITIC) weight method is employed to determine the importance of these risk indicators, thereby enhancing the model's prediction ability and stability. Following this, the FCM clustering algorithm is utilized for pre-processing sample data to improve the efficiency and accuracy of data classification. Finally, an improved RF model is constructed by optimizing the parameters of the RF algorithm. The data selected is mainly from RESSET/DB, covering the issuance, trading, and rating data of fixed-income products such as bonds, government bonds, and corporate bonds, and provides basic information, net value, position, and performance data of funds. The experimental results show that the model achieves an F1 score of 87.26%, an accuracy of 87.95%, an Area under the Curve (AUC) of 91.20%, a precision of 89.29%, and a recall of 87.48%. They are respectively 6.45%, 4.45%, 5.09%, 4.81%, and 3.83% higher than the traditional RF model. In this study, an improved RF model based on FCM clustering is successfully constructed, and the accuracy of risk early warning models and their ability to handle complex data are significantly improved.
- Research Article
206
- 10.1016/j.patcog.2005.07.005
- Oct 12, 2005
- Pattern Recognition
Unsupervised possibilistic clustering
- Conference Article
6
- 10.1109/bracis.2018.00102
- Oct 1, 2018
Fuzzy clustering validation of high-dimensional data sets is only possible using a reliable cluster validity index. Therefore, the selection of an index is as important as choosing an appropriate clustering algorithm. A good validity index is one that correctly recognizes the data structure by choosing its correct number of clusters and that is not sensitive to any parameter of the clustering algorithm or to any data property. However, some classical fuzzy validity indices, such as Partition Coefficient (PC), Partition Entropy (PE), and Fukuyama-Sugeno (FS), are sensitive to the fuzzification factor m and the number of clusters c, both parameters of the well-known Fuzzy c-Means (FCM) algorithm. They show a monotonic tendency as a function of c even when the value of m is varied: the PC and FS values become smaller as c increases, and the opposite occurs with PE. Although the literature presents extensive investigations of this tendency, they were conducted for low-dimensional data, in which dimensionality does not affect the clustering behavior. To investigate how these aspects affect the fuzzy clustering results of high-dimensional data, in this work we clustered objects of ten real high-dimensional data sets using FCM validated by PC, PE, FS, and some proposed modifications of them designed to deal with the monotonic tendency. The results showed that the Modified Partition Coefficient (MPC) is the most reliable index for validating fuzzy clustering of high-dimensional data.
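The abstract does not state the MPC formula. Assuming it refers to the commonly used normalization MPC = 1 - c/(c - 1) · (1 - PC), the sketch below shows why that rescaling removes the dependence of PC's lower bound on c: the fully fuzzy value PC = 1/c maps to 0 for every c, and a crisp partition still maps to 1. The function name is hypothetical.

```python
def modified_partition_coefficient(pc, c):
    """Rescale PC from the range [1/c, 1] to [0, 1] (assumed MPC definition)."""
    if c < 2:
        raise ValueError("MPC is only defined for c >= 2 clusters")
    return 1.0 - (c / (c - 1.0)) * (1.0 - pc)


# The lower bound of PC shrinks with c, but its MPC image stays at 0.
for c in (2, 3, 5, 10):
    fully_fuzzy = modified_partition_coefficient(1.0 / c, c)
    crisp = modified_partition_coefficient(1.0, c)
    print(c, round(1.0 / c, 3), round(abs(fully_fuzzy), 6), crisp)
```

Because both extremes are fixed regardless of c, MPC values obtained for different numbers of clusters can be compared on a common scale.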
- Research Article
4
- 10.9717/kmms.2013.16.7.810
- Jul 31, 2013
- Journal of Korea Multimedia Society
The FCM (Fuzzy C-Means) clustering algorithm is a typical partition-based clustering algorithm and has been applied successfully in various fields. Nonetheless, it has several problems: high sensitivity to noise and local data, a high likelihood of producing results that differ from intuitive ones, and the need to set the initial prototypes and the number of clusters. In this paper, we determine fuzzy intervals by projecting the FCM clustering result onto the data axis of each attribute, and we propose a system that applies these fuzzy intervals to an FDT (Fuzzy Decision Tree). This addresses two of the problems of the FCM algorithm: the high sensitivity to noise and data, and the high likelihood of results that differ from intuitive ones. We also compare the proposed model with the FCM clustering algorithm in experiments on real traffic and rainfall data. The experimental results indicate that the proposed model provides more stable results by reducing sensitivity to noise and data, and that it agrees with intuitive results more often than a system based on the FCM clustering algorithm.
- Research Article
17
- 10.11113/matematika.v24.n.536
- Dec 1, 2008
- Mathematika
In fuzzy C-means (FCM) clustering, each data point belongs to a cluster to a degree specified by a membership grade. FCM partitions a collection of vectors into c fuzzy groups and finds a cluster center in each group such that the dissimilarity measure is minimized. This paper presents a training algorithm for the radial basis function (RBF) network using the symmetry-based Fuzzy C-means (SFCM) clustering method, a modified version of FCM clustering based on a point symmetry distance measure. The training algorithm, which uses SFCM clustering to train the network, has a number of advantages over standard RBF networks, such as faster training time, more accurate predictions, and a reduced network architecture. The proposed training algorithm has been implemented in the RBF networks created by the newrb function of MATLAB, which uses a gradient-based iterative method as its learning strategy; the new network therefore undergoes a hybrid learning process. The resulting network, called the Symmetry-based Fuzzy C-means Clustering–Radial Basis Function Network (SFCM/RBF), has been tested in forecasting against the standard RBF network and against the standard Fuzzy C-means Clustering (FCM)-RBF network (FCM/RBF). The experimental models have been tested on three real-world application problems: an air pollutant problem, a Biochemical Oxygen Demand (BOD) problem, and a phytoplankton problem. Keywords: Fuzzy c-means clustering; SFCM; Radial basis function network; point symmetry distance; forecasting.
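The membership-and-center description in the first two sentences corresponds to the standard FCM alternating update. The sketch below is a minimal NumPy version of that standard algorithm with Euclidean distance, not the paper's SFCM variant or its RBF training procedure. The function name `fcm`, the fuzzifier default m = 2, the random initialization, and the two-blob toy data are all illustrative assumptions.

```python
import numpy as np


def fcm(x, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means: returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships, each row normalized to sum to 1.
    u = rng.random((x.shape[0], c))
    u /= u.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Center update: mean of the data weighted by memberships ** m.
        w = u ** m
        centers = (w.T @ x) / w.sum(axis=0)[:, None]
        # Euclidean distance from every point to every center.
        d = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)  # guard against division by zero
        # Membership update: u_ik proportional to d_ik ** (-2 / (m - 1)).
        inv = d ** (-2.0 / (m - 1.0))
        u_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(u_new - u).max() < tol
        u = u_new
        if converged:
            break
    return centers, u


# Toy data (assumption): two well-separated Gaussian blobs in 2-D.
data_rng = np.random.default_rng(1)
x = np.vstack([
    data_rng.normal(0.0, 0.3, size=(50, 2)),
    data_rng.normal(5.0, 0.3, size=(50, 2)),
])
centers, u = fcm(x, c=2)
labels = u.argmax(axis=1)  # hard labels, if a crisp assignment is needed
```

Each row of `u` holds the membership grades of one point and sums to 1; taking the largest grade per row recovers a crisp partition when one is required.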
- Conference Article
3
- 10.1109/iembs.2000.900556
- Jul 23, 2000
When quantifying the perfusion parameters such as cerebral blood flow (CBF) using dynamic susceptibility contrast-enhanced magnetic resonance imaging (DSC-MRI), the arterial input function (AIF) of contrast agent has to be determined. In this study, we developed a method for obtaining the AIF automatically using fuzzy c-means (FCM) clustering. First, a mask region of interest (ROI) was drawn around the internal carotid artery. Second, FCM clustering was applied to the data in this ROI and the cluster centroids were calculated. The cluster centroid with the highest maximum concentration, earliest maximum concentration and smallest FWHM of the time-concentration curve (TCC) was determined as the arterial pixels and the AIF was obtained from the mean TCC in these pixels. We applied this method to six subjects and compared it with a manual ROI method. The difference between the CBF values calculated using the AIF obtained by FCM clustering [CBF(fuzzy)] and that obtained by the manual ROI method [CBF(manual)] ranged from 0.92% to 122% [38.6 ± 37.7% (mean ± SD)]. The CBF(manual) values were generally overestimated compared with the CBF(fuzzy) values, while the CBF(fuzzy) values became closer to the CBF values found in the literature. In conclusion, FCM clustering appears to be promising for determination of AIF, because it allows automatic, rapid and accurate extraction of arterial pixels.
- Research Article
13
- 10.1109/access.2020.3030083
- Jan 1, 2020
- IEEE Access
Multiple SVR based on ensemble learning can improve modeling performance, but that performance depends closely on the initial condition of the partitioning method and is easily affected by noise and outliers. In this study, a multi-linear fuzzy support vector regression (MFSVR) robust to noise is proposed with the aid of a composite kernel function and $\varepsilon$-fuzzy c-means ($\varepsilon$-FCM) clustering based on insensitive data information. Here, insensitive data information stands for the interval data information of "$\varepsilon$", the insensitive loss parameter used in the $\varepsilon$-insensitive loss function. The objective of this study is to reduce the effect of noise and to alleviate the overfitting problem through the synergistic effect of the following methods. First, $\varepsilon$-FCM clustering based on insensitive data information is used to give more weight to the decision boundary and to reduce the effect of noise. Second, a composite kernel based on multiple linear kernel expressions is proposed to implement a multi-linear decision boundary that alleviates the overfitting problem. In more detail, each training data point is assigned corresponding membership degrees in the $\varepsilon$-FCM clustering. Data that are potentially noise or outliers are assigned lower membership degrees and make a small contribution (compensation) in the composite kernel function. The composite kernel function for multiple local SVRs is then constructed according to the distribution characteristics of the $\varepsilon$-FCM clustering. The proposed MFSVR is tested on both synthetic and UCI data sets to verify its effectiveness and its performance improvement. Experimental results demonstrate that the proposed method performs better than some other methods studied so far.
- Conference Article
8
- 10.1109/foci.2007.371510
- Apr 1, 2007
A generalized fuzzy c-means (FCM) clustering is proposed by modifying the standard FCM objective function and introducing some simplifications. FCM clustering results in very fuzzy partitions for data points that are far from all cluster centroids. This property distinguishes FCM from Gaussian mixture models or entropy-based clustering. The generalized FCM clustering aims at aggregating standard FCM and entropy-based FCM so that the generalized algorithm has the two distinctive properties for data points that are far from all centroids and for those that are close to any centroid. k-Harmonic means clustering is reviewed from the viewpoint of FCM clustering. Graphical comparisons of the four classification functions are presented.
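The remark about very fuzzy partitions far from all centroids can be checked directly with the standard FCM membership formula, in which memberships are proportional to d^(-2/(m-1)). The sketch below uses that standard formula, not the generalized variant proposed in the paper; the function name, the three centroids, and the two query points are hypothetical.

```python
def fcm_memberships(point, centers, m=2.0):
    """Standard FCM memberships of one point, given fixed cluster centers.

    Assumes the point does not coincide exactly with any center.
    """
    # Squared Euclidean distance to each center.
    d2 = [sum((p - q) ** 2 for p, q in zip(point, ctr)) for ctr in centers]
    # d2 ** (-1 / (m - 1)) equals d ** (-2 / (m - 1)).
    inv = [v ** (-1.0 / (m - 1.0)) for v in d2]
    total = sum(inv)
    return [v / total for v in inv]


centers = [(0.0, 0.0), (4.0, 0.0), (2.0, 3.0)]  # illustrative centroids
near = fcm_memberships((0.1, 0.0), centers)
far = fcm_memberships((1000.0, 1000.0), centers)
print([round(v, 3) for v in near])  # dominated by the first cluster
print([round(v, 3) for v in far])   # all close to 1/3
```

As the query point moves away from every centroid, the distance ratios tend to 1 and each membership tends to 1/c, which is the behavior the abstract contrasts with Gaussian mixture models and entropy-based clustering.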
- Research Article
67
- 10.1007/s10044-019-00783-6
- Mar 6, 2019
- Pattern Analysis and Applications
Data distribution has a significant impact on clustering results. This study focuses on the effect of cluster size distribution on clustering, namely the uniform effect of k-means and fuzzy c-means (FCM) clustering. We first review related work on k-means and FCM clustering. Then, a structure decomposition analysis of the objective functions of k-means and FCM is presented. Afterward, extensive experiments are conducted on synthetic two-dimensional and three-dimensional data sets and on real-world data sets from the UCI machine learning repository. The results demonstrate that FCM has a stronger uniform effect than k-means clustering. They also reveal that the fuzzifier value m = 2 in FCM, which has been widely adopted in many applications, is not a good choice, particularly for data sets with great variation in cluster sizes. Therefore, for data sets with significantly uneven distributions of cluster sizes, a smaller fuzzifier value is preferred for FCM clustering, and k-means clustering is a better choice than FCM clustering.
- Conference Article
9
- 10.1109/pcspa.2010.245
- Sep 1, 2010
The fuzzy c-means (FCM) clustering method was applied to the neutron/gamma discrimination of pulses from a liquid scintillator. An experimental setup, termed the portable real-time n/γ discriminator, with a BC-501A liquid scintillator detector was used to collect waveforms with a 500 MS/s, 12-bit sampling ADC. FCM clustering and PGA were each applied to the same pulse dataset, and the results were compared. Compared with PGA, FCM clustering decreased the uncertainty and thus improved the discrimination performance. Implementing FCM clustering in digital devices also reduced the cost and simplified the algorithm.
- Conference Article
6
- 10.1109/isriti.2018.8864459
- Nov 1, 2018
- 2018 International Seminar on Research of Information Technology and Intelligent Systems (ISRITI)
Clustering classifies data into groups based on the similarity of each data element. To validate the clusters, a cluster validity index is introduced. The hybrid SC-FCM (Subtractive Clustering-Fuzzy C-Means) clustering method is a clustering technique designed to overcome the weaknesses of FCM (Fuzzy C-Means) clustering. Although hybrid SC-FCM is a promising method, no validity measurement of the resulting clusters has been done. This research measures the cluster validity indices of the hybrid SC-FCM method. The cluster validity indices used in the research are the partition coefficient, the partition entropy, and the Xie-Beni index. The research shows mixed results. Even though the hybrid SC-FCM method fails to find the best number of clusters as suggested, the results show that hybrid SC-FCM is able to outperform the traditional FCM method in providing initial centroids.
- Research Article
8
- 10.1080/01969729208927483
- Nov 1, 1992
- Cybernetics and Systems
In this paper the existence and strong consistency of a class of fuzzy c-means (FCM) clustering procedures are established. Suppose that the data set is a simple random sample of observations from a probability distribution, assuming that the second moment of the distribution is finite. Then the solution to the FCM clustering procedures shall exist. The FCM cluster centers and the FCM membership functions will have the property of strong consistency. That is, the sample FCM cluster centers and the population FCM cluster centers will be close to each other with probability one, and also the sample FCM membership functions and the population FCM membership functions will be close to each other with probability one when the sample size increases to infinity.