APPLICATION OF DATA MINING FOR MAPPING DISASTER-PRONE AREAS AS A DISASTER PREPAREDNESS EFFORT
Disasters have a major impact on several sectors, such as infrastructure, manufacturing, tourism and transportation. One way to improve disaster preparedness is to implement preventive measures, which can be planned by identifying the disasters that have occurred in each area from past data. This study aims to map areas affected by disasters to facilitate disaster preparedness programs. The data used in this research cover areas of West Java affected by disasters from January to October 2022. The disaster types considered are floods, landslides, abrasion, tornadoes, droughts, fires, earthquakes and tsunamis. The research uses a data mining technique, namely clustering, with the K-means algorithm. The clustering process was run several times with different numbers of clusters, and the quality of the results was compared using the Within Cluster Sum of Squares (WSS). The best WSS value, 89.8%, is obtained when the number of clusters k is 5. This research is expected to serve as a reference for disaster preparedness. It also produced disaster cluster maps in which each cluster has different characteristics or types of disaster.
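The comparison across cluster counts described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy rows are hypothetical, the K-means routine is a plain Lloyd iteration, and reading the reported percentage as the share of the total sum of squares that is not left inside clusters is an assumption, since the abstract does not define how the 89.8% was computed.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd iterations; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k input rows as starting centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        updated = [centroid(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

def wss(centroids, clusters):
    """Within-cluster sum of squares for a given partition."""
    return sum(sq_dist(p, c) for c, members in zip(centroids, clusters) for p in members)

def total_ss(points):
    """Sum of squared distances to the grand mean (the k = 1 case)."""
    g = centroid(points)
    return sum(sq_dist(p, g) for p in points)

def between_share(points, k):
    """Percentage of the total sum of squares not left inside clusters (assumed metric)."""
    c, cl = kmeans(points, k)
    return 100.0 * (1.0 - wss(c, cl) / total_ss(points))

# Hypothetical rows: (flood count, landslide count) per district.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
for k in range(1, 5):
    print(k, round(between_share(toy, k), 1))
```

Because Lloyd iterations can stop in a local optimum, repeating the run with several seeds and keeping the lowest WSS for each k is a common precaution.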
- Conference Article
21
- 10.1063/5.0108926
- Jan 1, 2022
- AIP conference proceedings
Grouping students by potential helps determine their interests and improve their academic performance. The K-Means algorithm can perform this grouping, or clustering. This study aims to implement one Machine Learning algorithm, K-Means, to group the 2019 batch of Informatics Engineering students at the Universitas Muhammadiyah Purwokerto by potential area of interest. The grouping was based on average grades in courses belonging to the student specializations, namely 1) Intelligent Systems (IS), 2) Software Engineering (SE), 3) Computer Networks (CN), and 4) Multimedia (MM), as well as the students' GPA data (semester 1 to semester 4). Moreover, this research uses the Elbow method to determine the optimal number of clusters and the Sum of Squared Errors (SSE) as a cluster validation technique. In the Elbow process, the Within Cluster Sum of Squares (WCSS) decreases significantly when K increases from 2 to 3, where the SSE reaches its maximum rate of change of 71.29 %. Therefore, the optimal number of clusters is 3. In the K-Means clustering results, the largest share of students (62 or 41.05 %) is assigned to the Intelligent Systems group and the second largest (59 or 39.07 %) to the Multimedia group, while the Computer Networks cluster has the fewest members.
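One way to read an elbow from SSE values is to take the K at which the relative drop in SSE is largest. The sketch below assumes that reading; the abstract does not give the exact formula behind its rate of change, and the SSE values used here are hypothetical, not the study's.

```python
def sse_rate_of_change(sse_by_k):
    """Percent decrease in SSE when moving from the previous K to each K."""
    ks = sorted(sse_by_k)
    return {
        k: 100.0 * (sse_by_k[prev] - sse_by_k[k]) / sse_by_k[prev]
        for prev, k in zip(ks, ks[1:])
    }

def elbow_k(sse_by_k):
    """K reached by the largest relative drop in SSE (assumed elbow rule)."""
    rates = sse_rate_of_change(sse_by_k)
    return max(rates, key=rates.get)

# Hypothetical SSE values for K = 1..5.
example = {1: 1000, 2: 800, 3: 240, 4: 200, 5: 180}
print(sse_rate_of_change(example))
print(elbow_k(example))
```

Other elbow rules exist, for example the point of maximum curvature, and they can select a different K on the same curve.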
- Book Chapter
21
- 10.1007/978-3-319-23219-5_39
- Jan 1, 2015
The Within-Cluster Sum of Squares (WCSS) is the most used criterion in cluster analysis. Optimizing this criterion has been proved to be NP-Hard and has been studied by different communities. On the other hand, Constrained Clustering, which allows prior user knowledge to be integrated into the clustering process, has received much attention over the last decade. As far as we know, there is a single approach that aims at finding the optimal solution for the WCSS criterion and that integrates different kinds of user constraints. This method is based on integer linear programming and column generation. In this paper, we propose a global optimization constraint for this criterion and develop a filtering algorithm. It is integrated in our general and declarative Constraint Programming framework for Constrained Clustering. Experiments on classic datasets show that our approach outperforms the exact approach based on integer linear programming and column generation.
- Research Article
- 10.1093/ecco-jcc/jjae190.0027
- Jan 22, 2025
- Journal of Crohn's and Colitis
Background: Identifying molecular subtypes of IBD is essential to address inconsistencies in gene expression-based classifications, clinical variability, and treatment responses in Crohn's Disease (CD) and Ulcerative Colitis (UC). Building on prior efforts using different methods and datasets, this study aimed to derive and validate IBD subtypes using transcriptomics data and unsupervised machine learning. Methods: This study analysed RNA-sequenced data from inflamed and non-inflamed intestinal biopsies of 2,490 adult IBD patients. K-means clustering, guided by Within Cluster Sum of Squares (WCSS) to determine the optimal ‘K’, identified subtypes within the dataset. Distinct clusters for UC and CD were derived from gene expression, with gene set enrichment and network analysis characterizing their features. Statistical tests (Chi-squared and ANOVA) linked these clusters to clinical data for UC and CD. Results: K-means clustering revealed three distinct clusters in UC and CD, whose significant association with IBD severity (UC: p = 0.000263; CD: p = 0.007006) and IBD region (p < 0.000001) was determined by Chi-squared test. ANOVA showed age significantly influenced UC clusters (p = 0.0000345) but not CD clusters (p = 0.285). In UC, Cluster 1 focused on RNA processing, DNA repair, and rapid cell turnover, with upregulation of EXOSC genes and other related genes. Cluster 2 highlighted autophagy, stress response, and signaling processes, with upregulated expression of ATG13, VPS37C, and DVL2. Cluster 3 emphasized cytoskeletal stability over metabolic activity, marked by the upregulation of SRF, SRC, and ABL1. Notably, all UC clusters demonstrated upregulation of COX1, TMSB10, and ACTB. In CD, Cluster 1 was defined by cytoskeletal dynamics and reduced protein synthesis, with upregulated expression of CFL1, F11R, and RAD23A.
Cluster 2 exhibited increased protein synthesis and stress response pathways, associated with aggressive disease phenotypes, with upregulation of MTREX, SART3, and GTF3C3. Cluster 3 prioritised cytoskeletal organisation over metabolism, featuring upregulation of TESK1, ABL1, and DVL2, along with other genes. Across all CD clusters, COX1, CDH1, and SF3B1 were consistently upregulated. Conclusion: Despite certain limitations, this study categorizes UC and CD into three transcriptomics-based subtypes, identifying meaningful IBD-associated patterns, key subtyping genes, and insights into the disease's complex pathogenesis. These findings may advance new therapeutic strategies and personalized medicine for patients with distinct IBD subtypes.
- Conference Article
1
- 10.1145/3325917.3325932
- Apr 6, 2019
In this paper, we describe a clustering analysis on 77 distinct brain protein expression levels of trisomic and control mice. Hierarchical clustering based on Euclidean distance results in clusters that partially coincide with experimental treatment groups of mice, as shown in dendrogram results. Normalization results in decreased within- and between-cluster sum of squares and a decreased ratio of between- to within-cluster sum of squares. The optimal number of clusters ranges from 1 to 4 clusters as determined by the gap statistic method or direct methods of the silhouette width or the elbow of total within-cluster sum of squares. Principal components analysis shows separation of clustered groups generated by k-means clustering. When clustered groups are plotted against the first two principal components, more distinct clusters are generated after z-score normalization of protein expression levels, compared to non-normalized results.
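The effect of z-score normalization on sums of squares can be illustrated with a small sketch. The rows below are hypothetical, not the protein expression data, and the helper names are introduced only for this example. After standardizing with the population standard deviation, every feature contributes the same total sum of squares (the number of rows), so a feature measured on a large scale no longer dominates Euclidean distances.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardize each column to mean 0 and population standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]  # assumes no constant column (sd > 0)
    return [tuple((x - m) / s for x, m, s in zip(row, mus, sds)) for row in rows]

def column_ss(rows):
    """Sum of squared deviations from the column mean, one value per column."""
    cols = list(zip(*rows))
    return [sum((x - mean(c)) ** 2 for x in c) for c in cols]

# Hypothetical measurements: the second feature is on a much larger scale.
raw = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 2000.0), (4.0, 6000.0)]
print(column_ss(raw))                  # second column dominates
print(column_ss(zscore_columns(raw)))  # both columns are close to len(raw)
```

Since within-cluster and between-cluster sums of squares add up to this total, both shrink when large-scale features are standardized, which fits the decrease reported above.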
- Research Article
2
- 10.3390/jlpea15020021
- Apr 9, 2025
- Journal of Low Power Electronics and Applications
Approximate computation has emerged as a promising alternative to accurate computation, particularly for applications that can tolerate some degree of error without significant degradation of the output quality. This work analyzes the application of approximate computing for machine learning, specifically focusing on k-means clustering, one of the more widely used unsupervised machine learning algorithms. The k-means algorithm partitions data into k clusters, where k also denotes the number of centroids, with each centroid representing the center of a cluster. The clustering process involves assigning each data point to the nearest centroid by minimizing the within-cluster sum of squares (WCSS), a key metric used to evaluate clustering quality. A lower WCSS value signifies better clustering. Conventionally, WCSS is computed with high precision using an accurate adder. In this paper, we investigate the impact of employing various approximate adders for WCSS computation and compare their results against those obtained with an accurate adder. Further, we propose a new approximate adder (NAA) in this paper. To assess its effectiveness, we utilize it for the k-means clustering of some publicly available artificial datasets with varying levels of complexity, and compare its performance with the accurate adder and many other approximate adders. The experimental results confirm the efficacy of NAA in clustering, as NAA yields WCSS values that closely match or are identical to those obtained using the accurate adder. We also implemented hardware designs of accurate and approximate adders using a 28 nm CMOS standard cell library. The design metrics estimated show that NAA achieves a 37% reduction in delay, a 22% reduction in area, and a 31% reduction in power compared to the accurate adder. In terms of the power-delay product that serves as a representative metric for energy efficiency, NAA reports a 57% reduction compared to the accurate adder. 
In terms of the area-delay product that serves as a representative metric for design efficiency, NAA reports a 51% reduction compared to the accurate adder. NAA also outperforms several existing approximate adders in terms of design metrics while preserving clustering effectiveness.
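To make the idea of accumulating WCSS with an approximate adder concrete, here is a software sketch. It does not implement the proposed NAA, whose design the abstract does not describe. It uses a lower-part OR adder as a stand-in, with hypothetical integer squared distances and function names chosen for this example.

```python
def loa_add(a, b, low_bits):
    """Lower-part OR adder for non-negative ints: OR the low bits, add the rest exactly."""
    if low_bits == 0:
        return a + b  # no approximate part: ordinary addition
    mask = (1 << low_bits) - 1
    low = (a | b) & mask
    # Carry into the exact upper part when both low parts have their top bit set.
    carry = (a >> (low_bits - 1)) & (b >> (low_bits - 1)) & 1
    high = (a >> low_bits) + (b >> low_bits) + carry
    return (high << low_bits) | low

def accumulate(values, low_bits):
    """Sum a list of non-negative ints with the approximate adder."""
    total = 0
    for v in values:
        total = loa_add(total, v, low_bits)
    return total

# Hypothetical integer squared distances of points to their assigned centroids.
squared_distances = [25, 9, 16, 4, 49, 36]
print(sum(squared_distances), accumulate(squared_distances, 2))
```

For this stand-in adder each single addition is off by at most 2^(low_bits - 1), so widening the approximate part trades accuracy of the accumulated WCSS for simpler hardware.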
- Conference Article
6
- 10.1109/icdsba.2018.00022
- Sep 1, 2018
In this paper, the selection of the number of clusters in the K-means clustering algorithm is studied. Based on Bootstrap sampling, a new method is proposed to determine the best number of clusters by comparing the actual value of the total within-cluster sum of squares with its estimated interval. Experiments on data sets from the UCI Machine Learning Repository and on randomly generated artificial test data sets show that the method clearly improves clustering and can avoid the local optima caused by an unreasonable choice of the number of clusters.
- Research Article
60
- 10.1186/1471-2105-9-462
- Oct 29, 2008
- BMC Bioinformatics
Background: Inferring cluster structure in microarray datasets is a fundamental task for the so-called -omic sciences. It is also a fundamental question in Statistics, Data Analysis and Classification, in particular with regard to the prediction of the number of clusters in a dataset, usually established via internal validation measures. Despite the wealth of internal measures available in the literature, new ones have been recently proposed, some of them specifically for microarray data. Results: We consider five such measures: Clest, Consensus (Consensus Clustering), FOM (Figure of Merit), Gap (Gap Statistics) and ME (Model Explorer), in addition to the classic WCSS (Within Cluster Sum-of-Squares) and KL (Krzanowski and Lai index). We perform extensive experiments on six benchmark microarray datasets, using both Hierarchical and K-means clustering algorithms, and we provide an analysis assessing both the intrinsic ability of a measure to predict the correct number of clusters in a dataset and its merit relative to the other measures. We pay particular attention both to precision and speed. Moreover, we also provide various fast approximation algorithms for the computation of Gap, FOM and WCSS. The main result is a hierarchy of those measures in terms of precision and speed, highlighting some of their merits and limitations not reported before in the literature. Conclusion: Based on our analysis, we draw several conclusions for the use of those internal measures on microarray data. We report the main ones. Consensus is by far the best performer in terms of predictive power and remarkably algorithm-independent. Unfortunately, on large datasets, it may be of no use because of its non-trivial computer time demand (weeks on a state of the art PC). FOM is the second best performer although, quite surprisingly, it may not be competitive in this scenario: it has essentially the same predictive power of WCSS but it is from 6 to 100 times slower in time, depending on the dataset.
The approximation algorithms for the computation of FOM, Gap and WCSS perform very well, i.e., they are faster while still granting a very close approximation of FOM and WCSS. The approximation algorithm for the computation of Gap deserves to be singled out: its predictive power is far better than that of Gap and competitive with the other measures, yet it is at least two orders of magnitude faster than Gap. Another important novel conclusion that can be drawn from our analysis is that all the measures we have considered show severe limitations on large datasets, either due to computational demand (Consensus, as already mentioned, Clest and Gap) or to lack of precision (all of the other measures, including their approximations). The software and datasets are available under the GNU GPL on the supplementary material web page.
- Research Article
79
- 10.1007/s11336-007-9013-4
- Dec 1, 2007
- Psychometrika
Perhaps the most common criterion for partitioning a data set is the minimization of the within-cluster sums of squared deviation from cluster centroids. Although optimal solution procedures for within-cluster sums of squares (WCSS) partitioning are computationally feasible for small data sets, heuristic procedures are required for most practical applications in the behavioral sciences. We compared the performances of nine prominent heuristic procedures for WCSS partitioning across 324 simulated data sets representative of a broad spectrum of test conditions. Performance comparisons focused on both percentage deviation from the “best-found” WCSS values, as well as recovery of true cluster structure. A real-coded genetic algorithm and variable neighborhood search heuristic were the most effective methods; however, a straightforward two-stage heuristic algorithm, HK-means, also yielded exceptional performance. A follow-up experiment using 13 empirical data sets from the clustering literature generally supported the results of the experiment using simulated data. Our findings have important implications for behavioral science researchers, whose theoretical conclusions could be adversely affected by poor algorithmic performances.
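For reference, the criterion described in the first sentence is conventionally written as follows. The notation (clusters C_j, centroids mu_j, observations x_i) is introduced here for illustration and is not taken from the paper.

```latex
\mathrm{WCSS}(C_1,\dots,C_k)
  = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2},
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i .
```

Partitioning heuristics such as those compared above search over assignments of observations to the k clusters for the one that makes this sum as small as possible.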
- Research Article
5
- 10.1080/03610928208828301
- Jan 1, 1982
- Communications in Statistics - Theory and Methods
A bounded region in R^2 with a uniform density function defined over it is partitioned into k sub-regions such that the within cluster sum of squares is minimized. An asymptotic (k → ∞) lower bound for the within cluster sum of squares of this optimal k-means partition is obtained. This lower bound is useful in suggesting that the graph-configuration of the optimal k-partition would consist of regular hexagons of equal size when k is large enough. An empirical study illustrating these asymptotic properties of bivariate k-means clusters is also presented.
- Book Chapter
5
- 10.1007/11527503_25
- Jan 1, 2005
The minimum sum of squares clustering problem is a nonconvex program which possesses many locally optimal values, so its solution often falls into these traps. In this article, a recent metaheuristic technique, the noising method, is introduced to explore the proper clustering of data sets under the criterion of minimum sum of squares clustering. Meanwhile, the K-means algorithm is integrated into the noising method as a local improvement operation to improve the performance of the clustering algorithm. Extensive computer simulations show that the proposed approach is feasible and effective.
- Research Article
- 10.1016/j.jkss.2009.04.003
- May 7, 2009
- Journal of the Korean Statistical Society
D-optimality criterion for weighting variables in K-means clustering
- Research Article
3
- 10.55606/juisik.v3i1.417
- Mar 17, 2023
- Jurnal ilmiah Sistem Informasi dan Ilmu Komputer
Natural disasters are events that significantly affect the human population. Landslides, earthquakes, floods, fires, droughts and other natural disasters often occur in West Java Province. Information technology is developing quite fast nowadays, and thanks to modern technology anyone can access and obtain information without restrictions. Information is very important for every aspect of life, including information about natural disasters, because disaster management depends on it. Data mining is a popular method for analyzing disaster data because it is considered a potential answer to disaster management challenges. Therefore, this study discusses the grouping of natural disaster areas in West Java, for the prediction of disaster-prone areas, with data mining techniques using the k-means clustering algorithm. The study obtained 3 clusters: low, medium, and high. The research data come from the official West Java Open Data website. The results of this research are expected to provide useful information in determining solutions to natural disaster management problems.
- Research Article
15
- 10.25165/j.ijabe.20171006.2537
- Jan 1, 2017
- International Journal of Agricultural and Biological Engineering
To determine the influence of agricultural meteorological disasters on agriculture in Heilongjiang Province, the disaster areas associated with different types of disasters and their variation characteristics were analyzed based on the statistical data of agricultural disasters from 1983 to 2013 in Heilongjiang Province, China. The moving average and the Mann-Kendall test were applied to identify the variation trends of drought, flooding, hailstorms and freezing (based on the disaster ratio and the disaster intensity index). Then, the Morlet wavelet analysis method was used to identify the periodicity of these four kinds of agricultural meteorological disasters. Finally, a fuzzy comprehensive evaluation method was adopted to analyze the degrees of agricultural loss induced by these disasters. The following results were obtained: 1) The disaster ratio and disaster intensity index for drought exhibited increasing trends; the disaster ratio and disaster intensity index for flooding exhibited decreasing trends; for hailstorms, the disaster ratio exhibited no obvious trend of change, whereas the disaster intensity index exhibited an increasing trend; and for freezing, the disaster ratio also exhibited no obvious trend of change, whereas the disaster intensity index exhibited a decreasing trend. 2) Mutation points were observed in the disaster ratio series for drought, flooding and hailstorms, whereas no mutation point was evident in the disaster ratio series for freezing. 3) Multiple time-scale characteristics were observed in the disaster ratio series for all four types of agricultural meteorological disasters. Furthermore, the disaster ratio series for the different types of disasters had different main periodicities. 4) From the perspective of the degree of agricultural loss induced by each type of disaster, drought was identified as the most severe type of agricultural meteorological disaster, followed by flooding, freezing, and hailstorms. 
The degree of agricultural loss caused by each type of disaster was different during different periods. Finally, based on the results, several strategies were identified for mitigating the effect of agricultural meteorological disasters in Heilongjiang Province. Keywords: agricultural meteorological disaster, disaster risk assessment, disaster ratio, disaster intensity index, fuzzy comprehensive evaluation, agricultural loss DOI: 10.25165/j.ijabe.20171006.2537 Citation: Xing Z X, Yang Z R, Fu Q, Li H, Gong X L, Wu J Y. Characteristics and risk assessment of agricultural meteorological disasters based on 30 years’ disaster data from Heilongjiang Province of China. Int J Agric & Biol Eng, 2017; 10(6): 144–154.
- Research Article
9
- 10.25165/ijabe.v10i6.2537
- Nov 30, 2017
- International Journal of Agricultural and Biological Engineering
- Conference Article
- 10.1183/13993003.congress-2020.2110
- Sep 7, 2020
Introduction: Traditionally apnoeas and hypopnoeas have been defined using respiratory flow, but relying on flow alone cannot identify false-positive events that would over-diagnose sleep apnoea. Aim: To confirm our clinical impression that false-positive flow-determined events can be confirmed and more precisely described by using cluster analysis, an unsupervised form of machine learning. Methods: Traditional flow-based apnoea hypopnoea indices (AHI) and validated oximetry-based AHI estimates (ODI) were obtained from the automated scoring of 1000 sleep polygraphs submitted for interpretation. K-means clustering was performed on the paired standardized indices. The within-cluster sum of squares (wss) was plotted against the number of potential clusters to select an optimal number of clusters. The clusters for the optimal number were then plotted. Results: A bend in the plot of the wss versus the number of potential clusters suggested the optimal number of clusters was 4. The two clusters of lower and midrange AHI were not segregated by ODI. Higher AHI subjects were divided into two clusters roughly midway through the range of ODI values. Clinically, subjects in the cluster with higher AHI but lower ODI often have low amplitude baseline nasal pressure, intermittent mouth breathing, or other technical errors. Conclusion: Clinical observation and unsupervised machine learning both confirm that separate oximetry-based scoring can identify false-positive flow-based scoring of respiratory events to prevent the over-diagnosis of sleep apnoea.