Selection of the number of clusters in functional data analysis
Identifying the number K of clusters in a dataset is one of the most difficult problems in cluster analysis. A choice of K that correctly characterizes the features of the data is essential for building meaningful clusters. In this paper we tackle the problem of estimating the number of clusters in functional data analysis by introducing a new measure that can be used with different procedures for selecting the optimal K. The main idea is to combine two test statistics, which measure the lack of parallelism and the mean distance between curves, to compute criteria such as the within- and between-cluster sums of squares. Simulations in challenging scenarios suggest that procedures using this measure detect the correct number of clusters more frequently than existing methods in the literature. The application of the proposed method is illustrated on several real datasets.
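As a minimal illustration of the two building blocks named in this abstract, the within- and between-cluster sums of squares can be computed directly from a set of points and their cluster assignments. This is a plain-Python sketch on synthetic 1-D data; the function and variable names are ours, not the paper's:

```python
# Sketch: within- (WCSS) and between-cluster (BCSS) sums of squares,
# the quantities that K-selection criteria are typically built on.
# Synthetic 1-D data; not the paper's functional-data setting.

def wcss_bcss(points, labels):
    """Return (WCSS, BCSS) for 1-D points with integer cluster labels."""
    n = len(points)
    overall = sum(points) / n
    clusters = {}
    for x, k in zip(points, labels):
        clusters.setdefault(k, []).append(x)
    wcss = bcss = 0.0
    for members in clusters.values():
        centroid = sum(members) / len(members)
        wcss += sum((x - centroid) ** 2 for x in members)   # spread inside clusters
        bcss += len(members) * (centroid - overall) ** 2    # spread between clusters
    return wcss, bcss

pts = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
labels = [0, 0, 0, 1, 1, 1]
w, b = wcss_bcss(pts, labels)
```

For any partition, WCSS + BCSS equals the total sum of squares about the overall mean, which is why criteria based on either quantity trade off against each other as K grows.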
- Research Article
6687
- 10.1080/03610927408827101
- Jan 1, 1974
- Communications in Statistics - Theory and Methods
A method for identifying clusters of points in a multidimensional Euclidean space is described and its application to taxonomy considered. It reconciles, in a sense, two different approaches to the investigation of the spatial relationships between the points, viz., the agglomerative and the divisive methods. A graph, the shortest dendrite of Florek et al. (1951a), is constructed on a nearest-neighbour basis and then divided into clusters by applying the criterion of minimum within-cluster sum of squares. This procedure ensures an effective reduction of the number of possible splits. The method may be applied to a dichotomous division, but is perfectly suitable also for a global division into any number of clusters. An informal indicator of the best number of clusters is suggested. It is a variance ratio criterion giving some insight into the structure of the points. The method is illustrated by three examples, one of which is original. The results obtained by the dendrite method are compared with those...
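The variance ratio criterion suggested in this abstract (the criterion commonly known as the Calinski-Harabasz index) is the between-cluster to within-cluster variance ratio, adjusted for degrees of freedom. A small illustrative pure-Python version on made-up 1-D data; names are ours:

```python
# Sketch of the variance ratio criterion:
#   VRC = (BCSS / (k - 1)) / (WCSS / (n - k))
# Higher values indicate a better-separated partition.

def variance_ratio(points, labels):
    n = len(points)
    overall = sum(points) / n
    clusters = {}
    for x, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(x)
    k = len(clusters)
    wcss = bcss = 0.0
    for members in clusters.values():
        c = sum(members) / len(members)
        wcss += sum((x - c) ** 2 for x in members)
        bcss += len(members) * (c - overall) ** 2
    return (bcss / (k - 1)) / (wcss / (n - k))

# A well-separated 2-cluster partition scores far higher than a poor one.
good = variance_ratio([0.0, 0.1, 10.0, 10.1], [0, 0, 1, 1])
bad = variance_ratio([0.0, 0.1, 10.0, 10.1], [0, 1, 0, 1])
```

In practice one computes the criterion for a range of k and favours values of k with a high ratio.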
- Research Article
3
- 10.33365/jti.v17i1.2381
- Jan 1, 2023
- Jurnal Teknoinfo
Disasters have a major impact on several sectors, such as infrastructure, manufacturing, tourism and transportation. One way to improve disaster preparedness is to implement preventive measures, which can be planned by identifying the disasters in each area from past data. This study aims to map areas affected by disasters in order to facilitate disaster preparedness programs. The data used are the areas of West Java affected by disasters from January to October 2022. The disaster data used in this study are floods, landslides, abrasion, tornadoes, droughts, fires, earthquakes and tsunamis. The research uses a data mining technique, namely clustering, with the K-means algorithm. The clustering process was carried out several times to compare the quality of the grouping results, measured in this study by the Within Cluster Sum of Squares (WSS). The best WSS value, 89.8%, is obtained when the number of clusters k is 5. This research is expected to serve as a reference for disaster preparedness. It also produced disaster grouping maps, where each cluster has different characteristics or types of disaster.
- Conference Article
13
- 10.1109/eem.2013.6607329
- May 1, 2013
This study examines a set of methods that determine the optimal number of clusters in the electricity consumer segmentation procedure. For the purpose of clustering the load curves of the consumers, we involve two algorithms of different concept and complexity, namely the Minimum Variance Method (MVM) hierarchical agglomerative algorithm and the Fuzzy C-Means (FCM). A parametric analysis takes place in order to optimize the FCM's parameters. Apart from the two clustering algorithms, we introduce into the load profiling studies two other methods that provide indications of the number of clusters within a data sample, namely the Max-Min and the Chain-map methods. To assess the algorithms' effectiveness, we utilize the ratio of Within Cluster sum of squares to Between Cluster variation (WCBCR) adequacy measure and the Bayesian Information Criterion (BIC). We also propose an improved version of the WCBCR.
- Conference Article
20
- 10.1063/5.0108926
- Jan 1, 2022
- AIP conference proceedings
Prospective students are grouped to identify their interests and to improve their academic performance; this grouping, or clustering, can be performed with the K-Means algorithm. This study aims to implement one Machine Learning algorithm, K-Means, to classify the interests of the Informatics Engineering students of batch 2019 at Universitas Muhammadiyah Purwokerto. The categorization was based on average course grades in the student specializations, namely 1) Intelligent Systems (IS), 2) Software Engineering (SE), 3) Computer Networks (CN), and 4) Multimedia (MM), as well as the students' GPA data (semesters 1 to 4). Moreover, this research uses the Elbow method to determine the optimal number of clusters and the Sum of Squared Errors (SSE) as a cluster validation technique. In the Elbow process, the Within Cluster Sum of Squares (WCSS) decreases significantly when K increases from 2 to 3, where the maximum rate of change of the SSE is 71.29%; therefore, the optimal number of clusters is 3. In the K-Means clustering results, the majority of the students (62, or 41.05%) are assigned to the Intelligent Systems group and the second largest share (59, or 39.07%) to the Multimedia group, while the Computer Networks cluster has the fewest members.
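The Elbow procedure described in this abstract can be sketched as: run K-means for a range of K, record the WCSS for each, and pick the K where the decrease levels off. Below is a self-contained illustration with a minimal 1-D Lloyd iteration on synthetic data, not the study's actual pipeline:

```python
# Elbow-method sketch: WCSS curve over K using a tiny 1-D K-means.
# Deterministic initialisation and synthetic data, for illustration only.

def kmeans_wcss(points, k, iters=50):
    pts = sorted(points)
    # Spread initial centroids over the sorted data.
    centroids = [pts[i * (len(pts) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in pts:                      # assignment step
            j = min(range(k), key=lambda c: (x - centroids[c]) ** 2)
            groups[j].append(x)
        new = [sum(g) / len(g) if g else centroids[j]   # update step
               for j, g in enumerate(groups)]
        if new == centroids:               # converged
            break
        centroids = new
    return sum((x - centroids[j]) ** 2
               for j, g in enumerate(groups) for x in g)

# Three well-separated groups: WCSS drops sharply up to K = 3, then flattens.
data = [0.0, 0.2, 0.1, 5.0, 5.1, 4.9, 10.0, 10.2, 9.9]
curve = [kmeans_wcss(data, k) for k in range(1, 5)]
```

The "elbow" is read off the curve: the gain from K=3 to K=4 is tiny compared with the gain from K=2 to K=3, so K=3 is selected.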
- Research Article
11
- 10.1002/bimj.201400251
- Apr 13, 2016
- Biometrical Journal
In this work we propose the use of functional data analysis (FDA) to deal with a very large dataset of atmospheric aerosol size distribution resolved in both space and time. Data come from a mobile measurement platform in the town of Perugia (Central Italy). An OPC (Optical Particle Counter) is integrated on a cabin of the Minimetrò, an urban transportation system, that moves along a monorail on a line transect of the town. The OPC takes a sample of air every six seconds, counts the number of particles of urban aerosols with a diameter between 0.28 μm and 10 μm, and classifies such particles into 21 size bins according to their diameter. Here, we adopt a 2D functional data representation for each of the 21 spatiotemporal series. In fact, space is unidimensional since it is measured as the distance on the monorail from the base station of the Minimetrò. FDA allows for a reduction of the dimensionality of each dataset and accounts for the high space-time resolution of the data. Functional cluster analysis is then performed to search for similarities among the 21 size channels in terms of their spatiotemporal pattern. Results provide a good classification of the 21 size bins into a relatively small number of groups (between three and four) according to the season of the year. Groups including coarser particles have more similar patterns, while those including finer particles show more varied behavior across the periods of the year. Such features are consistent with the physics of atmospheric aerosols, and the highlighted patterns provide very useful ground for prospective model-based studies.
- Research Article
282
- 10.1080/03610910903168603
- Oct 1, 2009
- Communications in Statistics - Simulation and Computation
Functional data analysis (FDA)—the analysis of data that can be considered a set of observed continuous functions—is an increasingly common class of statistical analysis. One of the most widely used FDA methods is the cluster analysis of functional data; however, little work has been done to compare the performance of clustering methods on functional data. In this article, a simulation study compares the performance of four major hierarchical methods for clustering functional data. The simulated data varied in three ways: the nature of the signal functions (periodic, nonperiodic, or mixed), the amount of noise added to the signal functions, and the pattern of the true cluster sizes. The Rand index was used to compare the performance of each clustering method. As a secondary goal, the clustering methods were also compared when the number of clusters is misspecified. To illustrate the results, a real set of functional data for which the true clustering structure is believed to be known was clustered. Comparing the clustering methods on the real data set confirmed the findings of the simulation. This study yields concrete suggestions to help future researchers determine the best method for clustering their functional data.
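The Rand index used in this comparison is the fraction of point pairs on which two partitions agree: placed together in both, or separated in both. A small pure-Python sketch with illustrative labelings:

```python
# Rand index: pairwise agreement between two partitions of the same items.
from itertools import combinations

def rand_index(labels_a, labels_b):
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]   # pair together in partition A?
        same_b = labels_b[i] == labels_b[j]   # pair together in partition B?
        agree += same_a == same_b             # count agreeing pairs
        total += 1
    return agree / total

# Identical partitions score 1; the label names themselves do not matter.
perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

Because it only compares pair relations, the index is invariant to relabeling of clusters, which is exactly what is needed when comparing a clustering result against a known ground truth.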
- Research Article
5
- 10.1080/03610928208828301
- Jan 1, 1982
- Communications in Statistics - Theory and Methods
A bounded region in R2 with a uniform density function defined over it is partitioned into k sub-regions such that the within-cluster sum of squares is minimized. An asymptotic (k → ∞) lower bound for the within-cluster sum of squares of this optimal k-means partition is obtained. This lower bound is useful in suggesting that the graph-configuration of the optimal k-partition would consist of regular hexagons of equal size when k is large enough. An empirical study illustrating these asymptotic properties of bivariate k-means clustering is also presented.
- Book Chapter
5
- 10.1007/11527503_25
- Jan 1, 2005
The minimum sum of squares clustering problem is a nonconvex program with many local optima, so solutions often fall into these traps. In this article, a recent metaheuristic technique, the noising method, is introduced to explore the proper clustering of data sets under the minimum sum of squares criterion. The K-means algorithm is integrated into the noising method as a local improvement operation to improve the performance of the clustering algorithm. Extensive computer simulations show that the proposed approach is feasible and effective.
- Book Chapter
21
- 10.1007/978-3-319-23219-5_39
- Jan 1, 2015
The Within-Cluster Sum of Squares (WCSS) is the most widely used criterion in cluster analysis. Optimizing this criterion is proved to be NP-hard and has been studied by different communities. On the other hand, Constrained Clustering, which allows prior user knowledge to be integrated into the clustering process, has received much attention in the last decade. As far as we know, there is a single approach that aims at finding the optimal solution for the WCSS criterion while integrating different kinds of user constraints; it is based on integer linear programming and column generation. In this paper, we propose a global optimization constraint for this criterion and develop a filtering algorithm, integrated into our general and declarative Constraint Programming framework for Constrained Clustering. Experiments on classic datasets show that our approach outperforms the exact approach based on integer linear programming and column generation.
- Research Article
2
- 10.18805/ag.d-5753
- Jun 10, 2023
- Agricultural Science Digest - A Research Journal
Background: Snake gourd is a monoecious crop that prefers cross-pollination and has considerable potential for genetic improvement: large variation can be produced when genetically diverse and geographically distant lines are combined. To examine the genetic diversity and the relationships between essential agronomic traits in snake gourd, multivariate methods such as principal component analysis and cluster analysis were used. Methods: Sixteen genotypes and two varieties of snake gourd were subjected to boxplot analysis, principal component analysis and cluster analysis based on eleven quantitative traits, performed using R version 4.2.1. Result: Boxplot analysis depicted the frequency distribution of the eleven quantitative traits among the 18 snake gourd accessions. The overall variation was split into eleven principal components, of which five major principal components accounted for 90.05 per cent of the variability among the snake gourd genotypes. The squared cosines of the variables indicated that the traits days to first male flowering, days to first female flowering and days to first harvest contributed most to the variability in the first component. The Ward D2 method of hierarchical clustering grouped the 16 genotypes and 2 varieties into two clusters based on the within-cluster sum of squares.
- Conference Article
1
- 10.1063/1.4982842
- Jan 1, 2017
- AIP conference proceedings
This research brings together two important fields of study: functional data analysis and extreme value theory. The aims of this study are to convert extreme rainfall data into functions and to propose a methodological development for functional data analysis in extreme value theory. Functional data analysis is a technique for converting or transforming discrete observations into functions or curves. Daily rainfall data for Petaling Jaya from 1974 to 2003 are analysed with the block maxima approach of extreme value theory in order to apply the functional data analysis technique. Least squares and the Akaike information criterion are used to smooth the Fourier basis representation of the functional data. Descriptive statistics representing the behaviour of the rainfall data for Petaling Jaya are analysed and illustrated by a smooth curve from functional data analysis. This study indicates that rainfall rates in Petaling Jaya have increased in the recent period relative to the earlier part of the 30-year record, in both mean and standard deviation.
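The smoothing step described here, fitting a Fourier basis by least squares, can be sketched in a few lines. On an evenly spaced grid covering one period the Fourier basis functions are orthogonal, so the least-squares coefficients reduce to scaled inner products. The data below are synthetic and the function names are ours; in practice a criterion such as AIC would, as in the abstract, guide the choice of the number of harmonics:

```python
# Sketch: least-squares fit of a truncated Fourier basis to discrete
# observations on an evenly spaced grid over one period.
import math

def fourier_smooth(y, n_harmonics):
    n = len(y)
    t = [2 * math.pi * i / n for i in range(n)]
    coefs = [sum(y) / n]                  # constant term a0
    for k in range(1, n_harmonics + 1):   # orthogonality => projections
        coefs.append(2 / n * sum(yi * math.cos(k * ti) for yi, ti in zip(y, t)))
        coefs.append(2 / n * sum(yi * math.sin(k * ti) for yi, ti in zip(y, t)))

    def curve(x):                         # smooth reconstruction as a function
        val = coefs[0]
        for k in range(1, n_harmonics + 1):
            val += coefs[2 * k - 1] * math.cos(k * x)
            val += coefs[2 * k] * math.sin(k * x)
        return val

    return curve

# A noiseless single-harmonic signal is recovered almost exactly.
obs = [3.0 + 2.0 * math.sin(2 * math.pi * i / 24) for i in range(24)]
smooth = fourier_smooth(obs, 1)
```

Raising `n_harmonics` lowers the residual sum of squares but risks fitting noise; an information criterion trades the two off.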
- Research Article
467
- 10.1214/009053606000000272
- Jun 1, 2006
- The Annals of Statistics
The use of principal component methods to analyze functional data is appropriate in a wide range of different settings. In studies of "functional data analysis," it has often been assumed that a sample of random functions is observed precisely, in the continuum and without noise. While this has been the traditional setting for functional data analysis, in the context of longitudinal data analysis a random function typically represents a patient, or subject, who is observed at only a small number of randomly distributed points, with nonnegligible measurement error. Nevertheless, essentially the same methods can be used in both these cases, as well as in the vast number of settings that lie between them. How is performance affected by the sampling plan? In this paper we answer that question. We show that if there is a sample of $n$ functions, or subjects, then estimation of eigenvalues is a semiparametric problem, with root-$n$ consistent estimators, even if only a few observations are made of each function, and if each observation is encumbered by noise. However, estimation of eigenfunctions becomes a nonparametric problem when observations are sparse. The optimal convergence rates in this case are those which pertain to more familiar function-estimation settings. We also describe the effects of sampling at regularly spaced points, as opposed to random points. In particular, it is shown that there are often advantages in sampling randomly. However, even in the case of noisy data there is a threshold sampling rate (depending on the number of functions treated) above which the rate of sampling (either randomly or regularly) has negligible impact on estimator performance, no matter whether eigenfunctions or eigenvectors are being estimated.
- Research Article
- 10.29220/csam.2018.25.6.619
- Nov 30, 2018
- Communications for Statistical Applications and Methods
Functional data analysis continues to attract interest because advances in technology across many fields have increasingly permitted measurements to be made from continuous processes on a discretized scale. Particulate matter is among the most harmful air pollutants affecting public health and the environment, and levels of PM10 (particles less than 10 micrometers in diameter) for regions of California remain among the highest in the United States. The relatively high frequency of particulate matter sampling enables us to regard the data as functional data. In this work, we investigate the dominant modes of variation of PM10 using functional data analysis methodologies. Our analysis provides insight into the underlying data structure of PM10, and it captures the size and temporal variation of this underlying data structure. In addition, our study shows that certain aspects of size and temporal variation of the underlying PM10 structure are associated with changes in large-scale climate indices that quantify variations of sea surface temperature and atmospheric circulation patterns.
- Research Article
17
- 10.1190/geo2021-0096.1
- Nov 10, 2021
- GEOPHYSICS
In subsurface modeling and characterization, predicting the spatial distribution of subsurface elastic properties is commonly achieved by seismic inversion. Stochastic seismic inversion methods, such as iterative geostatistical seismic inversion (GSI), are widely applied to this end. Global iterative GSI methods are computationally expensive because they require, at a given iteration, the stochastic sequential simulation of the entire inversion grid at once multiple times. Functional data analysis (FDA) is a well-established statistical method suited to model long-term and noisy temporal series. This method allows us to summarize spatiotemporal series in a set of analytical functions with a low-dimension representation. FDA has been recently extended to problems related to geosciences, but its application to geophysics is still limited. We have developed the use of FDA as a model reduction technique during the model perturbation step in global iterative GSI. FDA is used to collapse the vertical dimension of the inversion grid. We illustrate our hybrid inversion method with its application to 3D synthetic and real data sets. The results indicate the ability of our inversion methodology to predict smooth inverted subsurface models that match the observed data at a similar convergence as obtained by a global iterative GSI, but with a considerable decrease in the computational cost. Although the resolution of the inverted models might not be enough for a detailed subsurface characterization, the inverted models can be used as a starting point of global iterative GSI to speed up the inversion or to test alternative geologic scenarios by changing the inversion parameterization and obtaining inverted models in a relatively short time.
- Book Chapter
20
- 10.1007/978-1-4939-4020-2_1
- Jan 1, 2016
This textbook is dedicated to the study of functional data analysis and shape analysis of curves in Euclidean spaces. For the first topic, one develops tools for statistical analysis of real-valued functional data on fixed intervals. While functional data analysis is a broad topic area, worthy of a textbook in itself, we will focus heavily on a specific aspect that deals with alignment or registration of functional data. For the second topic, one studies shapes formed by curves in 2D, 3D, and higher dimensions, with a goal of performing statistical inferences. Since these curves are also functions, albeit vector-valued, and the issue of registration is of prime importance in their shape analysis, we will cover these topics under a broad umbrella of elastic functional and shape data analysis!