Robust Model‐Based Semi‐Supervised Clustering of Incomplete Records
ABSTRACT This paper develops a multivariate t-mixture model-based semi-supervised clustering methodology for datasets with incomplete records. Specifically, we consider the case where not all features are always observed, as well as the case where label information is available for some of the records, with the goal of grouping all of them. Our model allows constraints on the shape, size, and orientation of the scale matrices of the mixture components, and we develop a fast alternating expectation-conditional maximization (AECM) algorithm for parameter estimation in the semi-supervised framework, which crucially includes the setup where not all classes in the dataset necessarily have representation among the labels. The total number of groups is assessed using the Bayesian information criterion (BIC). Our approach is evaluated on simulated datasets of varying clustering complexity and with varying amounts and structures of missingness in the records or labels. The methodology is applied to further characterize fraudulent and legitimate credit card transactions, and also to categorize incidence and severity in heart disease patients. The publicly available R package MixtClust implements our methods.
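As an illustration of the general semi-supervised mixture workflow the abstract describes, here is a minimal sketch using mclust's MclustSSC, where unlabeled records carry NA class labels. This is not the authors' MixtClust API: it uses Gaussian rather than t components and, unlike MixtClust, handles neither missing features nor classes absent from the labels.

```r
# Semi-supervised model-based clustering with partial labels: a sketch
# using mclust::MclustSSC (illustrative stand-in, not MixtClust itself).
library(mclust)

X <- iris[, 1:4]
labels <- iris$Species
set.seed(1)
labels[sample(nrow(X), 120)] <- NA     # hide 80% of the labels

fit <- MclustSSC(X, class = labels)    # covariance model chosen by BIC
summary(fit)
table(predicted = fit$classification, truth = iris$Species)
```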
- Research Article
31
- 10.1111/j.2041-210x.2011.00175.x
- Jan 23, 2012
- Methods in Ecology and Evolution
Summary
1. Capture–recapture mixture models are important tools in evolution and ecology for estimating demographic parameters and abundance while accounting for individual heterogeneity. A key step is to select the correct number of mixture components (i) to provide unbiased estimates that can be used as reliable proxies of fitness or as ingredients in management strategies and (ii) to classify individuals into biologically meaningful classes. However, there is no consensus method in the statistical literature for selecting the number of components.
2. In ecology, most studies rely on the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), which has recently gained attention in ecology. The Integrated Completed Likelihood criterion (ICL; IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22, 719) was specifically developed to favour well-separated components, but its use has never been investigated in ecology.
3. We compared the performance of AIC, BIC and ICL for selecting the number of components with regard to (a) bias and accuracy of survival and detection estimates and (b) success in selecting the true number of components, using extensive simulations and data on wolves (Canis lupus) that were used for management through survival and abundance estimation.
4. Bias in survival and detection estimates was <0.02 for both AIC and BIC and more than 0.09 for ICL, while mean square error was <0.05 for all criteria. As expected, bias increased as heterogeneity increased. Success rates of AIC and BIC in selecting the 'true' number of components were better than that of ICL (68% for AIC, 58% for BIC and 16% for ICL). As the degree of heterogeneity increased, AIC (and, to a lesser extent, BIC) overestimated the number of components, while ICL often underestimated it. For the wolf study, the 2-class model was selected by both BIC and ICL, while AIC could not decide between the 2- and 3-class models.
5. We recommend using AIC or BIC when the aim is to estimate parameters. Regarding classification, we suggest taking classification quality into account by using ICL in conjunction with BIC, pending further work to adapt its penalty term for capture–recapture data.
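For reference, the three criteria compared in this study, written in the minimized form with maximized log-likelihood, d free parameters, n observations, and EN(ẑ) the entropy of the estimated classification; to this approximation, ICL is BIC plus an entropy penalty that rewards well-separated components:

```latex
\mathrm{AIC} = -2\hat{\ell} + 2d, \qquad
\mathrm{BIC} = -2\hat{\ell} + d \log n, \qquad
\mathrm{ICL} \approx \mathrm{BIC} + 2\,\mathrm{EN}(\hat{z}),
\quad \text{where } \mathrm{EN}(\hat{z}) = -\sum_{i}\sum_{k} \hat{z}_{ik} \log \hat{z}_{ik}.
```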
- Research Article
40
- 10.1111/cla.12581
- May 10, 2024
- Cladistics
Although simulations have shown that implied weighting (IW) outperforms equal weighting (EW) in phylogenetic parsimony analyses, weighting against homoplasy has seen little use in palaeontology. Iterative modifications of several phylogenetic matrices over the last decades have produced extensive genealogies of datasets, which allow differences in the stability of results under alternative character weighting methods to be evaluated directly on empirical data. Each generation was compared against the most recent generation in its genealogy, on the assumption that the latter is the most comprehensive (denser sampling), most revised (fewer mis-scorings) and most complete (least missing data) matrix of the genealogy. The analyses were conducted on six different genealogies under EW, IW and extended implied weighting (EIW), with a range of concavity constant values (k) between 3 and 30. Pairwise comparisons between trees were conducted using Robinson-Foulds distances normalized by the total number of groups, the distortion coefficient, subtree pruning and regrafting moves, and the proportional sum of group dissimilarities. The results consistently show that IW and EIW produce results more similar to those of the last dataset than EW in the vast majority of genealogies and for all comparative measures. This is significant because almost all of these matrices were originally analysed only under EW. Implied weighting and EIW do not unambiguously outperform each other. Euclidean distances based on a principal components analysis of the comparative measures show that different ranges of k-values retrieve the results most similar to the last generation in different genealogies. There is a significant positive linear correlation between the optimal k-values and the number of terminals in the last generations. This could be used to inform the range of k-values in phylogenetic analyses based on matrix size, with the caveat that this emergent relationship still rests on a small sample of genealogies.
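Two of the tree-to-tree measures used above are available off the shelf in R; here is a sketch with phangorn, assuming two hypothetical tree files for an early and the latest generation of a genealogy (the original parsimony analyses would typically be run in dedicated software such as TNT):

```r
# Comparing trees from two generations of a matrix genealogy:
# normalized Robinson-Foulds distance and approximate SPR moves.
library(ape)       # read.tree
library(phangorn)  # RF.dist, SPR.dist

gen_early <- read.tree("generation_1.tre")     # hypothetical files
gen_last  <- read.tree("generation_last.tre")

RF.dist(gen_early, gen_last, normalize = TRUE) # normalized RF distance
SPR.dist(gen_early, gen_last)                  # approximate SPR moves
```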
- Research Article
15
- 10.1109/tcyb.2014.2298401
- Oct 1, 2014
- IEEE Transactions on Cybernetics
This paper concerns model selection for mixtures of probabilistic principal component analyzers (MPCA). The well-known Bayesian information criterion (BIC) is frequently used for this purpose; however, BIC implausibly penalizes each analyzer using the whole sample size. In this paper, we present a new criterion for MPCA, called hierarchical BIC, in which each analyzer is penalized using only its own effective sample size. Theoretically, hierarchical BIC is a large-sample approximation of the variational Bayesian lower bound, and BIC is a further approximation of hierarchical BIC. To learn hierarchical-BIC-based MPCA, we propose two efficient algorithms: a two-stage and a one-stage variant. The two-stage algorithm integrates model selection with respect to the subspace dimensions into parameter estimation, and the one-stage variant further integrates the selection of the number of mixture components into a single algorithm. Experiments on a number of synthetic and real-world data sets show that: 1) hierarchical BIC is more accurate than BIC and several related competitors, and 2) the two proposed algorithms are not only effective but also much more efficient than the classical two-stage procedure commonly used for BIC.
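The core idea can be stated compactly. Where BIC charges every analyzer the full sample size n, hierarchical BIC charges the j-th analyzer only its effective sample size, i.e. the expected number of points it is responsible for. A schematic rendering of this penalty structure, following the abstract's description rather than the paper's exact expression:

```latex
\mathrm{BIC} = \hat{\ell} - \frac{d}{2}\log n, \qquad
\mathrm{hBIC} = \hat{\ell} - \sum_{j=1}^{k}\frac{d_j}{2}\log n_j, \qquad
n_j = \sum_{i=1}^{n}\hat{z}_{ij},
```

where d_j counts the free parameters of analyzer j and the ẑ_ij are posterior responsibilities.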
- Research Article
- 10.3929/ethz-a-004336493
- Jan 1, 2002
ML-estimation based on mixtures of Normal distributions is a widely used tool for cluster analysis. However, a single outlier can make the parameter estimation of at least one of the mixture components break down. Among others, the estimation of mixtures of t-distributions by McLachlan and Peel [Finite Mixture Models (2000) Wiley, New York] and the addition of a further mixture component accounting for 'noise' by Fraley and Raftery [The Computer J. 41 (1998) 578–588] were suggested as more robust alternatives. In this paper, the definition of an adequate robustness measure for cluster analysis is discussed and bounds for the breakdown points of the mentioned methods are given. It turns out that the two alternatives, while adding stability in the presence of outliers of moderate size, do not possess a substantially better breakdown behavior than estimation based on Normal mixtures. If the number of clusters s is treated as fixed, r additional points suffice for all three methods to let the parameters of r clusters explode. Only in the case of r=s is this not possible for t-mixtures. The ability to estimate the number of mixture components, for example by use of the Bayesian information criterion of Schwarz [Ann. Statist. 6 (1978) 461–464], and to isolate gross outliers as clusters of one point, is crucial for an improved breakdown behavior of all three techniques. Furthermore, a mixture of Normals with an improper uniform distribution is proposed to achieve more robustness in the case of a fixed number of components.
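Of the alternatives discussed here, the noise-component approach of Fraley and Raftery is directly available in mclust; a minimal sketch follows, with a crude random initial guess standing in for a proper nearest-neighbor based noise initialization:

```r
# Normal mixture plus a uniform "noise" component (Fraley & Raftery),
# as implemented in mclust. The initial noise guess below is a crude
# placeholder; mclust refines it during estimation.
library(mclust)

X <- faithful
set.seed(1)
noise_init <- rep(FALSE, nrow(X))
noise_init[sample(nrow(X), 10)] <- TRUE   # candidate noise points

fit <- Mclust(X, initialization = list(noise = noise_init))
summary(fit)   # classification 0 collects the estimated noise
```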
- Research Article
9
- 10.1016/j.jpdc.2020.01.005
- Feb 8, 2020
- Journal of Parallel and Distributed Computing
Modeling I/O performance variability in high-performance computing systems using mixture distributions
- Conference Article
5
- 10.1109/icassp.2011.5947360
- May 1, 2011
The aim of this work is to apply a sampling approach to speech modeling, and to propose a Gibbs-sampling-based Multi-scale Mixture Model (M³). The proposed approach focuses on the multi-scale property of speech dynamics, i.e., dynamics in speech can be observed on, for instance, short-time acoustic, linguistic-segmental, and utterance-wise temporal scales. M³ is an extension of the Gaussian mixture model and can be considered a hierarchical mixture model, where the mixture components at each time scale change at intervals of the corresponding time unit. We derive a fully Bayesian treatment of the multi-scale mixture model based on Gibbs sampling. The advantage of the proposed model is that each speaker cluster can be precisely modeled with a Gaussian mixture model, unlike conventional single-Gaussian-based speaker clustering (e.g., using the Bayesian Information Criterion (BIC)). In addition, Gibbs sampling offers the potential to avoid serious local-optimum problems. Speaker clustering experiments confirmed these advantages and obtained a significant improvement over conventional BIC-based approaches.
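For readers unfamiliar with the sampler family this builds on, here is a toy Gibbs sampler for a univariate two-component Gaussian mixture with known variance; this is a generic sketch, not the multi-scale M³ model itself:

```r
# Minimal Gibbs sampler for a 2-component univariate Gaussian mixture
# with known unit variance (generic sketch of the sampler family).
set.seed(1)
y <- c(rnorm(100, -2), rnorm(100, 3))   # synthetic data
K <- 2; sigma <- 1
mu <- rnorm(K); w <- rep(1 / K, K)

for (iter in 1:1000) {
  # 1. Sample assignments z_i given mu and w.
  p <- sapply(1:K, function(k) w[k] * dnorm(y, mu[k], sigma))
  z <- apply(p, 1, function(pr) sample(K, 1, prob = pr))
  # 2. Sample means given z, under a N(0, 10^2) prior.
  for (k in 1:K) {
    v <- 1 / (sum(z == k) / sigma^2 + 1 / 100)
    mu[k] <- rnorm(1, v * sum(y[z == k]) / sigma^2, sqrt(v))
  }
  # 3. Sample weights given z, under a flat Dirichlet prior.
  g <- rgamma(K, shape = 1 + tabulate(z, K))
  w <- g / sum(g)
}
round(mu, 2); round(w, 2)   # final draws of means and weights
```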
- Research Article
41
- 10.1177/1471082x13503455
- May 14, 2014
- Statistical Modelling
In the context of mixture models with random covariates, this article presents the polynomial Gaussian cluster-weighted model (CWM). It extends the linear Gaussian CWM, for bivariate data, in a twofold way. First, it allows for possible nonlinear dependencies in the mixture components by considering a polynomial regression. Second, it is not restricted to model-based clustering, being set in the more general model-based classification framework. Maximum likelihood parameter estimates are derived using the EM algorithm, and model selection is carried out using the Bayesian information criterion (BIC) and the integrated completed likelihood (ICL). The article also investigates the conditions under which the posterior probabilities of component membership from a polynomial Gaussian CWM coincide with those of other well-established mixture models related to it. When applied to artificial and real data, the polynomial Gaussian CWM is shown to outperform the mixture of polynomial Gaussian regressions, its natural competitor in the class of mixture models with fixed covariates.
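The fixed-covariate competitor named at the end, a mixture of polynomial Gaussian regressions, can be fitted with flexmix and scored with the same criteria; a sketch on synthetic data (the CWM itself additionally models the covariate density, which flexmix does not):

```r
# Mixture of polynomial Gaussian regressions, the fixed-covariate
# competitor to the polynomial Gaussian CWM, fitted with flexmix.
library(flexmix)

set.seed(1)
x <- runif(300, -2, 2)
grp <- runif(300) < 0.5
y <- ifelse(grp, 1 + x^2, -2 + x) + rnorm(300, sd = 0.3)

fit <- flexmix(y ~ poly(x, 2), k = 2)
BIC(fit); ICL(fit)    # the two criteria used in the article
parameters(fit)       # per-component regression coefficients
```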
- Research Article
2
- 10.1016/j.jkss.2015.04.003
- May 13, 2015
- Journal of the Korean Statistical Society
Finding standard dental arch forms from a nationwide standard occlusion study using a Gaussian functional mixture model
- Conference Article
5
- 10.21437/interspeech.2010-24
- Sep 26, 2010
In this paper, we propose a novel boosted mixture learning (BML) framework for Gaussian mixture HMMs in speech recognition. BML is an incremental method to learn mixture models for classification problems. In each step of BML, one new mixture component is computed from the functional gradient of an objective function, so that it is added along the direction that increases the objective the most. Several techniques are proposed to extend BML from simple mixture models such as the Gaussian mixture model (GMM) to Gaussian mixture hidden Markov models (HMMs), including a Viterbi approximation to obtain state segmentation, weight decay when initializing sample weights to avoid overfitting, combining partial with global parameter updates, and using the Bayesian information criterion (BIC) for parsimonious modeling. Experimental results on the WSJ0 task show that the proposed BML yields relative word and sentence error rate reductions of 10.9% and 12.9%, respectively, over the conventional training procedure.
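The component-addition step described here follows the generic functional-gradient recipe for boosting density models; schematically (this is a rendering of the abstract's description, not the paper's exact objective):

```latex
f_{m+1} = (1 - \alpha_m)\, f_m + \alpha_m\, \phi_{m+1}, \qquad
\phi_{m+1} = \arg\max_{\phi \in \mathcal{G}} \left\langle \nabla F(f_m),\, \phi \right\rangle,
```

where F is the training objective, f_m the current mixture, 𝒢 the set of candidate Gaussian components, and α_m a step size.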
- Research Article
17
- 10.1111/rssb.12333
- Aug 5, 2019
- Journal of the Royal Statistical Society Series B: Statistical Methodology
Summary: Choosing the number of mixture components remains an elusive challenge. Model selection criteria can be either overly liberal or conservative, and may return poorly separated components of limited practical use. We formalize non-local priors (NLPs) for mixtures and show how they lead to well-separated components with non-negligible weight, interpretable as distinct subpopulations. We also propose an estimator for posterior model probabilities under local priors and NLPs, showing that Bayes factors are ratios of posterior-to-prior empty-cluster probabilities. The estimator is widely applicable and helps to set thresholds for dropping unoccupied components in overfitted mixtures. We suggest default prior parameters based on multimodality for Normal–T mixtures and on minimal informativeness for categorical outcomes. We theoretically characterize the NLP-induced sparsity and derive tractable expressions and algorithms. We fully develop Normal, binomial and product-binomial mixtures, but the theory, computation and principles hold more generally. We observed a serious lack of sensitivity of the Bayesian information criterion, insufficient parsimony of the Akaike information criterion and a local prior, and mixed behaviour of the singular Bayesian information criterion. We also considered overfitted mixtures; their performance was competitive but depended on tuning parameters. Under our default prior elicitation, NLPs offered a good compromise between sparsity and power to detect meaningfully separated components.
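The defining property of a non-local prior in the mixture setting is that the prior density vanishes on the boundary where components become redundant, which is what enforces well-separated, non-negligible components a priori; schematically (the paper's concrete priors are more specific than this):

```latex
\pi(\vartheta) \to 0 \quad \text{whenever } \theta_j = \theta_k
\ \text{for some } j \neq k, \ \text{ or } w_j \to 0,
```

where the θ_j are component parameters and the w_j mixture weights.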
- Conference Article
22
- 10.1145/956750.956775
- Aug 24, 2003
The goal of clustering is to identify groups in a dataset. The basic idea of model-based clustering is to approximate the data density by a mixture model, typically a mixture of Gaussians, and to estimate the parameters of the component densities, the mixing fractions, and the number of components from the data. The number of groups in the data is then taken to be the number of mixture components, and the observations are partitioned into clusters (estimates of the groups) using Bayes' rule. If the groups are well separated and look Gaussian, then the resulting clusters will indeed tend to be distinct in the most common sense of the word - contiguous, densely populated areas of feature space, separated by contiguous, relatively empty regions. If the groups are not Gaussian, however, this correspondence may break down; an isolated group with a non-elliptical distribution, for example, may be modeled by not one, but several mixture components, and the corresponding clusters will no longer be well separated. We present methods for assessing the degree of separation between the components of a mixture model and between the corresponding clusters. We also propose a new clustering method that can be regarded as a hybrid between model-based and nonparametric clustering. The hybrid clustering algorithm prunes the cluster tree generated by hierarchical model-based clustering. Starting with the tree corresponding to the mixture model chosen by the Bayesian Information Criterion, it progressively merges clusters that do not appear to correspond to different modes of the data density.
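The partitioning step via Bayes' rule mentioned above is the usual maximum a posteriori assignment under the fitted G-component mixture:

```latex
\hat{z}_i = \arg\max_{g \in \{1,\dots,G\}}
\frac{\hat{\pi}_g\, \phi(x_i \mid \hat{\mu}_g, \hat{\Sigma}_g)}
     {\sum_{h=1}^{G} \hat{\pi}_h\, \phi(x_i \mid \hat{\mu}_h, \hat{\Sigma}_h)},
```

where the π̂_g are the estimated mixing fractions and φ the Gaussian density (the denominator does not affect the argmax but makes the posterior membership probabilities explicit).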
- Research Article
19
- 10.1109/jstsp.2009.2038312
- Jun 1, 2010
- IEEE Journal of Selected Topics in Signal Processing
Multivariate Gaussian mixture models (GMMs) are widely used for density estimation, model-based data clustering, and statistical classification. A difficult problem is estimating the model order, i.e., the number of mixture components, and the model structure. Use of full covariance matrices, with the number of parameters quadratic in the feature dimension, entails high model complexity and thus may underestimate the order, while naive Bayes mixtures may introduce model bias and lead to order overestimates. We develop a parsimonious modeling and model order and structure selection method for GMMs which allows for, and optimizes over, parameter-tying configurations across mixture components applied to each individual parameter, including the covariances. We derive a generalized Expectation-Maximization algorithm for Bayesian information criterion (BIC) based penalized-likelihood minimization. This, coupled with sequential model order reduction, forms our joint learning and model selection method. Our method searches over a rich space of models and, consistent with minimizing BIC, achieves fine-grained matching of model complexity to the available data. We have found our method to be effective and largely robust in learning accurate model orders and parameter-tying structures for simulated ground-truth mixtures. We compared against naive Bayes and standard full-covariance GMMs on several criteria: 1) model order and structure accuracy (for synthetic data sets); 2) test-set log-likelihood; 3) unsupervised classification accuracy; and 4) accuracy when class-conditional mixtures are used in a plug-in Bayes classifier. Our method, which chooses model orders intermediate between standard and naive Bayes GMMs, gives improved accuracy with respect to each of these performance measures.
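A readily available, coarser version of the same complexity/fit trade-off is mclust's family of constrained covariance parameterizations, arbitrated by BIC; this illustrates the principle, not the paper's per-parameter tying search:

```r
# BIC across covariance parameterizations, from spherical (EII) to
# fully unconstrained (VVV): a coarse analogue of structure selection.
library(mclust)

bic <- mclustBIC(faithful, G = 1:6)
summary(bic)   # best (model, G) pairs by BIC
plot(bic)      # BIC curves per parameterization
```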
- Book Chapter
1
- 10.1007/978-3-319-47121-1_1
- Jan 1, 2016
Modeling user behavior and the latent preferences implied in rating data is the basis of personalized information services. In this paper, we adopt a latent variable to describe user preference, take a Bayesian network (BN) with a latent variable as the framework for representing the relationships among the observed and latent variables, and define the user preference BN (abbreviated as UPBN). To construct a UPBN effectively, we first give the property and initial structure constraint that enable the conditional probability distributions (CPDs) related to the latent variable to fit the given data set via the Expectation-Maximization (EM) algorithm. We then give an EM-based algorithm for constraint-based maximum likelihood estimation of parameters, to learn the UPBN's CPDs from data that are incomplete with respect to the latent variable. Finally, we give an algorithm that learns the UPBN's graphical structure by applying the structural EM (SEM) algorithm and the Bayesian Information Criterion (BIC). Experimental results show the effectiveness and efficiency of our method.
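The SEM-plus-BIC machinery used in the final step is available generically in bnlearn; a hedged sketch on hypothetical incomplete categorical data follows (this shows the generic structural EM search with a BIC score, not the UPBN construction with its latent-variable constraints):

```r
# Structural EM with a BIC score over incomplete discrete data
# (generic machinery; the data below are synthetic placeholders).
library(bnlearn)

set.seed(1)
d <- data.frame(
  A = factor(sample(c("lo", "hi"), 200, replace = TRUE)),
  B = factor(sample(c("lo", "hi"), 200, replace = TRUE)),
  C = factor(sample(c("lo", "hi"), 200, replace = TRUE))
)
d$B[sample(200, 40)] <- NA   # records incomplete in B

dag <- structural.em(d, maximize = "hc",
                     maximize.args = list(score = "bic"),
                     impute = "parents")
dag
```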
- Research Article
2
- 10.1109/access.2018.2849419
- Jan 1, 2018
- IEEE Access
Power batteries are the core component of electric vehicles, and their characteristics determine vehicle performance. A lithium-ion power battery is a time-varying nonlinear system that exhibits different external behaviour under different operating conditions and aging states. To implement optimized charging under changing operating conditions, estimation of the internal states is necessary. To this end, this paper studies joint estimation of multiple states based on an accurate and reliable battery modeling method. First, lithium-ion battery modeling and model parameter identification are studied: a parameter identification method based on forgetting-factor recursive extended least squares is proposed, and the battery model's order is selected using the Bayesian information criterion. Second, a joint multiple-state estimation algorithm for the power battery under charging is studied. Because the accuracy of state-of-charge estimation is tied to the battery's usable capacity, a joint state-of-charge and usable-capacity estimation algorithm based on several battery models of different orders is proposed. A state-of-power estimation algorithm under multiple constraints is proposed, based on discretizing and analyzing the continuous-time differential state equation of the traction battery. Finally, the joint multiple-state estimation algorithm is assembled. Validation results show that the proposed estimation algorithm achieves high accuracy.
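The order-selection step reduces, in generic form, to comparing BIC across candidate model orders; here is a minimal sketch with autoregressive orders standing in for the paper's equivalent-circuit model orders:

```r
# Selecting a model order by BIC: fit AR(p) models for increasing p
# and keep the order with the smallest BIC (illustrative stand-in for
# the paper's battery-model order selection).
set.seed(1)
v <- arima.sim(list(ar = c(0.6, -0.3)), n = 500)  # synthetic AR(2) data

bic <- sapply(1:5, function(p) {
  fit <- arima(v, order = c(p, 0, 0))
  -2 * fit$loglik + length(fit$coef) * log(length(v))
})
which.min(bic)   # order selected by BIC (here, ideally 2)
```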
- Research Article
16
- 10.3390/forecast3010004
- Feb 8, 2021
- Forecasting
Forecasting with time series data has become a highly relevant and effective tool for fisheries stock assessment. Autoregressive integrated moving average (ARIMA) modeling has been commonly used to predict the general trend of fish landings with increased reliability and precision. In this paper, ARIMA models were applied to predict Lake Malombe annual fish landings and catch per unit effort (CPUE). The annual fish landings and CPUE series were first inspected, and both were non-stationary; first-order differencing was applied to render them stationary. The autocorrelation function (ACF), partial autocorrelation function (PACF), Akaike information criterion (AIC), Bayesian information criterion (BIC), root mean square error (RMSE), mean absolute error (MAE), percentage standard error of prediction (SEP), average relative variance (ARV), Gaussian maximum likelihood estimation (GMLE) algorithm, efficiency coefficient (E2), coefficient of determination (R2), and persistence index (PI) were estimated, leading to the identification and construction of ARIMA models suitable for explaining and forecasting the series. According to the measures of forecasting accuracy, the best models for fish landings and CPUE were ARIMA(0,1,1) and ARIMA(0,1,0): they had the lowest AIC, BIC, RMSE, MAE, SEP and ARV values and the highest GMLE, PI, R2 and E2 values. The auto.arima() function in R version 3.6.3 likewise selected ARIMA(0,1,1) and ARIMA(0,1,0) as the best models. The selected models forecast fish landings of 2725.243 metric tons and a CPUE of 0.097 kg/h by 2024.
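The selection-and-forecast workflow is reproducible with the forecast package; a sketch on a synthetic annual series, since the Lake Malombe data are not reproduced here:

```r
# ARIMA identification by information criterion and forecasting,
# in the style described above (synthetic stand-in series).
library(forecast)

set.seed(1)
landings <- ts(3000 + cumsum(rnorm(30, 0, 50)), start = 1990)

fit <- auto.arima(landings, ic = "bic")  # order selection by BIC
summary(fit)
forecast(fit, h = 5)                     # forecasts five years ahead
```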