A New Clustering Method Based on the Inversion Formula
Data clustering is an area of data mining that falls under unsupervised learning. Cluster analysis divides data into classes by discovering the internal structure of the objects in a data set and the relationships among them. This paper presents a new density clustering method based on modified inversion formula density estimation. The new method is intended to improve on the performance and robustness of k-means, the Gaussian mixture model, and other methods. The proposed clustering algorithm consists of three main steps. First, we initialize the parameters and generate a T matrix. Second, we estimate the densities of each point and cluster. Third, we update the mean, sigma, and phi matrices. The new method based on the inversion formula works quite well on different datasets compared with K-means, the Gaussian mixture model, and the Bayesian Gaussian mixture model. On the other hand, the new method has a limitation: in its current state it cannot work with higher-dimensional data (d > 15). This will be addressed in future versions of the model, as detailed in the future work. Additionally, the results show that the MIDEv2 method works best on generated data with outliers, across all datasets (0.5%, 1%, 2%, 4% outliers). Interestingly, the new method based on the inversion formula can also cluster data that have no outliers, for example the popular Iris dataset.
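The three-step loop described in this abstract (initialize, estimate densities, update parameters) has the same shape as a generic mixture-fitting iteration. The sketch below is only an illustration of that shape, not the paper's method: it swaps in a plain one-dimensional Gaussian density where the paper uses the modified inversion formula estimate, and the T and phi matrices, which the abstract does not define, are not reproduced. The function name `fit_mixture` and the mixing weights are assumptions of this sketch.

```python
import math

def gaussian_pdf(x, mean, sigma):
    """1-D normal density; a stand-in for the paper's inversion-formula estimate."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fit_mixture(data, k, n_iter=50):
    """Hypothetical helper: fit k 1-D clusters with an initialize/estimate/update loop."""
    ordered = sorted(data)
    n = len(ordered)
    # Step 1: initialize parameters (means at evenly spaced quantiles,
    # a shared spread, equal mixing weights).
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    centre = sum(ordered) / n
    spread = math.sqrt(sum((x - centre) ** 2 for x in ordered) / n) or 1.0
    sigmas = [spread] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # Step 2: estimate the density of each point under each cluster,
        # then normalize into soft memberships.
        resp = []
        for x in ordered:
            dens = [w * gaussian_pdf(x, m, s)
                    for w, m, s in zip(weights, means, sigmas)]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # Step 3: update the mean, sigma, and weight of each cluster.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # leave an empty cluster unchanged
            means[j] = sum(r[j] * x for r, x in zip(resp, ordered)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(resp, ordered)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-6)
            weights[j] = nj / n
    return means, sigmas, weights
```

Calling `fit_mixture(values, 2)` on a list of floats returns the fitted means, sigmas, and weights. Replacing `gaussian_pdf` with another density estimator changes step 2 only, which is where the paper's method differs from this stand-in.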
- Research Article
17
- 10.3390/electronics11203287
- Oct 12, 2022
- Electronics
The Internet of Things (IoT) increasingly connects industrial production objects with the physical world and has been widely used in various fields. Although it has brought great industrial convenience, there are also potential security threats due to vulnerabilities and malicious nodes in IoT. To correctly identify the traffic of malicious nodes in IoT and reduce the damage caused by malicious attacks on IoT devices, this paper proposes an autoencoder-based IoT malicious node detection method. The contributions of this paper are as follows. First, the high-complexity, multi-featured traffic data are processed and dimensionally reduced by the autoencoder to obtain low-dimensional feature data. Then, the Bayesian Gaussian mixture model is adopted to cluster the data in the low-dimensional space to detect anomalies. Furthermore, variational inference is used to estimate the parameters of the Bayesian Gaussian mixture model. To evaluate our model’s effectiveness, we used a public dataset for our experiments. In the experiments, the proposed method achieves a high accuracy of 99% in distinguishing normal and abnormal traffic with three-dimensional data reduced by the autoencoder, which demonstrates our model’s better detection performance compared with previous K-means and Gaussian Mixture Model (GMM) solutions.
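One common way to turn a fitted mixture into an anomaly detector is to flag points whose log-density under the mixture falls below a threshold. The abstract does not state the paper's decision rule, so the rule below is an assumption, as are the function names, the one-dimensional features, and the fixed (already fitted) mixture parameters. The autoencoder stage is not shown.

```python
import math

def mixture_log_density(x, weights, means, sigmas):
    """Log-density of a 1-D Gaussian mixture, computed with log-sum-exp."""
    terms = [
        math.log(w) - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        - 0.5 * ((x - m) / s) ** 2
        for w, m, s in zip(weights, means, sigmas)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def flag_anomalies(points, weights, means, sigmas, threshold):
    """Hypothetical rule: a point is anomalous if its log-density is below threshold."""
    return [mixture_log_density(x, weights, means, sigmas) < threshold
            for x in points]
```

In practice the threshold would be chosen on held-out normal traffic, for example as a low quantile of its log-density scores.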
- Research Article
- 10.11591/ijece.v14i3.pp2834-2842
- Jun 1, 2024
- International Journal of Electrical and Computer Engineering (IJECE)
A dual-frequency measurement is employed in conjunction with an innovative Ifree filtering technique to mitigate the primary sources of Ifree influence on ground-based augmentation systems (GBAS) and safeguard their reliability. The protection level is conventionally obtained through the Gaussian overbounding approach. This adherence to tradition results in decreased reliability and a higher likelihood of false alarms. In contrast, the Ifree algorithm helps reduce the errors associated with dual-frequency measurements. This paper proposes an overbounding process based on the Bayesian Gaussian mixture model (GMM) for maintaining the Ifree-based GBAS range error. The Bayesian GMM is applied to single-frequency model errors to examine the ambiguity estimations. A Monte Carlo (MC) simulation is set up to determine the accuracy of the estimated GMM assurance level, which is attained through the general estimation method. Then, the final Bayesian GMM, which is used to overbound the Ifree error distribution, is investigated. According to the property of convolution invariance, the vertical protection in the position field is determined without difficult numerical calculations.
- Research Article
9
- 10.1121/10.0000972
- Apr 1, 2020
- The Journal of the Acoustical Society of America
Extensive ocean noise records have kurtoses markedly different from the Gaussian distribution and therefore exhibit non-Gaussianity, which influences the performance of many sonar signal processing methods. To model the amplitude distribution, this paper studies a Bayesian Gaussian mixture model (BGMM) and its associated learning algorithm, which exploits the variational inference method. The most compelling feature of the BGMM is that it automatically selects a suitable number of effective components and then can approximate a sophisticated distribution in practical applications. The probability density functions (PDFs) of three types of noise in different frequency bands collected in the South China Sea (ambient noise, ship noise, and typhoon noise) are modeled and the goodness of fit is examined by applying the one-sample Kolmogorov-Smirnov test. The results demonstrate that: (i) Ambient noise in the low-frequency band may be slightly non-Gaussian, ship noise in each considered band is apparently non-Gaussian, and typhoons affect the noise in the low-frequency band to make it apparently non-Gaussian, while the noise in the high-frequency band is less affected and appears to be Gaussian. (ii) BGMM has higher goodness of fit than the Gaussian or Gaussian mixture model. (iii) In the non-Gaussian case, despite some components having small mixing coefficients, they are of great significance for describing the PDF.
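The goodness-of-fit check mentioned in this abstract rests on the one-sample Kolmogorov-Smirnov statistic, the largest gap between the empirical CDF of the samples and the model CDF. The sketch below computes that statistic for a one-dimensional Gaussian mixture with given parameters. It is an illustration under assumptions: the function names are hypothetical, the mixture parameters are taken as already fitted, and the p-value step of the test is omitted.

```python
import math

def normal_cdf(x, mean, sigma):
    """CDF of a 1-D normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

def mixture_cdf(x, weights, means, sigmas):
    """CDF of a 1-D Gaussian mixture: the weighted sum of component CDFs."""
    return sum(w * normal_cdf(x, m, s)
               for w, m, s in zip(weights, means, sigmas))

def ks_statistic(samples, cdf):
    """One-sample KS statistic: sup |empirical CDF - model CDF|."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d
```

A smaller statistic means a closer fit, so comparing `ks_statistic` for a single Gaussian against a mixture on the same samples mirrors the comparison reported in result (ii).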
- Research Article
11
- 10.1109/ojemb.2022.3181796
- Jan 1, 2022
- IEEE Open Journal of Engineering in Medicine and Biology
Goal: To develop a computationally efficient and unbiased synthetic data generator for large-scale in silico clinical trials (CTs). Methods: We propose the BGMM-OCE, an extension of the conventional BGMM (Bayesian Gaussian Mixture Models) algorithm to provide unbiased estimations regarding the optimal number of Gaussian components and yield high-quality, large-scale synthetic data at reduced computational complexity. Spectral clustering with efficient eigenvalue decomposition is applied to estimate the hyperparameters of the generator. A case study is conducted to compare the performance of BGMM-OCE against four straightforward synthetic data generators for in silico CTs in hypertrophic cardiomyopathy (HCM). Results: The BGMM-OCE generated 30000 virtual patient profiles having the lowest coefficient-of-variation (0.046), inter- and intra-correlation differences (0.017, and 0.016, respectively) with the real ones in reduced execution time. Conclusions: BGMM-OCE overcomes the lack of population size in HCM which obscures the development of targeted therapies and robust risk stratification models.
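Generating synthetic records from a fitted mixture comes down to two draws per record: pick a component according to the mixing weights, then sample from that component. The sketch below shows this for a one-dimensional mixture with fixed parameters. It is not BGMM-OCE: the spectral-clustering hyperparameter estimation is not reproduced, and the function name and parameters are assumptions.

```python
import random

def sample_mixture(n, weights, means, sigmas, seed=0):
    """Hypothetical helper: draw n values from a 1-D Gaussian mixture."""
    rng = random.Random(seed)  # local generator so results are reproducible
    # Choose a component index for every draw, proportional to the weights.
    components = rng.choices(range(len(weights)), weights=weights, k=n)
    # Sample each value from its chosen component.
    return [rng.gauss(means[c], sigmas[c]) for c in components]
```

For multivariate patient profiles the same two-step scheme applies, with a mean vector and covariance matrix per component in place of the scalar mean and sigma.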
- Research Article
162
- 10.1029/2021jb023249
- May 1, 2022
- Journal of Geophysical Research: Solid Earth
Earthquake phase association algorithms aggregate picked seismic phases from a network of seismometers into individual seismic events and play an important role in earthquake monitoring and research. Dense seismic networks and improved phase picking methods produce massive seismic phase datasets, particularly for earthquake swarms and aftershocks occurring closely in time and space, making phase association a challenging problem. We present a new association method, the Gaussian Mixture Model Association (GaMMA), that combines the Gaussian mixture model with earthquake location, origin time, and magnitude estimation. We treat earthquake phase association as an unsupervised clustering problem in a probabilistic framework, where each earthquake corresponds to a cluster of P and S phases with a hyperbolic moveout of arrival times and a decay of amplitude with distance. We use the multivariate Gaussian distribution to model the collection of phase picks of an event; and the mean of the multivariate Gaussian distribution is given by the predicted arrival time and amplitude from the causative event. We carry out the pick assignment to each earthquake and determine earthquake source parameters (i.e., earthquake location, origin time, and magnitude) under the maximum likelihood criterion using the Expectation‐Maximization algorithm. The GaMMA method does not require typical association steps of other algorithms, such as grid‐search or supervised training. The results for both synthetic tests and for the 2019 Ridgecrest earthquake sequence show that GaMMA effectively associates phases from a temporally and spatially dense earthquake sequence while producing useful estimates of earthquake location and magnitude.
- Research Article
12
- 10.1186/s12918-019-0695-x
- Apr 1, 2019
- BMC systems biology
Background: Systematic fusion of multiple data sources for Gene Regulatory Networks (GRN) inference remains a key challenge in systems biology. We incorporate information from protein-protein interaction networks (PPIN) into the process of GRN inference from gene expression (GE) data. However, existing PPIN remain sparse and transitive protein interactions can help predict missing protein interactions. We therefore propose a systematic probabilistic framework on fusing GE data and transitive protein interaction data to coherently build GRN. Results: We use a Gaussian Mixture Model (GMM) to soft-cluster GE data, allowing overlapping cluster memberships. Next, a heuristic method is proposed to extend sparse PPIN by incorporating transitive linkages. We then propose a novel way to score extended protein interactions by combining topological properties of PPIN and correlations of GE. Following this, GE data and extended PPIN are fused using a Gaussian Hidden Markov Model (GHMM) in order to identify gene regulatory pathways and refine interaction scores that are then used to constrain the GRN structure. We employ a Bayesian Gaussian Mixture (BGM) model to refine the GRN derived from GE data by using the structural priors derived from GHMM. Experiments on real yeast regulatory networks demonstrate both the feasibility of the extended PPIN in predicting transitive protein interactions and its effectiveness in improving the coverage and accuracy of the proposed method of fusing PPIN and GE to build GRN. Conclusion: The GE and PPIN fusion model outperforms both the state-of-the-art single data source models (CLR, GENIE3, TIGRESS) as well as existing fusion models under various constraints.
- Abstract
- 10.1016/j.xnsj.2024.100342
- Jul 1, 2024
- North American Spine Society Journal (NASSJ)
4. Systematic clustering analysis using multimodal data in a chronic low back pain cohort: a preliminary baseline analysis in the ComeBACK Study
- Research Article
1
- 10.3390/app15147926
- Jul 16, 2025
- Applied Sciences
Slope stability analysis is conventionally performed using the strength reduction method with the proportional reduction in shear strength parameters. However, during actual slope failure processes, the attenuation characteristics of rock mass cohesion (c) and internal friction angle (φ) are often inconsistent, and their reduction paths exhibit clear nonlinearity. Relying solely on proportional reduction paths to calculate safety factors may therefore lack scientific rigor and fail to reflect true slope behavior. To address this limitation, this study proposes a novel approach that considers the non-proportional reduction of c and φ, without dependence on predefined reduction paths. The method begins with an analysis of slope stability states based on energy dissipation theory. A Bayesian Gaussian Mixture Model (BGMM) is employed for intelligent interpretation of the dissipated energy data, and, combined with energy mutation theory, is used to identify instability states under various reduction parameter combinations. To compute the safety factor, the concept of a “reference slope” is introduced. This reference slope represents the state at which the slope reaches limit equilibrium under strength reduction. The safety factor is then defined as the ratio of the shear strength of the target analyzed slope to that of the reference slope, providing a physically meaningful and interpretable safety index. Compared with traditional proportional reduction methods, the proposed approach offers more accurate estimation of safety factors, demonstrates superior sensitivity in identifying critical slopes, and significantly improves the reliability and precision of slope stability assessments. These advantages contribute to enhanced safety management and risk control in slope engineering practice.
- Research Article
2
- 10.1186/s42400-025-00364-7
- Jun 15, 2025
- Cybersecurity
Network Intrusion Detection Systems (NIDS) are essential for safeguarding networks against malicious activities. However, existing machine learning-based NIDS often require complex feature engineering, which demands significant domain expertise and experimentation, leading to suboptimal model performance in complex network environments. In contrast, deep learning approaches, while powerful, struggle with imbalanced data, resulting in a bias towards normal traffic and reduced effectiveness in detecting rare attacks. To address these issues, we propose a method that combines contrastive learning and Bayesian Gaussian Mixture Model (BGMM). Specifically, we propose a novel contrastive learning loss that enables the model to automatically learn the similarity within normal traffic and the distinction between normal and malicious traffic, thereby generating robust and distinguishable feature representations. This approach not only eliminates the need for manual feature engineering but also helps alleviate the issue of weak feature representations for rare attacks. BGMM further enhances detection performance by adapting to both normal and malicious patterns through the use of multiple components. The effectiveness of the proposed method is validated through extensive experiments on two widely used modern network intrusion datasets. On the UNSW-NB15 dataset, the proposed method achieves 91.27% accuracy and 92.30% F1-score, which is 1.85% and 2.35% better than the state-of-the-art (SOTA) method. On the Distrinet-CIC-IDS2017 dataset, the proposed method achieves 99.66% accuracy and 99.12% F1-score, which is 0.05% and 0.12% better than the SOTA method.
- Research Article
2
- 10.1007/s10064-025-04265-4
- Apr 29, 2025
- Bulletin of Engineering Geology and the Environment
Several global or regional databases for various types of soils have been developed due to their importance in engineering design and analysis. However, a database is not yet available for collapsible loess in which severe geohazards often occur. In this study, a comprehensive loess database with twelve soil parameters is compiled by collecting results of field and laboratory tests on collapsible loess from the city of Xi’an, China. Basic statistics, marginal probability distribution functions (PDFs), and a correlation matrix for loess parameters are estimated from the database. To the best of the authors’ knowledge, this is the first collapsible loess database at a municipal level. In addition, existing databases often lack sufficiently complete multivariate measurement data for a proper estimation of statistical correlations among multiple soil properties. In this study, this incomplete multivariate measurement data problem is tackled by Bayesian methods (i.e., Bayesian Gaussian mixture model and Bayesian compressive sampling (BCS) with Karhunen–Loève (KL) expansion, BCS-KL), which are illustrated and validated using the incomplete and complete subsets of the loess database, respectively. Both the Bayesian Gaussian mixture model and BCS-KL are non-parametric, and they offer a flexible way of modeling marginal PDFs and a correlation matrix from incomplete measurements in a realistic manner.
- Research Article
5
- 10.1109/tmm.2021.3068565
- Jan 1, 2021
- IEEE Transactions on Multimedia
Speech communications and interactions frequently occur in a variety of environments. Noise in the environment significantly degrades speech intelligibility when speaking and listening. Especially in the listening stage, even if the multimedia terminal outputs clean speech, it is still difficult for listeners to obtain information. Intelligibility enhancement (IENH) of speech is a technique for overcoming the environmental noise in the listening stage. It implements a perceptual enhancement of non-noisy speech. This study focuses on IENH via normal-to-Lombard speech conversion, inspired by a well-known acoustic mechanism named the Lombard effect. Our method combines the long short-term memory (LSTM) network and Bayesian Gaussian mixture model (BGMM) to build a conversion architecture. Compared with baselines, it has three main advantages: 1) an LSTM network is used for spectral tilt mapping, fully considering short-term correlations and high-dimensional expression abilities; 2) the aperiodicity (AP) is mapped together with the fundamental frequency ($F_0$) by a BGMM, which considers their relevance constraints and the importance of APs; 3) the gender-dependent mapping is used for $F_0$ and APs to consider distribution differences between genders. Experiments indicate that our method gets better performance in both objective and subjective tests.
- Conference Article
- 10.65109/rbhg4778
- May 28, 2025
Methods for solving classification tasks often assume a data generating process with stable structure that remains fixed during both training and inference. However, autonomous agents deployed in real-world environments often perform classification in situations where the data generating process is dynamic and the ontology of classes is only partially known. Such tasks are known as open-world classification (OWC). We present open-world mixture modeling (OMM), a framework for OWC using Bayesian Gaussian mixture models. With only slight modifications to the standard Bayesian variational inference algorithm, we are able to detect and model novel classes as they appear in a data stream, while maintaining and updating the classes learned during training. Empirical evaluations reveal that the method reliably detects novel classes with performance similar to a supervised classifier trained on labeled samples of the novel classes.
- Conference Article
- 10.5281/zenodo.4041422
- Sep 15, 2020
- Zenodo (CERN European Organization for Nuclear Research)
Learning from demonstration (LfD) is an intuitive framework allowing non-expert users to easily (re-)program robots. However, the quality and quantity of demonstrations have a great influence on the generalization performances of LfD approaches. In this paper, we introduce a novel active learning framework in order to improve the generalization capabilities of control policies. The proposed approach is based on the epistemic uncertainties of Bayesian Gaussian mixture models (BGMMs). We determine the new query point location by optimizing a closed-form information-density cost based on the quadratic Rényi entropy. Furthermore, to better represent uncertain regions and to avoid the local optima problem, we propose to approximate the active learning cost with a Gaussian mixture model (GMM). We demonstrate our active learning framework in the context of a reaching task in a cluttered environment with an illustrative toy example and a real experiment with a Panda robot.
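The quadratic Rényi entropy is convenient for Gaussian mixtures because it has a closed form: the integral of the squared density reduces to a double sum of Gaussian evaluations, with each pair of components contributing a normal density at the difference of their means and the sum of their variances. The sketch below evaluates that closed form in one dimension. It illustrates the entropy only, not the paper's information-density cost, and the function names are assumptions.

```python
import math

def normal_pdf(x, mean, var):
    """1-D normal density parameterized by variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def quadratic_renyi_entropy(weights, means, variances):
    """H_2 = -log of the integral of p(x)^2 for a 1-D Gaussian mixture p.

    Uses the identity: the integral of N(x; m_i, v_i) * N(x; m_j, v_j)
    over x equals N(m_i; m_j, v_i + v_j).
    """
    integral = 0.0
    for wi, mi, vi in zip(weights, means, variances):
        for wj, mj, vj in zip(weights, means, variances):
            integral += wi * wj * normal_pdf(mi, mj, vi + vj)
    return -math.log(integral)
```

For a single component with variance v this reduces to 0.5 * log(4 * pi * v), which gives a quick sanity check on any implementation.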
- Conference Article
5
- 10.1109/iros45743.2020.9341187
- Oct 24, 2020
Learning from demonstration (LfD) is an intuitive framework allowing non-expert users to easily (re-)program robots. However, the quality and quantity of demonstrations have a great influence on the generalization performances of LfD approaches. In this paper, we introduce a novel active learning framework in order to improve the generalization capabilities of control policies. The proposed approach is based on the epistemic uncertainties of Bayesian Gaussian mixture models (BGMMs). We determine the new query point location by optimizing a closed-form information-density cost based on the quadratic Rényi entropy. Furthermore, to better represent uncertain regions and to avoid the local optima problem, we propose to approximate the active learning cost with a Gaussian mixture model (GMM). We demonstrate our active learning framework in the context of a reaching task in a cluttered environment with an illustrative toy example and a real experiment with a Panda robot.
- Conference Article
2
- 10.1109/iwaci.2010.5585111
- Aug 1, 2010
In this study, a multi-level medical image semantic modeling approach based on fuzzy Bayesian networks is proposed. Two forms are built: one is a Bayesian network embedding Conditional Gaussian (CG) models, called BN-CG, and the other is a Bayesian network embedding a Gaussian mixture model (GMM), called BN-GMM. CG and GMM are employed to implement a fuzzy procedure that performs the soft quantification of the continuous visual features of the medical images and extracts the middle-level semantics of the pathological objects, using the probability as the confidence score. Finally, a Bayesian network is utilized to combine these middle-level semantics to build a multi-level semantic model. The BN-CG and BN-GMM models are tested at multiple levels of semantics on a small set of astrocytoma MRI (Magnetic Resonance Imaging) image samples. The experimental results show that this approach is very effective in enabling the auto-annotation and interpretation of astrocytoma MRI images. These models outperform the Bayesian network-based crisp quantification model using k-nearest neighbor classifiers (K-NN). This study provides a novel way to assist radiologists in retrieving medical images.