Chapter 4 - Unsupervised machine learning: clustering algorithms
- # Local Outlier Factor
- # Isolation Forest
- # Density-based Spatial Clustering Of Applications With Noise
- # K-means Clustering
- # Unsupervised Machine Learning Algorithms
- # Unsupervised Machine Learning
- # Outlier Detection Techniques
- # Elbow Point
- # Scikit-learn Library
- Research Article
- 10.64534/commer.2025.511
- Sep 30, 2025
- Pakistan Journal of Commerce and Social Sciences
The rapid integration of cryptocurrencies into the global financial ecosystem has introduced unprecedented challenges in market surveillance, risk management, and anomaly detection. While conventional statistical models such as ARIMA (Autoregressive Integrated Moving Average) and GARCH (Generalized Autoregressive Conditional Heteroscedasticity) have been widely used for anomaly detection, their reliance on assumptions of normality and stationarity often fails to capture the complexities of high-frequency, non-linear cryptocurrency trading. Furthermore, traditional risk metrics, including down-to-up volatility, negative conditional skewness, and relative frequency, may overlook short-term anomalies due to data aggregation limitations. To address these issues, this paper proposes a machine-learning approach for detecting anomalies in cryptocurrency markets, implemented in Jupyter Notebook. We compare four advanced unsupervised machine learning models, i.e., Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Isolation Forest (iForest), One-Class Support Vector Machine (OC-SVM), and Local Outlier Factor (LOF), for anomaly detection using Monte Carlo simulations. The findings indicate that DBSCAN has the highest precision (79.7%) with the fewest false positives, making it ideal for supervisory monitoring, whereas the high false positive rates of OC-SVM and Isolation Forest limit their use. Using data on six well-known cryptocurrencies at three temporal resolutions (daily, hourly, and 15-minute), the performance of the four unsupervised learning techniques was also examined, confirming that the anomalies identified by DBSCAN are consistent with those of the other three methods. Additionally, as a robustness check, we use UpSet plots to visualize the anomalies shared across the unsupervised learning methods.
The number of anomalies also depends on a cryptocurrency's volatility and sampling interval: more volatile assets and higher-frequency data produce more anomalies. The study presents a sound methodological approach for facilitating financial monitoring and mitigating risks in the cryptocurrency market, and provides useful information for market players, analysts, and policymakers. These results emphasize the importance of choosing algorithms based on specific surveillance targets to promote greater stability in digital asset environments.
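As a minimal illustration of the four detectors compared above (a sketch on synthetic return data, not the paper's actual Monte Carlo setup; all parameter values here are assumptions), scikit-learn exposes each model behind a common `fit_predict` interface where `-1` marks an anomaly:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
# Simulated log-returns: mostly calm, with a few injected volatility shocks
returns = rng.normal(0, 0.01, size=(500, 1))
returns[::100] = rng.normal(0, 0.15, size=(5, 1))  # anomalous spikes

X = StandardScaler().fit_transform(returns)

# All four detectors share the convention that fit_predict returns -1 for anomalies
flags = {
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5).fit_predict(X) == -1,
    "iForest": IsolationForest(contamination=0.02, random_state=0).fit_predict(X) == -1,
    "LOF": LocalOutlierFactor(n_neighbors=20, contamination=0.02).fit_predict(X) == -1,
    "OC-SVM": OneClassSVM(nu=0.02).fit_predict(X) == -1,
}
for name, mask in flags.items():
    print(f"{name}: {int(mask.sum())} anomalies flagged")
```

With a setup like this, intersecting the boolean masks is exactly the kind of shared-anomaly comparison an UpSet plot visualizes.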
- Book Chapter
20
- 10.1007/978-981-15-5285-4_12
- Jul 26, 2020
Credit card fraud is a socially relevant problem that raises major ethical issues and poses a great threat to businesses all around the world. To detect fraudulent transactions made by wrongdoers, machine learning algorithms are applied. The purpose of this paper is to identify the best-suited algorithm for accurately finding fraud or outliers using supervised and unsupervised machine learning algorithms. The challenge lies in identifying and understanding them accurately. In this paper, an outlier detection approach is put forward to resolve this issue using supervised and unsupervised machine learning algorithms. The effectiveness of four different algorithms, namely local outlier factor, isolation forest, support vector machine, and logistic regression, is measured by obtaining scores of evaluation metrics such as accuracy, precision, recall, F1-score, support, and the confusion matrix, along with three different averages: micro, macro, and weighted. The implementation of the local outlier factor provides an accuracy of 99.7% and the isolation forest provides an accuracy of 99.6% under supervised learning. Similarly, in unsupervised learning, the support vector machine provides an accuracy of 97.2% and logistic regression provides an accuracy of 99.8%. Based on the experimental analysis, both algorithms used in unsupervised machine learning achieve high accuracy. An overall good, as well as balanced, performance is achieved in the evaluation-metric scores of unsupervised learning. Hence, it is concluded that the implementation of unsupervised machine learning algorithms is relatively more suitable for practical applications of fraud and spam identification.
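The evaluation metrics listed above can be reproduced on a toy confusion setup (hypothetical labels, not the paper's dataset); note how the micro, macro, and weighted averages weight the rare fraud class differently:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

# Hypothetical labels: 95 genuine (0) and 5 fraudulent (1) transactions
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 93 + [1] * 2 + [1] * 4 + [0] * 1   # 2 false positives, 1 false negative

acc = accuracy_score(y_true, y_pred)
print("accuracy:", acc)
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))

# Micro pools all decisions; macro averages per-class scores equally;
# weighted averages per-class scores by class support
for avg in ("micro", "macro", "weighted"):
    p = precision_score(y_true, y_pred, average=avg)
    r = recall_score(y_true, y_pred, average=avg)
    f = f1_score(y_true, y_pred, average=avg)
    print(f"{avg}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

On imbalanced fraud data the macro average drops well below accuracy, which is why a single accuracy figure can be misleading.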
- Book Chapter
2
- 10.1007/978-981-15-9492-2_13
- Jan 1, 2021
Structural Health Monitoring (SHM) has become an area of continuous research with the ever-increasing demand for the safety of civil structures. The damage in civil structures can be detected using multimodal data from sensors, which presents instances of both damaged and undamaged data. The availability of damaged data in real life is difficult to obtain from a healthy structure and hence the problem of damage detection needs to be attempted using normal healthy data, and it becomes synonymous with the anomaly or novelty detection. One-Class classifiers work on the principle that the abundance of healthy data can be used to model an envelope of conditions which, if violated by any data instance, can be termed as damage or outlier detection. We have attempted an array of classifiers on a benchmark structure dataset (IASC-ASCE) both from supervised and unsupervised machine learning domain and propose a comparison between their success rates in determining damage in civil structures. We used classical techniques such as One-Class Support Vector Machines (OC-SVM), One-Class Isolation Forest (OC-IF), One-Class K-means clustering (OC-KMC), One-Class K-nearest neighbors (OC-KNN), Density-Based Spatial Clustering of Applications with Noise (DBSCAN), One-Class Principal Component Analysis (OC-PCA), Local Outlier Factor (LOF) and One-Class Gaussian Distribution (OC-GD). These techniques were tested on the IASC (International Association for Structural Control)–ASCE (American Society of Civil Engineers) SHM benchmark problem for a range of noise levels and range of force intensities to cover wide variations in the generated dataset using MATLAB based simulation. 
Our study helps us conclude that OC-SVM, Isolation Forest, and OC-PCA are the most robust algorithms for the anomaly detection task.
Keywords: Anomaly detection; Machine learning; Structural health monitoring; Support vector machines; Isolation forest; K-means clustering; K-nearest neighbors; Density-based spatial clustering of applications with noise; Principal component analysis; Local outlier factor; Gaussian distribution
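The one-class principle described above (train only on healthy data, flag anything outside the learned envelope) can be sketched with scikit-learn's `OneClassSVM`; the sensor features and the shift representing damage are invented for illustration:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Hypothetical baseline sensor features from a healthy structure
healthy = rng.normal(0, 1, size=(300, 2))
# Hypothetical damaged-state response, shifted away from the baseline
damaged = rng.normal(5, 1, size=(20, 2))

# Learn an envelope from healthy data only; nu bounds the training-outlier fraction
clf = OneClassSVM(nu=0.05, gamma="scale").fit(healthy)

pred = clf.predict(damaged)   # -1 = outside the healthy envelope = damage/outlier
print("flagged as damage:", int((pred == -1).sum()), "of", len(damaged))
```

The same pattern applies to the other one-class models in the abstract (OC-IF, OC-KNN, OC-GD, ...): fit on healthy data, then threshold a score on new data.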
- Preprint Article
- 10.5194/egusphere-egu25-19050
- Mar 15, 2025
In a mountainous watershed, there are many confluences at which two or more streams join. Due to inaccessible terrain and the associated costs, river discharge data are collected at only a few confluences. It is, therefore, important to assess which confluence is critical. By critical, we mean the junction that would create maximum fragmentation in a river network. In this study, we analysed river networks with uneven topography in the Alaknanda River basin, which is vulnerable and prone to geo-hydro hazards. We applied Unsupervised Machine Learning (UML) algorithms such as Isolation Forest and Density-Based Spatial Clustering of Applications with Noise (DBSCAN), along with Linear Integer Programming (LIP), to identify critical confluence locations. We compare our results with well-established graph-based centrality metrics (degree centrality, betweenness centrality, closeness centrality, and eigenvector centrality). Our results suggest that DBSCAN outperformed the other approaches in detecting crucial nodes. We obtained better results using LIP than the other techniques, except DBSCAN. The outcome of this study will help the Central Water Commission in deciding which confluence to focus on and in assessing the locations of new gauges.
Keywords: Critical nodes; Alaknanda Basin; Machine Learning; Hazards
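Two of the centrality metrics mentioned above can be computed from the adjacency matrix alone; this NumPy sketch uses a toy confluence network with hypothetical node IDs (the node whose removal fragments the network most is the high-degree junction):

```python
import numpy as np

# Toy confluence network (hypothetical IDs): three branches join at C, then flow via E to F
nodes = ["A", "B", "D", "C", "E", "F"]
edges = [("A", "C"), ("B", "C"), ("D", "C"), ("C", "E"), ("E", "F")]
idx = {n: i for i, n in enumerate(nodes)}
A = np.zeros((6, 6))
for u, v in edges:
    A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1.0

# Degree centrality: fraction of other nodes each node touches
degree = A.sum(axis=0) / (len(nodes) - 1)

# Eigenvector centrality by power iteration (shift by I so iteration converges on a tree)
x = np.ones(6)
for _ in range(200):
    x = (A + np.eye(6)) @ x
    x /= np.linalg.norm(x)

print("degree hub:", nodes[int(degree.argmax())])
print("eigenvector hub:", nodes[int(x.argmax())])
```

Both metrics single out the junction C, matching the intuition that the confluence collecting the most branches is the critical node.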
- Conference Article
8
- 10.1109/csnet50428.2020.9265466
- Oct 21, 2020
Nowadays, complex attacks like Advanced Persistent Threats (APTs) often use tunneling techniques to avoid detection by security systems such as Intrusion Detection Systems (IDS), Security Information and Event Management systems (SIEMs), or firewalls. Companies try to identify these APTs by defining rules on their intrusion detection systems, but it is a hard task that requires a lot of time and effort. In this study, we compare the performance of four unsupervised machine-learning algorithms: K-means, Gaussian Mixture Model (GMM), Density-Based Spatial Clustering of Applications with Noise (DBSCAN), and Local Outlier Factor (LOF) on the Boss of the SOC Version 1 (Botsv1) dataset of the Splunk project to detect malicious DNS traffic. We then propose an approach that combines DBSCAN and K-Nearest Neighbors (KNN) to achieve a 100% detection rate with a false-positive rate between 1.6% and 2.3%. A simple post-analysis, consisting of ranking the IP addresses by number of requests or volume of bytes sent, determines the infected machines.
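One plausible reading of the DBSCAN-plus-KNN combination can be sketched as follows (toy two-feature traffic profiles; the features, eps, and k are assumptions, not the paper's configuration): DBSCAN learns the benign clusters, and the distance to the k nearest benign points then scores new traffic.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
# Toy DNS features (hypothetical): e.g. query rate and payload entropy, two benign modes
benign = np.vstack([rng.normal([2, 2], 0.2, (100, 2)),
                    rng.normal([6, 6], 0.2, (100, 2))])
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(benign)

# Fit KNN on DBSCAN's cluster labels (noise dropped), then score unseen traffic
core = labels != -1
knn = KNeighborsClassifier(n_neighbors=5).fit(benign[core], labels[core])

new = np.array([[2.1, 1.9],     # resembles a benign mode
                [10.0, 0.5]])   # tunnel-like, far from every benign mode
dist, _ = knn.kneighbors(new)
print("nearest cluster:", knn.predict(new))
print("mean distance to 5 nearest benign points:", dist.mean(axis=1))
```

A large mean neighbor distance marks the second point as suspicious even though KNN still assigns it to some cluster, which is why the distance, not the label, carries the detection signal.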
- Research Article
9
- 10.1016/j.sciaf.2024.e02386
- Sep 19, 2024
- Scientific African
Anomaly detection using unsupervised machine learning algorithms: A simulation study
- Research Article
4
- 10.1144/geochem2024-009
- Aug 26, 2024
- Geochemistry: Exploration, Environment, Analysis
This paper compares three unsupervised machine-learning algorithms – local outlier factor (LOF), Isolation Forest (iForest) and one-class support vector machine (OCSVM) – for anomaly detection in a multivariate geochemical dataset in northeastern Iran. This area contains several Au, Cu and Pb–Zn mineral occurrences. The methodology incorporates single-element geochemistry, multivariate data analysis and application of the three unsupervised machine-learning algorithms. Principal component analysis unveiled diverse elemental associations for the first seven principal components (PCs): PC1 shows a Co–Cr–Ni–V–Sn association indicating a lithological influence; PC2 shows a Au–Bi–Cu–W association suggesting epithermal Au mineralization; PC3 shows variability in Zn–V–Co–Sb–Cu–Cr; PC4 shows a Au–Cu–Ba–Sr–Ag association indicating Au and polymetallic mineralization; PC5 reflects Zn–Ag–Ni–Pb related to hydrothermal mineralization; and PC6 and PC7 show element associations suggesting epithermal and intrusive-related polymetallic mineralization. It was found that OCSVM performed slightly better than LOF and iForest in detecting anomalies associated with known Cu occurrences, and it successfully delineated dispersion from all known Au occurrences. LOF outperformed iForest and OCSVM in identifying all four Pb–Zn occurrences, and the three methods substantially limited the areas of the anomaly class. The analysis showed that LOF produced a less cluttered anomaly map compared to the isolated patterns in the iForest map. LOF was accurate in identifying anomalies associated with Au–Pb mineralization, while iForest detected anomalies associated with Pb–Zn–Cu occurrences and a neighbouring Pb–Zn occurrence. OCSVM performed similarly in the northern and western areas but displayed unique discrepancies in the SE and west by detecting anomalies associated with two Cu occurrences and a Pb–Cu occurrence.
This study examined the influence of contamination fraction on detection of geochemical anomalies, revealing a noteworthy rise in the count of mineral occurrences delineated by anomalies when the contamination fraction increases from 5 to 10%. However, even with a 35% contamination fraction, some Cu occurrences remained outside the anomaly category, indicating potentially overlooked geochemical signals from mineral occurrences due to sampling schemes.
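The contamination-fraction effect discussed above is easy to see in scikit-learn, where `contamination` directly sets the score threshold and hence the share of samples flagged (random stand-in data here, not the geochemical survey):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
# Stand-in for multi-element geochemical samples (3 hypothetical element features)
X = rng.normal(size=(400, 3))

counts = []
for c in (0.05, 0.10, 0.35):
    # contamination sets the quantile of anomaly scores used as the decision threshold
    flagged = IsolationForest(contamination=c, random_state=0).fit_predict(X) == -1
    counts.append(int(flagged.sum()))
    print(f"contamination={c:.2f}: {counts[-1]} of {len(X)} samples flagged")
```

Raising the fraction from 5% to 35% widens the anomaly class roughly proportionally, which mirrors the paper's observation that more mineral occurrences fall inside the anomaly class as the fraction grows.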
- Research Article
- 10.3390/app15052621
- Feb 28, 2025
- Applied Sciences
Clustering algorithms are widely used in statistical data analysis as a form of unsupervised machine learning, playing a crucial role in big data mining research for Maritime Intelligent Transportation Systems. Numerous studies have explored methods for optimizing ship trajectory clustering, such as narrowing dynamic time windows to prevent errors in time-warp calculations or employing the Mahalanobis distance; these methods enhance DBSCAN (Density-Based Spatial Clustering of Applications with Noise) by leveraging trajectory-similarity features for clustering. In recent years, machine learning research has rapidly accumulated, and multiple studies have shown that HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) outperforms DBSCAN in achieving accurate and efficient clustering results due to its hierarchical density-based clustering technique, particularly in big data mining. This study focuses on the area near Taichung Port in central Taiwan, a crucial maritime shipping route where ship trajectories naturally exhibit a complex and intertwined distribution. Using ship coordinates and heading, the experiment normalized and transformed them into three-dimensional spatial features, employing the HDBSCAN algorithm to obtain optimal clustering results. These results provided a more nuanced analysis compared to human visual observation. This study also used O notation and execution time to represent the performance of various methods, with the literature review indicating that HDBSCAN has the same time complexity as DBSCAN but outperforms K-means and other methods. This research involved approximately 293,000 real historical data points and further employed the Silhouette Coefficient and Davies–Bouldin Index to objectively analyze the clustering results.
The experiment generated eight clusters with a noise ratio of 12.7%, and the evaluation results consistently demonstrate that HDBSCAN outperforms other methods for big data analysis of ship trajectory clustering.
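The two validation indices used above can be sketched on toy normalized (x, y, heading) features; plain DBSCAN is used here for portability (`sklearn.cluster.HDBSCAN`, available in scikit-learn 1.3+, or the `hdbscan` package would be drop-in alternatives), and the lane positions are invented:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(3)
# Toy normalized (x, y, heading) features around two traffic lanes (hypothetical values)
X = np.vstack([rng.normal([0.2, 0.2, 0.1], 0.02, (200, 3)),
               rng.normal([0.8, 0.8, 0.9], 0.02, (200, 3))])
labels = DBSCAN(eps=0.1, min_samples=10).fit_predict(X)

noise_ratio = (labels == -1).mean()
mask = labels != -1                      # score only clustered points; noise has no cluster
sil = silhouette_score(X[mask], labels[mask])          # higher is better, max 1
dbi = davies_bouldin_score(X[mask], labels[mask])      # lower is better, min 0
print(f"clusters={labels.max() + 1}, noise={noise_ratio:.1%}, "
      f"silhouette={sil:.3f}, davies-bouldin={dbi:.3f}")
```

The silhouette rewards tight, well-separated lanes while the Davies–Bouldin index penalizes cluster overlap, so reporting both, as the study does, guards against either metric's blind spots.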
- Research Article
9
- 10.3390/su142013328
- Oct 17, 2022
- Sustainability
To reduce the operating cost and running time of demand-responsive transit between urban and rural areas, a DBSCAN K-means (DK-means) clustering algorithm, based on the density-based spatial clustering of applications with noise (DBSCAN) and K-means clustering algorithms, was proposed to pre-process passenger reservation demand by clustering and to optimize stations, and to design a new variable-route demand-responsive transit service system that can promote urban–rural integration. Firstly, after pre-processing the reservation demand with the DBSCAN clustering algorithm, the K-means clustering algorithm was used to divide fixed sites and alternative sites. Then, a bus scheduling model was established, and a genetic simulated annealing algorithm was proposed to solve it. Finally, the feasibility of the model was validated in the northern area of Yongcheng City, Henan Province, China. The results show that the optimized bus scheduling reduced the operating cost and running time by 9.5% and 9.0%, respectively, compared with those of the regional flexible bus, and by 4.5% and 5.1%, respectively, compared with those of the variable-route demand-responsive transit after K-means clustering for passenger pre-processing.
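The two-stage DK-means idea (DBSCAN first to drop sparse reservation noise, then K-means to place stations) can be sketched as below; the demand pockets, parameters, and station count are all invented for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(5)
# Hypothetical reservation coordinates: three demand pockets plus scattered noise
demand = np.vstack([rng.normal(c, 0.05, (60, 2))
                    for c in ([0, 0], [1, 0], [0, 1])])
noise = rng.uniform(-0.5, 1.5, (15, 2))
pts = np.vstack([demand, noise])

# Stage 1: DBSCAN discards sparse, isolated reservations (label -1 = noise)
db = DBSCAN(eps=0.15, min_samples=5).fit_predict(pts)
kept = pts[db != -1]

# Stage 2: K-means on the denoised demand places the candidate stations
stations = KMeans(n_clusters=3, n_init=10, random_state=0).fit(kept).cluster_centers_
print(stations.round(2))
```

Denoising first keeps stray requests from dragging the K-means centers away from the real demand pockets, which is the point of combining the two algorithms.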
- Research Article
- 10.26740/jinacs.v6n02.p532-540
- Jul 18, 2024
- Journal of Informatics and Computer Science (JINACS)
Each semester, the university administers a questionnaire assessing lecturer performance. Lecturer performance evaluation at Universitas Negeri Surabaya is an important process for ensuring that lecturers have fulfilled their duties and responsibilities in delivering quality education to their students. This study uses 22 question instruments on a Likert scale, completed by students to assess lecturer performance. Data on 1,055 lecturers were processed to detect whether lecturer performance conformed to the Semester Learning Plan (Rancangan Pembelajaran Semester, RPS) or whether some lecturers taught out of line with the RPS. Anomaly detection methods were therefore applied to identify lecturer performance that deviates from the usual pattern. For this, the Local Outlier Factor (LOF) and Isolation Forest (IF) algorithms were used because they handle large datasets efficiently and work quickly in the feature space. Since the data were unlabeled, k-means clustering was used to obtain labels for evaluating LOF and IF. K-means produced three clusters: cluster 0 with 279 data points, cluster 1 with 597 data points, and cluster 2 with 179 data points. These cluster results were used to obtain label values for LOF and IF in the comparative evaluation. The LOF algorithm detected 19 lecturers as anomalies and the IF algorithm detected 22. The comparison was evaluated using the Rand index score and the silhouette score. The Rand index was 0.438 for LOF and 0.441 for IF, and the silhouette score was 0.0019 for LOF and 0.0377 for IF. Keywords: lecturer performance, LOF, IF, Rand index, silhouette score
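The evaluation scheme described above (k-means pseudo-labels compared against LOF and IF flags via Rand index and silhouette score) can be sketched on synthetic stand-in data; the 22 Likert features, cluster count, and contamination level are assumptions, not the study's values:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import rand_score, silhouette_score

rng = np.random.default_rng(4)
# Stand-in for per-lecturer scores on 22 Likert-scale items (hypothetical)
X = rng.normal(4.0, 0.4, (300, 22))
X[:6] = rng.normal(2.0, 0.3, (6, 22))   # a few atypical response profiles

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02).fit_predict(X)
iso = IsolationForest(contamination=0.02, random_state=0).fit_predict(X)

# Agreement between the detectors' -1/1 flags and the k-means pseudo-labels
ri_lof, ri_iso = rand_score(km, lof), rand_score(km, iso)
sil_lof, sil_iso = silhouette_score(X, lof), silhouette_score(X, iso)
print(f"Rand index: LOF={ri_lof:.3f} IF={ri_iso:.3f}")
print(f"Silhouette: LOF={sil_lof:.4f} IF={sil_iso:.4f}")
```

As in the study, the silhouette here scores how well the binary anomaly/normal partition separates the data, which is typically close to zero when anomalies are few.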
- Research Article
- 10.35629/5252-07043443
- Apr 1, 2025
- International Journal of Advances in Engineering and Management
The study applies several unsupervised machine learning (ML) clustering models, namely the K-means clustering model, the hierarchical clustering model, Density-based Spatial Clustering of Applications with Noise (DBSCAN), and the RFM (Recency, Frequency, Monetary) customer segmentation framework, to identify distinct and actionable customer segments based on behavioral, demographic, and transactional characteristics. The traditional RFM framework was included in the analysis because clustering models are not optimization models, and the goodness of unsupervised models can only be evaluated with a practical business approach. The results and discussion highlight customer segmentation based on historical transactional characteristics, evaluate the effectiveness of the different clustering algorithms and the segmentation framework, and outline potential future enhancements. The emphasis on statistical analysis and the evaluation of various clustering techniques provides valuable insights into effective customer segmentation strategies.
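A minimal sketch of RFM-based segmentation with K-means (the RFM table, scales, and segment count are invented; the study's data and choices may differ): build recency/frequency/monetary features, standardize them, and cluster.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(9)
# Hypothetical RFM table: recency (days since last order), frequency (orders), monetary (spend)
rfm = np.column_stack([rng.integers(1, 365, 200),
                       rng.poisson(5, 200),
                       rng.gamma(2.0, 50.0, 200)])

# Standardize so recency's day scale does not dominate the distance metric
X = StandardScaler().fit_transform(rfm)
seg = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print("segment sizes:", np.bincount(seg))
```

Profiling each segment's mean RFM values afterwards is what turns the numeric clusters into actionable labels such as "loyal high spenders" or "lapsed customers".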
- Research Article
4
- 10.1088/1755-1315/1370/1/012005
- Jul 1, 2024
- IOP Conference Series: Earth and Environmental Science
Wind energy has experienced significant growth in recent years thanks to the technological development of wind turbines (WTs). However, one of the main challenges for the wind industry remains the early detection of WT failures. An effective strategy to address this challenge is implementing condition monitoring (CM) to detect changes in WT operation that could indicate the onset of a potential failure. This paper uses data from the SCADA (Supervisory Control and Data Acquisition) system of a wind farm located in Ecuador to test three unsupervised machine learning (ML) methods for detecting anomalies in the data, allowing potential WT failures to be predicted. Evaluation metrics showed that the Mahalanobis Distance (MD) algorithm outperformed Isolation Forest (IF) and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) in anomaly detection, achieving accuracies of 0.94, 0.90, and 0.74, respectively; however, IF more effectively detected the points determined to be anomalies.
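The Mahalanobis-distance approach can be sketched in plain NumPy (the SCADA features, baseline statistics, and fault values below are invented for illustration): fit the mean and covariance on normal operation, then score new points by their covariance-scaled distance from the baseline.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical baseline SCADA features: power output (kW) and rotor speed (rpm), correlated
base = rng.multivariate_normal([1000, 15], [[2500, 30], [30, 1]], 500)

mu = base.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(base, rowvar=False))

def mahalanobis(x):
    """Covariance-scaled distance of a point from the fitted baseline."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

normal_pt = np.array([1010.0, 15.2])
fault_pt = np.array([400.0, 15.0])   # power collapse at normal rotor speed
print(mahalanobis(normal_pt), mahalanobis(fault_pt))
```

Because the metric accounts for feature correlation, a point that is plausible in each variable separately but breaks their joint relationship still gets a large distance, which is what makes MD effective for CM data.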
- Conference Article
44
- 10.1109/icesc48915.2020.9155615
- Jul 1, 2020
The development of communication technologies and e-commerce has made the credit card the most common method of payment for both online and in-store purchases, so security in this system is essential to prevent fraudulent transactions. Fraudulent transactions in credit card data are increasing each year. In this direction, researchers are also trying novel techniques to detect and prevent such frauds. However, there is always a need for techniques that detect these frauds precisely and efficiently. This paper proposes a scheme for detecting frauds in credit card data that uses a Neural Network (NN) based unsupervised learning technique. The proposed method outperforms the existing approaches of Auto Encoder (AE), Local Outlier Factor (LOF), Isolation Forest (IF), and K-means clustering. The proposed NN-based fraud detection method performs with 99.87% accuracy, whereas the existing AE, IF, LOF, and K-means methods give 97%, 98%, 98%, and 99.75% accuracy, respectively.
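The paper's NN detector is not reproduced here; as a stand-in, this sketch illustrates the reconstruction-error principle that autoencoder-style detectors rely on, using a linear projection (PCA via SVD) as the simplest "autoencoder": normal data reconstruct well, off-pattern transactions do not. All data and dimensions are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
# Normal transactions lie near a low-dimensional subspace (hypothetical 10 features, rank 2)
W = rng.normal(size=(2, 10))
normal = rng.normal(size=(500, 2)) @ W
fraud = rng.normal(0, 3, size=(5, 10))   # off-subspace points

# Linear "autoencoder": encode onto the top-2 principal directions, then decode
mu = normal.mean(axis=0)
Vt = np.linalg.svd(normal - mu, full_matrices=False)[2][:2]

def recon_error(X):
    Z = (X - mu) @ Vt.T      # encode to 2-D latent space
    Xh = Z @ Vt + mu         # decode back to 10-D
    return np.linalg.norm(X - Xh, axis=1)

print("normal error:", recon_error(normal).mean())
print("fraud error: ", recon_error(fraud).mean())
```

A nonlinear autoencoder replaces the projection with learned encoder/decoder networks, but the detection rule is the same: threshold the reconstruction error.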
- Research Article
17
- 10.1016/j.datak.2017.12.001
- Dec 18, 2017
- Data & Knowledge Engineering
Spatio-temporal outlier detection algorithms based on computing behavioral outlierness factor
- Research Article
- 10.1109/access.2023.3253022
- Jan 1, 2023
- IEEE Access
With the advent of technology, data and its analysis are no longer just values and attributes strewn across spreadsheets; they are now seen as a stepping stone to bring about revolution in any significant field. Data corruption can be brought about by a variety of unethical and illegal sources, making it crucial to develop a highly effective method to identify and appropriately highlight the corrupted data existing in a dataset. Detecting corrupted data, as well as recovering data from a corrupted dataset, is a challenging problem. It requires utmost attention and, if not addressed at earlier stages, may pose problems in later stages of data processing with machine or deep learning algorithms. In the following work we begin by introducing PAACDA, the Proximity-based Adamic Adar Corruption Detection Algorithm, and consolidating its results, particularly accentuating the detection of corrupted data rather than outliers. Current state-of-the-art models, such as Isolation Forest and DBSCAN (Density-Based Spatial Clustering of Applications with Noise), rely on fine-tuned parameters to provide high accuracy and recall, but they also show a significant level of uncertainty when handling corrupted data. In the present work, the authors look into the performance issues of several unsupervised learning algorithms on linear and clustered corrupted datasets. A novel PAACDA algorithm is proposed which outperforms other unsupervised learning benchmarks on 15 popular baselines, including K-means clustering, Isolation Forest, and LOF (Local Outlier Factor), with an accuracy of 96.35% for clustered data and 99.04% for linear data. This article also conducts a thorough exploration of the relevant literature from the previously stated perspectives. In this research work, we pinpoint the shortcomings of the present techniques and draw directions for future work in this field.