Abstract

Background
Dyslipidaemia encompasses a wide range of lipoprotein disorders, categorised by one of two classifications: Fredrickson-Levy (FL) or Sniderman.(1,2) However, both classifications are criticised for relying on incomplete knowledge of lipoprotein metabolism, especially given the emergence of novel treatment options and variations in individual treatment responses.(3) Clustering, an unsupervised machine learning (ML) technique that can process a wide range of variables, has the potential to unmask patient groups with distinct molecular profiles and unique therapeutic targets that can inform more effective prevention strategies for cardiovascular disease (CVD).(4)

Aim
We aimed to use unsupervised ML algorithms to discover intrinsic dyslipidaemia categories from lipoprotein measurements, identify the components of lipid panels that are necessary for classification, and analyse the similarities between the newly formed clusters and the FL and Sniderman classifications.

Methods
Lipid profiles of 5,080,248 patients were obtained from the ‘Very Large Database of Lipids’. This yielded up to 78 blood components per patient, including at least 31 lipoprotein variables. The analysis involved unsupervised K-means clustering, with the value of K and the subset of variables both optimised in an unsupervised manner using a suitable measure of complexity. We then interpreted our clusters using probabilistic decision trees to provide compact and interpretable representations. Finally, we compared the clusters with the Sniderman and FL categories.

Results
In a completely unsupervised fashion, we identified 14 clusters that could be matched to Sniderman categories. The confusion matrix showed a total agreement of 76% (see Figure 1, left panel), a relative Cohen’s kappa of 0.78 (the relative version captures accuracy on categories containing smaller numbers of patient profiles) and an accuracy of 96% on the small Type III class.
Similar results were observed when matching to FL types. We accurately represented our clusters using probabilistic decision trees of small depth (see Figure 2). We discovered that the data had low intrinsic dimension and a manifold-like structure on which the different clusters could be illustrated (see Figure 1, right panel). Specifically, only 3 variables were needed to obtain our classification: apolipoprotein B, total cholesterol and triglycerides.

Conclusion
We showed that completely unsupervised ML techniques can uncover dyslipidaemia categories in lipoprotein profiles from a large patient population. The categories largely align with existing classifications based on prior knowledge of lipoprotein metabolism. Furthermore, few lipoprotein variables were required for categorisation (low-dimensional data), which could aid in determining which lipoproteins should be measured in a clinical setting. Further analysis of the differences between ML clusters and traditional classifications is needed, which may enhance CVD risk management.
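
To make the workflow in the Methods and Results concrete, the following is a minimal sketch of the general idea: cluster three-variable profiles with K-means, map each cluster to a reference category by majority vote, then report total agreement and Cohen's kappa. It is not the authors' pipeline. The data are synthetic, the group centres and labels "A", "B" and "C" are hypothetical, K is fixed at 3 rather than optimised, the farthest-point initialisation is an assumption made here for reproducibility, and the kappa shown is the standard unweighted statistic, not the relative variant reported in the abstract (which the abstract does not define).

```python
import random
from collections import Counter


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from all centres chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the nearest centre for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def cohen_kappa(reference, predicted):
    """Standard (unweighted) Cohen's kappa between two labelings."""
    n = len(reference)
    observed = sum(r == p for r, p in zip(reference, predicted)) / n
    ref_counts, pred_counts = Counter(reference), Counter(predicted)
    # Agreement expected by chance from the two marginal distributions.
    expected = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / n ** 2
    if expected == 1.0:
        return 1.0  # both labelings use one identical category
    return (observed - expected) / (1.0 - expected)


# Synthetic profiles: (apoB, total cholesterol, triglycerides).
# Centres are hypothetical and chosen only to be well separated.
rng = random.Random(0)
group_centres = {"A": (90, 180, 100), "B": (140, 280, 150), "C": (100, 250, 600)}
points, reference = [], []
for name, centre in group_centres.items():
    for _ in range(20):
        points.append(tuple(rng.gauss(m, 5.0) for m in centre))
        reference.append(name)

clusters = kmeans(points, k=3)

# Map each cluster to the reference category held by most of its members.
majority = {
    j: Counter(r for r, c in zip(reference, clusters) if c == j).most_common(1)[0][0]
    for j in set(clusters)
}
predicted = [majority[c] for c in clusters]

agreement = sum(r == p for r, p in zip(reference, predicted)) / len(reference)
kappa = cohen_kappa(reference, predicted)
print(f"agreement={agreement:.2f}, kappa={kappa:.2f}")
```

On real lipid data the variables would normally be standardised before clustering, since triglycerides span a much wider numeric range than apolipoprotein B and would otherwise dominate the distance. The sketch skips this because the synthetic groups are separated by construction.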