Abstract

To the Editor: We read with great interest the article by Hadanny et al,1 published in Neurosurgery. In this study, the authors combined unsupervised and supervised machine learning methods to predict response to spinal cord stimulators. This paper is an excellent example of how both approaches can be used together to prognosticate patient outcomes, and the methodological implications extend beyond patients with spinal cord stimulators.
Unsupervised machine learning (UML) operates on the premise that the ground truth is not known; for this reason, no labeled training data are involved in the model. The output of a UML algorithm is usually a set of distinct clusters identified in the data set. This differs from supervised machine learning (SML) algorithms, which construct a model based on known input and output data. In the literature, UML methods are commonly used to identify novel and emergent phenotypes, whereas SML has commonly been used to perform outcome prediction.2,3
When using UML methods, several important parameters may influence the outcomes seen. The first parameter is the number of clusters. In this paper, the authors used the K-nearest neighbor algorithm (KNN), preceded by an elbow-based cluster optimization step. The KNN algorithm itself is not a completely unsupervised algorithm: it requires the number of clusters to be preset, and therefore needs a cluster optimization step to identify this parameter. In their supplementary methods, they argue that, using the elbow method, a pseudoelbow was created at K = 3. As a result, the decision is made that two clusters exist in the data.
We encourage the authors to consider the use of other approaches to validate this finding, including the silhouette method, which has been shown to be more robust and quantifiable in some cases and may depend less on the subjective judgment of the analyst.4 The number of clusters is also not readily apparent in the projections onto the principal component analysis space. We suggest the use of other dimensionality reduction approaches to visualize the clusters, namely t-distributed stochastic neighbor embedding or uniform manifold approximation and projection. While the clustering itself should remain in the multidimensional space, dimensionality reduction approaches have been useful in visualizing multidimensional data that are otherwise difficult to appreciate.
The second parameter is the algorithm itself. KNN is a very well-documented partitioning algorithm. However, it has several limitations. First, the algorithm requires numerical data, which may distort the categorical variables present in the data set, such as patient biological sex, diagnosis, and smoking status. In addition, the sample size required for use of the KNN method is relatively large, with some studies suggesting the need for a sample size of 2^n, where n is the number of features (24 features in this study, which would imply on the order of 2^24, or roughly 16.8 million, samples).5,6 Other clustering algorithms, such as hierarchical clustering, have been shown to be far more effective for smaller data sets, and we encourage the authors to explore the use of more robust algorithms that would better fit the constraints of the data.
As more neurosurgical studies use machine learning approaches, specifically UML, for the analysis of data, it becomes very important for researchers to consider the aforementioned parameters to strengthen their observations and findings.
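As a minimal sketch of the cluster-number check suggested above, the following compares the elbow quantity (within-cluster sum of squares) with the mean silhouette score across candidate values of K. It is an illustration only and does not reproduce the analysis of Hadanny et al: the data are synthetic two-dimensional points, the partitioning step is assumed to be a plain k-means (Lloyd's algorithm) with a deterministic farthest-point initialization, and all function names (`make_blobs`, `assign`, `kmeans`, `silhouette`) are hypothetical.

```python
import math
import random

def make_blobs(seed=0, per_blob=30):
    """Synthetic 2-D data: three well-separated Gaussian blobs (illustration only)."""
    rng = random.Random(seed)
    blob_centers = [(0.0, 0.0), (20.0, 0.0), (0.0, 20.0)]
    return [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            for cx, cy in blob_centers for _ in range(per_blob)]

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]

def kmeans(points, k, iters=100):
    """Lloyd's algorithm; returns (labels, within-cluster sum of squares)."""
    # Assumed initialization: start from the first point, then repeatedly
    # add the point farthest from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    for _ in range(iters):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the previous center if a cluster happens to empty out.
            new_centers.append(tuple(sum(x) / len(members) for x in zip(*members))
                               if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    inertia = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, inertia

def silhouette(points, labels):
    """Mean silhouette score: (b - a) / max(a, b), averaged over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

points = make_blobs()
results = {}
for k in range(2, 6):
    labels, inertia = kmeans(points, k)
    results[k] = (inertia, silhouette(points, labels))
    print(f"K={k}  inertia={inertia:10.1f}  silhouette={results[k][1]:.3f}")

best_k = max(results, key=lambda k: results[k][1])
print("K with the highest mean silhouette:", best_k)
```

On this toy data set the inertia column gives the curve that an elbow plot would show, whereas the silhouette column gives a single number per K that can be compared directly; here it should peak at the three clusters built into the synthetic data. With real clinical data the two criteria may disagree, which is the situation in which a quantitative check is most useful.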
We applaud Hadanny et al1 for an excellent presentation of how UML and SML approaches can be combined, and we look forward to additional novel contributions using both approaches in the future to better understand complex disease processes and ultimately help patients.
