Abstract

Across single-cell technologies, including flow and mass cytometry as well as scRNA-seq, unsupervised clustering algorithms have become a staple of data analysis and are often hailed as a replacement for manual gating, with the promise of an unbiased interrogation of the data. There is no shortage of software for the purpose, and many tools come with user-friendly graphical interfaces for the less programming-inclined part of the community. The algorithms boast a wide range of features: some excel at detecting rare cell populations, some suggest the number of distinct cell subsets in the data, some are fast, some are highly reproducible, etc. Common to almost all of them is that they are oversold on at least one aspect: they almost never provide an unsupervised, unbiased answer at the click of a button, but rather prompt a semisupervised, iterative, interdisciplinary process of computational analysis (e.g., by a bioinformatician) and domain-expert interpretation (e.g., by an immunologist, hematologist, or disease specialist) until a biologically meaningful clustering is achieved 1 (Fig. 1). This is not to say that they are not useful (they certainly are), but the one-click, one-size-fits-all analysis of single-cell data remains elusive.

In the wake of heavy development of algorithms and tools follows extensive testing and reviewing. In a key review of cytometry clustering tools, Weber & Robinson (2016) 2 highlighted a number of algorithms performing well on parameters such as the ability to detect rare or even novel cell populations, the ability to produce results mirroring those achieved by manual gating of the data, the run-to-run reproducibility of the results, and the run times of the algorithms. The FlowSOM algorithm 3 came out on top in terms of speed, which, combined with good clustering reproducibility, has made it a go-to algorithm in studies involving both flow and mass cytometry.
The big advantages of FlowSOM and similar unsupervised clustering approaches over traditional manual gating have been discussed extensively 1, 2, 4, with the key conclusion being that algorithmic clustering is not only more convenient than manual gating but, being unbiased by biological preconceptions, also offers the potential to detect rare populations likely to be missed by manual approaches. There are, however, a number of features of automated clustering that users need to be aware of. First, mathematically optimal clustering is not the same as biologically meaningful clustering. The unsupervised algorithms remain ignorant of decades of biological research, as well as of the technical uncertainty of the data produced by the various technologies 5, 6. We may know for a fact that two markers are never expressed simultaneously on the same cell lineage, but if the expression of all other markers happens to be similar, the algorithm will be none the wiser and will likely combine the two cell types in a single cluster. This property can be argued to be both a feature and a bug at the same time: unbiased, naive data analysis is more likely to reveal rare or novel cell populations, but given the highly knowledge-based approach to constructing the phenotyping panels in these studies, how unbiased can we really expect the analysis results to be? Second, when evaluating the accuracy of clustering algorithms, we face the problem that we lack an objective benchmark: when attempting to expand the horizon of our current knowledge, the truth of course becomes a subjective matter, and even when simply attempting to replicate basic existing knowledge, the benchmark is usually a manually gated population, subject to the gating strategy applied. Last, no two algorithms produce the same results, and sometimes this is not even the case for two runs of the same algorithm on the same data.
The reason for the latter is that most, if not all, of the widely used algorithms use a random start (meaning that unless the same seed is used, non-identical clusterings will result from each run). This is done to speed up these highly computationally demanding algorithms. To speed up the analysis even further, users will often randomly downsample their data, which of course also does not produce exactly the same results each time a new sample is drawn. The effects of these tricks range from basically undetectable to sometimes affecting the downstream biological interpretation of the results 2, 7, 8. The bottom line is this: the workflow in algorithmic clustering of cytometry data is rarely an unsupervised process, but more likely a semisupervised, iterative process of computational analysis and domain-expert interpretation. In the November issue of Cytometry Part A (pages 1191–1197), Lacombe et al. describe an approach for the analysis of flow cytometry data using FlowSOM 3 for clustering and the commercially licensed software Kaluza for interpreting the results, thereby providing a framework that facilitates efficient application of the semisupervised approach to clustering. By using Kaluza for post-processing, not only is it made easy for non-programmers to interrogate the resulting clusters using mean fluorescence intensities, cell numbers and percentages, and 2D histograms, but the labeling strategy can also be saved as a protocol for future use. One additional highlight of this work is the use of two approaches for assigning additional “case” cells to existing “control” clusters: one based solely on healthy reference samples, and one including samples from both healthy individuals and leukemia patients.
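The run-to-run variability introduced by random starts and random downsampling can be illustrated with a small sketch. This is not the authors' pipeline: k-means on synthetic data stands in for any seed-dependent clustering algorithm (such as the SOM underlying FlowSOM), and the adjusted Rand index quantifies how well two runs agree.

```python
# Illustrative sketch (not the published workflow): seed-dependent clustering
# and random downsampling as two sources of run-to-run variation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Synthetic "cytometry-like" data: 3 overlapping populations, 6 markers
centers = rng.normal(0, 3, size=(3, 6))
cells = np.vstack([rng.normal(c, 1.0, size=(500, 6)) for c in centers])

# Two runs on identical data, differing only in the random seed
labels_a = KMeans(n_clusters=3, n_init=1, random_state=1).fit_predict(cells)
labels_b = KMeans(n_clusters=3, n_init=1, random_state=2).fit_predict(cells)

# ARI = 1.0 would mean identical partitions; agreement is often high but
# not guaranteed to be perfect across seeds
ari = adjusted_rand_score(labels_a, labels_b)
print(f"run-to-run agreement (ARI): {ari:.3f}")

# Random downsampling adds a second source of variation between analyses
subset = rng.choice(len(cells), size=750, replace=False)
labels_sub = KMeans(n_clusters=3, n_init=1, random_state=1).fit_predict(cells[subset])
print(f"{len(subset)} of {len(cells)} cells clustered after downsampling")
```

Fixing the seed (and the downsampled subset) makes a single pipeline reproducible, but it does not remove the underlying dependence of the result on those random choices.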
By first clustering and labeling healthy samples alone and subsequently “mapping” the disease samples of interest onto the predefined clusters of the healthy cells, it is possible to gain more control over the output, as the healthy hematopoietic lineages are more easily defined than malignant ones. A similar approach has previously been suggested for mass cytometry data 9, but, as discussed by Lacombe et al., while the projection of diseased cells (in their case leukemic cells) onto the predefined clusters of healthy cells can be beneficial for the stable subsets, it also limits the opportunity to discover novel populations that are unique to patient samples. As a result, the proposed method also includes a clustering scheme in which leukemic and healthy samples are clustered together, as a means to detect minimal residual disease cells that are present solely in the leukemic samples. The work by Lacombe et al. very nicely exemplifies semisupervised, iterative analysis of cytometry data and how it can be applied to answer a real-life research question. Additionally, their framework requires little programming skill and is consequently accessible to most of the community, enabling researchers without programming experience to conduct the whole analysis themselves. Other groups have suggested similar procedures for the analysis of both flow and mass cytometry data 1, 4, 8, 10, and both commercial and academic solutions (e.g., Cytosplore, Astrolabe, and Cytobank) exist to facilitate the process to various degrees. However, for most of the free-to-use academic bioinformatics tools, smooth iteration between analysis and interpretation is limited by the user-friendliness (many of the most popular algorithms are command-line only) and the run times of the analysis tools.
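The “mapping” idea described above can be sketched in its simplest form: assign each new (patient) cell to the nearest of a set of predefined cluster centroids learned from healthy reference cells, and flag cells that sit far from every healthy centroid as candidates for disease-unique populations. This is a generic nearest-centroid stand-in with made-up numbers, not the FlowSOM/Kaluza workflow of Lacombe et al.

```python
# Minimal nearest-centroid sketch of mapping "case" cells onto predefined
# "control" clusters (hypothetical data; not the published method).
import numpy as np

rng = np.random.default_rng(42)

# Pretend centroids of labeled healthy clusters (clusters x markers)
healthy_centroids = np.array([[0.0, 5.0, 1.0],
                              [4.0, 0.5, 2.0],
                              [1.0, 1.0, 6.0]])

# New patient cells (cells x markers), simulated near healthy cluster 1
patient_cells = rng.normal(loc=healthy_centroids[1], scale=0.5, size=(4, 3))

# Euclidean distance of every cell to every centroid, then argmin
dists = np.linalg.norm(
    patient_cells[:, None, :] - healthy_centroids[None, :, :], axis=2)
assignments = dists.argmin(axis=1)
print(assignments)  # each patient cell labeled with its nearest healthy cluster

# Cells far from every healthy centroid may indicate populations unique to
# patient samples, which a pure healthy-reference mapping would otherwise miss
outlier = dists.min(axis=1) > 3.0
print(outlier)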
FlowSOM is, as mentioned, a very fast clustering algorithm, capable of clustering a million cells with 15 channels in ~1 min on a dedicated high-performance computing CPU with 128 GB of memory. Other popular clustering algorithms, including Phenograph 11 and X-shift 12, have much longer run times and higher memory requirements, with Phenograph taking ~50 min for a million cells with 15 channels and X-shift taking ~4 h for just 250,000 cells. As can be seen in Figure 2A, the run time of X-shift scales very poorly, prohibiting its use in the iterative approaches. Both X-shift and Phenograph also face memory issues at higher cell counts, making it infeasible to run these algorithms on large data sets on a personal computer. Overall, the majority of cytometry clustering algorithms (with FlowSOM being the notable exception) are computationally demanding and time-consuming to run. When considering the run times of computational analyses in semisupervised frameworks, it is important to also consider the additional, frequently used analysis tools, such as those for visualization of the data, for example, dimensionality reduction, density plots, and heatmaps. Dimensionality reduction is commonly used to examine the global structure of the data. In this category of algorithms, PCA is extremely fast even for millions of cells, but because of the complex structure of cytometry data, t-SNE and UMAP are much more commonly used to visualize clusters 13. A drawback of these methods is their run times, with UMAP requiring ~30 min and t-SNE more than 3 h to process a million cells with 15 channels (Fig. 2B). Generally, most visualizations of the data can take time and, due to the large size of cytometry data sets, can be quite computationally demanding, and in most cases they require at least some programming skills.
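The speed gap between PCA and the neighbor-embedding methods is easy to verify on synthetic data. The sketch below times scikit-learn's PCA on 100,000 simulated cells with 15 channels (a smaller stand-in for the million-cell benchmarks quoted above; absolute timings will of course vary with hardware). Comparable t-SNE or UMAP runs on the same matrix would take orders of magnitude longer, which is why PCA is often used as a fast first look or as a pre-reduction step.

```python
# Timing sketch: PCA scales to large cell counts almost instantly.
# Synthetic data only; real cytometry matrices may be 10x larger or more.
import time
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
cells = rng.normal(size=(100_000, 15))  # 100k cells x 15 channels

t0 = time.perf_counter()
embedding = PCA(n_components=2).fit_transform(cells)
elapsed = time.perf_counter() - t0

print(f"PCA of {cells.shape[0]:,} cells took {elapsed:.2f} s")
print(embedding.shape)  # one (x, y) coordinate per cell, ready for plotting
```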
With all of this being said, clustering-based analysis of cytometry data has offered many new insights into the mechanisms of health and disease in the past decade, and efforts in developing better and more efficient computational analysis tools, as well as frameworks for interpretation, continue to enhance the knowledge output from immunophenotyping data. While a fast, unsupervised, unbiased approach to analyzing cytometry data has yet to see the light of day, one thing is certain: these are not only challenging, but exciting times for cytometry. The authors declare no conflict of interest. This work was funded by the Independent Research Fund Denmark (grant 8048-00078B to LRO).
