Automating flow cytometry

Carlos E Pedreira

doi:10.1002/cyto.a.22007

Abstract

Flow cytometry data analysis has traditionally been based on identifying cell populations through gates defined in bi-dimensional plots, where an experienced operator selects the subpopulation(s) of interest. However, recent advances in the multiparameter capabilities of flow cytometers, together with the introduction of sophisticated strategies (1) that allow for virtually infinite color flow cytometry, have increased the motivation for and interest in new strategies for the automated identification of cell populations. Automation schemes for flow cytometry data were addressed in Ref. 2 and more recently in Refs. 3–7. The article by Stuchlý et al. in this issue (page 120) introduces an interesting procedure aimed at automating the analysis of proteomic data generated by flow cytometry. It is important to emphasize the relevance of this contribution, as it brings in a methodology that can potentially be used in a broad range of proteomic problems. Stuchlý et al. provide a tool that automatically finds a set of clusters, formed by color-coded microspheres, in a sequence of bi-dimensional spaces. Cluster analysis (8) is a collection of techniques that has been used extensively and successfully in engineering problems and that, especially in the last decade, is increasingly becoming a key tool in many medical applications. The main goal of cluster analysis is to identify underlying structures present in data. Accordingly, the aim is to track down groups of multidimensional points (in the case of Stuchlý et al., groups of color-coded microspheres) that are similar to one another. More than that, one pursues groups of points that are internally compact and well separated from each other. There are many well-established clustering methods in the literature. Stuchlý et al.
reported that they experimented with some of these methods, among them the classical k-means and hierarchical clustering (8), and finally opted for a k-medoids family approach, specifically partitioning around medoids (PAM) (9). At this point, it seems worthwhile to look a little more closely at these choices and their potential applications in flow cytometry automation. The k-means, probably the most popular approach to clustering, is an easy-to-use, intuitive method. It groups the data by associating a representative, called a centroid, with each of the clusters. The initial centroids are chosen arbitrarily (typically, although not necessarily, at random) in the data-point space. Accordingly, if one aims to partition data into k clusters, one should initially choose (or draw) k centroids. The next step is to identify, for each centroid, the subset of data-points that are closer to it than to any of the other centroids. One can then calculate the mean of each subset and place these means as the new centroids. These steps are repeated in a loop until none of the centroids changes its location. The k-medoids approach also has to be initialized arbitrarily, but each cluster representative (medoid) is chosen to be one of the data-points, instead of the mean of the data-points belonging to a given cluster, as in k-means. The aim is then to find k medoids, each of which minimizes a measure of dissimilarity to all the data-points of the cluster it represents. The implications of this apparently small change are twofold: (i) First, the method becomes less sensitive to data-points that are far away from their cluster representative. In the k-means approach, occasional faraway data-points could severely change the value of the means used as the cluster representatives; (ii) Second, the downside is that this procedure is much harder from the computational point of view, for both processing and memory, and furthermore, this
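
The contrast between the two kinds of representative described above can be illustrated with a small sketch. This is not the procedure of Stuchlý et al., nor the PAM algorithm itself: it is a simplified alternating scheme (assign points, then update each representative) in which only the update rule differs between k-means and a basic k-medoids variant. The function names (`cluster`, `mean_point`, `medoid_point`), the Euclidean distance, and the toy data are all assumptions made for illustration.

```python
import math
import random


def nearest(point, reps):
    """Index of the representative closest to `point` (Euclidean distance)."""
    return min(range(len(reps)), key=lambda j: math.dist(point, reps[j]))


def partition(points, reps):
    """Split `points` into one group per representative."""
    groups = [[] for _ in reps]
    for p in points:
        groups[nearest(p, reps)].append(p)
    return groups


def mean_point(group):
    """k-means update: the coordinate-wise mean of the group."""
    return tuple(sum(coords) / len(group) for coords in zip(*group))


def medoid_point(group):
    """k-medoids update: the member with the smallest total distance to the rest."""
    return min(group, key=lambda c: sum(math.dist(c, p) for p in group))


def cluster(points, k, update, seed=0, max_iter=100):
    """Alternate assignment and update until the representatives stop moving."""
    rng = random.Random(seed)
    reps = rng.sample(points, k)  # arbitrary initial representatives
    for _ in range(max_iter):
        groups = partition(points, reps)
        # Keep the old representative if a group happens to be empty.
        new_reps = [update(g) if g else r for g, r in zip(groups, reps)]
        if new_reps == reps:
            break
        reps = new_reps
    return reps


# Hypothetical toy data: a tight group of four points plus one faraway point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (100, 100)]
centroid = cluster(data, 1, mean_point)[0]
medoid = cluster(data, 1, medoid_point)[0]
print("k-means centroid:", centroid)
print("k-medoids medoid:", medoid)
```

With a single cluster, the mean is dragged toward the faraway point, whereas the medoid remains one of the points in the tight group. The medoid update compares every candidate against every other member of its group, which hints at why medoid-based methods cost more in computation than k-means.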
