Abstract

Cytometry datasets are currently being generated that are orders of magnitude larger than those seen in the recent past. Flow cytometers are now available that increase the number of parameters measured for each single cell by 50% (to 30). Mass cytometry (CyTOF) is a newly emerging technology capable of measuring over 50 individual markers simultaneously. Imaging cytometry can produce hundreds of quantitative parameters from image features. The value of the information contained within these large and complex single-cell datasets can only be realized with approaches that effectively integrate, analyze, interpret, and share them. It is widely recognized that manual data analysis is a rate-limiting step and a primary source of variation in the application of cytometry for basic research and biomarker discovery [1]. There is widespread demand for new software tools, as the ability to organize, analyze, and exchange flow cytometry (FCM) data is lagging far behind the ability to run samples, to the detriment of health research and clinical applications of the technology [2]. In this respect, the state of the art has not changed since the first example of automated analysis of cytometry data was published 30 years ago in this journal. In that article, Murphy concluded: “Unfortunately, the use of three or more independent fluorescent parameters complicates the analysis of the resulting data significantly” [3]. Qiu et al. recently came to the same conclusion in the first CyTOF article: “Despite the technological advances in acquiring 30 parameters per single cell, methods for analyzing multi-dimensional single-cell data remain inadequate” [4]. The recent development of many robust computational algorithms for FCM data analysis [5] is beginning to address these challenges.
The articles included in this first of two parts of the Special Issue on Computational Analysis of Flow Cytometry Data offer a broad representation of the state of the art in the field of cytometry informatics. The impetus for this Special Issue was the Flow Cytometry: Critical Assessment of Population Identification Methods (FlowCAP) sessions at the CYTO 2014 meeting in Fort Lauderdale, where many of the approaches were presented. The goal of FlowCAP is to advance the development of computational methods for the identification of cell populations of interest in flow cytometry data [6]. FlowCAP provides the means to objectively test these methods, first by comparison to manual analysis by experts using common datasets, and second by prediction of a clinical/biological outcome. The identification of rare cell populations remains an important and considerable challenge for automated analysis methods, even when those cell populations are known in advance. These rare populations, by definition, contribute little information compared to more abundant cell subsets. Qiu [this issue, page 594] presents the strategy used in FlowCAP-III to identify rare cell populations composing 0.02% and 0.4% of total cells. The approach down-samples abundant cell subsets, thus putting the rare and abundant cells on an equal footing prior to training an ensemble classifier that assigns individual cells to one of the two rare populations of interest. The strategy handles sample-to-sample variability by clustering samples into similar batches prior to training and processing the batches separately. The approach achieved among the highest prediction accuracies on the testing data. In a typical FCM experiment, investigators seek to identify all biologically meaningful cell populations in an individual sample, and then to match these cell populations across samples for quantitative comparative analysis.
Sörensen and coworkers [this issue, page 603] present immunoClust to address these goals computationally. In the first step, immunoClust uses a finite mixture model and an expectation maximization-based iterative clustering method to group cell events into population clusters in individual samples, and then reduces the quantitative marker data into a series of statistical parameters for each cell cluster. In the second step, immunoClust again uses iterative mixture modeling to perform a meta-clustering of cell clusters based on the derived statistical parameters for cross-sample cell population matching. The authors demonstrate the utility of this approach, with convincing results, on blood samples with defined cell compositions following immunodepletion, on a FlowCAP dataset designed for the identification of rare cell populations, and on a high-dimensional CyTOF dataset. FlowCAP was also the basis for the work by Tong and coworkers [this issue, page 616], who used a combination of supervised and unsupervised learning techniques to address the FlowCAP-IV challenge. A common approach in analysis is to first simplify the problem at hand as much as possible, and their gEM/GANN approach does this by using an unsupervised learning method called Expectation Maximization (EM) to remove doublet cells, analogous to what is done during manual analysis. A supervised learning method called Genetic Algorithm-Artificial Neural Network (GANN) is then used to identify biomarkers. While their approach had some success, they hypothesize several possible causes for the difference in performance between test and training data: predictive cell populations may have been undetectable by their approach because they fell in low-frequency bin channels, the samples were imbalanced across disease groups, the progression time of the patients varied significantly, and sample profiles within the disease groups were inconsistent.
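Both immunoClust and gEM/GANN, as described above, rely on expectation maximization over a mixture model. The following is a minimal sketch of that general idea only, not the code of either tool: it fits a two-component, one-dimensional Gaussian mixture to simulated marker intensities and returns per-cluster summary statistics (weight, mean, variance). The function names, the simulated data, and the choice of two components are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def simulate_events(seed=0):
    """Hypothetical data: 300 events from one population, 100 from another."""
    rng = random.Random(seed)
    abundant = [rng.gauss(2.0, 0.5) for _ in range(300)]
    scarce = [rng.gauss(8.0, 0.7) for _ in range(100)]
    return abundant + scarce


def fit_two_component_em(values, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture with plain EM.

    Returns (weights, means, variances, responsibilities).
    """
    n = len(values)
    # Crude initialisation (an assumption): means at the data extremes,
    # both variances set to the overall variance.
    means = [min(values), max(values)]
    overall_mean = sum(values) / n
    overall_var = sum((v - overall_mean) ** 2 for v in values) / n
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each event.
        resp = []
        for v in values:
            dens = [weights[k] * normal_pdf(v, means[k], variances[k])
                    for k in range(2)]
            total = sum(dens) or 1e-300  # guard against numerical underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate the parameters from the responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # leave an empty component unchanged
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - means[k]) ** 2
                      for r, v in zip(resp, values)) / nk
            variances[k] = max(var, 1e-6)  # keep the variance strictly positive
    return weights, means, variances, resp


if __name__ == "__main__":
    w, m, s2, _ = fit_two_component_em(simulate_events())
    for k in range(2):
        print(f"cluster {k}: weight={w[k]:.3f} mean={m[k]:.3f} var={s2[k]:.3f}")
```

The per-cluster weights, means, and variances returned here play the role of the derived statistical parameters mentioned above; a second clustering pass over such summaries from many samples would correspond to the meta-clustering step.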
Human error in data analysis has previously been noted as one of the largest sources of variation in flow cytometric analysis of heterogeneous samples [1]. Baradez and coworkers [this issue, page 624] demonstrate that a pipeline of automated data analysis tools including a normalization algorithm, together with automatic fluorescence standardization, can significantly reduce technical variation and improve reproducibility across multiple runs. They then demonstrate the application of their work to cell product authentication and quality control by quantitative analysis of a range of human cell lines. Visualization and presentation of high-dimensional cytometry data is an important consideration for data interpretation that is gaining increasing attention. Van Gassen and coworkers [this issue, page 636] present an approach using self-organizing maps, called FlowSOM, to cluster flow cytometry data into cell populations. Relationships between similar cell populations are presented as a minimal spanning tree, similar to the SPADE algorithm [4], and a multiparametric summary of each cell population is presented as a star chart. The approach provides a convenient overview of the behavior of all cell subsets and all markers simultaneously. While significant progress has been made in the automated gating of cell populations, considerable challenges remain when population overlaps make the definition (i.e., placement) of a gate uncertain. Bagwell and coworkers [this issue, page 646] introduce a novel approach based on Probability State Modeling (PSM) theory to capture such uncertainty and improve gating. In particular, the authors show how the use of cumulative distributions (and their inverses) can improve the identification of cut-points compared to approaches based on densities. They illustrate their approach on several B-cell datasets and show how PSM enhances the accuracy of locating B-cell subsets.
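The minimal spanning tree presentation mentioned above for FlowSOM and SPADE can be illustrated with a small sketch. It assumes that each cell population is summarised by a centroid of marker values and that similarity is measured by Euclidean distance; the centroid values and the function name are hypothetical, and this is Prim's algorithm in general form rather than either tool's implementation.

```python
import math


def minimal_spanning_tree(centroids):
    """Return MST edges (i, j, distance) over cluster centroids (Prim's algorithm)."""
    n = len(centroids)
    if n == 0:
        return []
    # best[j] = (distance from node j to the tree, nearest tree node),
    # for every node j that has not been attached yet. Node 0 seeds the tree.
    best = {j: (math.dist(centroids[0], centroids[j]), 0) for j in range(1, n)}
    edges = []
    while best:
        # Attach the outside node that is closest to the current tree.
        j = min(best, key=lambda node: best[node][0])
        d, i = best.pop(j)
        edges.append((i, j, d))
        # The newly attached node may offer a shorter link to the remaining nodes.
        for k in best:
            dk = math.dist(centroids[j], centroids[k])
            if dk < best[k][0]:
                best[k] = (dk, j)
    return edges


if __name__ == "__main__":
    # Hypothetical two-marker centroids for four cell populations.
    centroids = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
    for i, j, d in minimal_spanning_tree(centroids):
        print(f"population {i} -- population {j}: distance {d:.3f}")
```

Each edge links a population to its most similar neighbour already in the tree, so closely related populations end up adjacent in the resulting layout.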
Many of the approaches described in this issue focus on the automated analysis of high-dimensional datasets. In contrast, Günther and Müller [this issue, page 661] describe a strategy for identifying multiple populations of microbes using two scatter dimensions and only one fluorescence dimension. Sequential analysis of two-dimensional “slices” of a scatterplot facilitated the manual gating of clusters as they appeared in each slice, and the procedure was more reproducible than conventional manual gating on sequential bivariate plots, which is well known to vary between operators. This facilitated gating was also more effective in identifying small populations in the large, three-dimensional datasets. The profiling of the cell-surface phenotype of adipose stromal cells (ASC) also used a low number of dimensions in individual samples [Donnenberg et al. this issue, page 665], but very high numbers of dimensions were accumulated by analyzing many aliquots, each stained with one of the 242 antibodies available on freeze-dried lyoplates. The resulting high-dimensional data were clustered by k-means to establish a detailed map of the antigens expressed on ASC. This clustering method should allow standardization across different cytometry platforms, thus helping to identify variations in such populations between studies. The investigation of antigen-specific cellular immune responses relies on the ability to detect and quantify rare cell subsets. Lin and coworkers [this issue, page 675] demonstrate the use of an analytical pipeline that combines semi-automated gating using the OpenCyto R package and dimensionality reduction using t-distributed stochastic neighbor embedding (t-SNE) to compare the levels of polyfunctional T cells expressing multiple intracellular cytokines in response to either Mycobacterium tuberculosis infection or human immunodeficiency virus antigen vaccination.
The pipeline effectively illuminated differences in rare cell subsets between infection and vaccination cohorts, revealing the heterogeneity in antigen-specific T cell subsets at the single-cell level. Although the method was tested for the detection of antigen-specific T cell populations, similar pipelines could be used in any setting where the detection of rare cell subsets is required. The lack of software interoperability with respect to gating has traditionally been a bottleneck, preventing both the use of multiple analytical tools and the reproduction of flow cytometry data analyses by independent parties. Spidlen and colleagues [this issue, page 683] present Gating-ML 2.0, a computer file format to encode and interchange gates. This format is an ISAC Recommendation that has been significantly simplified in comparison to the previous Gating-ML 1.5 Candidate Recommendation, and this simplification has already facilitated its support in several software tools. In the opinion of ISAC's Data Standards Task Force and Council, Gating-ML is sufficiently mature and tested to be widely adopted by instrument vendors and third-party software developers. The results presented here and those deriving from the FlowCAP efforts clearly indicate that computational cytometry analysis has come of age. Our hope is that this Special Issue will help educate the cytometry user community about how best to incorporate these computational methods into their routine research workflows.
