Abstract

This is the second of two Special Issues focused on the Computational Analysis of Flow Cytometry Data. These Special Issues were built around the FlowCAP project, run under the direction of an open consortium of immunologists, bioinformaticians, statisticians, and clinical scientists who share the goal of advancing the development of computational methods for the identification of cell populations of interest in flow cytometry data 1. Three new algorithms that participated in the FlowCAP-IV challenge were highlighted in the previous Special Issue 2-4. Aghaeepour and members of the FlowCAP Consortium [this issue, page 16] present the comparative results of this challenge from the seven participating groups. The participants were provided a training set of 14-color flow cytometry data from 384 individuals enrolled in a long-standing natural history study of HIV. Using the survival status of these individuals, the challenge was to predict outcome in a blinded, independent test set. Previous manual analysis, carried out over a period of several months by the team at the Vaccine Research Center at the NIH (widely recognized as experts in the field) who provided the dataset, did not identify any biomarkers. Two automated analysis pipelines provided statistically significant predictive value in the blinded test set. Van Gassen and coworkers [this issue, page 22] present one of these approaches: a novel automated pipeline that identifies and selects informative cell subsets from cytometry data to build a survival regression model for predicting HIV disease progression as part of the FlowCAP IV challenge.
Their method, called FloReMi (Flow Density Survival Regression Using Minimal Feature Redundancy), combines two previously developed algorithms with a feature selection step. The first, flowDensity 5, is an automated cell population identification method; the second, flowType 6, uses the cell partitions provided for each marker by either manual analysis or clustering to enumerate all cell types in a sample. The feature selection algorithm then identifies informative, non-redundant features predictive of time to AIDS in a survival model. The authors evaluated three survival time prediction algorithms using the selected features; the random survival forest approach was the most successful, and it obtained the best predictive performance of all methods submitted to the challenge. The pipeline consists of four algorithmic steps that operate independently of one another, and as the authors themselves point out, it will be interesting to assess these components in isolation to see whether the pipeline can be further improved. The second statistically significant analysis pipeline in FlowCAP-IV combined the flowType and flowDensity approaches used by Van Gassen with RchyOptimyx 6, an algorithm that measures the importance of the identified cell types by correlating their abundance with external outcomes, such as disease state or patient survival, and distills the identified phenotypes to their simplest possible form. Overall, the results of the FlowCAP-IV study are important because they highlight the significant value of unsupervised analysis approaches that can mine the full high-dimensional space of the data.
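As a rough illustration of the flowType idea described above (not the authors' implementation), the enumeration of all phenotypes from per-marker partitions can be sketched as follows; the marker names and matrix layout are hypothetical, and each marker is treated as required-positive, required-negative, or ignored, giving 3 to the power m phenotypes:

```python
import itertools
import numpy as np

def enumerate_phenotypes(pos, markers):
    """Toy sketch of flowType-style phenotype enumeration.

    pos     : boolean (n_cells, n_markers) array, True where a cell is
              positive for a marker (e.g., from per-marker thresholds
              such as those provided by a density-based gate).
    markers : list of marker names, one per column.

    Each marker is either required positive, required negative, or
    ignored, so m markers yield 3**m phenotypes; the returned value
    for each phenotype is the fraction of cells matching it."""
    n_cells, n_markers = pos.shape
    fractions = {}
    for states in itertools.product((None, True, False), repeat=n_markers):
        mask = np.ones(n_cells, dtype=bool)
        name = ""
        for j, state in enumerate(states):
            if state is None:
                continue  # this marker is not considered in this phenotype
            mask &= pos[:, j] == state
            name += markers[j] + ("+" if state else "-")
        fractions[name or "All"] = mask.mean()
    return fractions
```

On a toy two-marker matrix this yields nine phenotypes, from the unconstrained "All" population down to fully specified subsets such as CD4+CD8-; in a real pipeline these fractions would then feed a feature selection step.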
Automated algorithms have several advantages; perhaps most importantly, they can overcome the limitations of laborious manual analysis of serial two-dimensional projections of the data on a computer screen, a process that does not scale to the next generation of 30-parameter flow cytometry instruments becoming available alongside mass cytometry offerings of similar complexity. However, automated gating still has its own challenges, one of which is the consistent labeling of cell populations across multiple samples in the presence of biological variability and heterogeneity. Cell populations may appear, disappear, or seem to merge into single populations, depending on which biological sample is being gated, and automated methods have struggled with this issue. Lee, McLachlan, and Pyne [this issue, page 30] describe the Joint Cluster Matching procedure, which models sample-to-sample variability through a hierarchical random-effects model, providing a mechanism to link common cell populations across samples. They show that their method successfully models the large variation within and between batches of samples, and they apply it to sample classification and de novo high-dimensional automated gating on several well-studied, publicly available data sets. Another issue facing many flow cytometry algorithms is computational speed. Zaunders and coworkers [this issue, page 44] address this problem with a novel approach (SOPHE) that divides the data into a relatively small number of bins and determines the shape of the data in each bin with second-order polynomials. The bin data are then combined to define local maxima and hence clusters. SOPHE does not require an input cluster number to begin clustering, but it does require the data to be scaled in each dimension (e.g., by the logicle transform) and a user-determined bin volume threshold that sets the level of resolution.
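The general bin-then-find-modes idea behind such approaches can be illustrated with a toy two-dimensional sketch. This is not the SOPHE algorithm itself (which characterizes each bin with second-order polynomials and scales to high-dimensional data); here raw bin counts on a fixed grid and a simple hill-climb to the densest neighboring bin are assumed:

```python
import numpy as np

def grid_mode_clustering(data, bins=16):
    """Toy sketch of bin-based mode finding (not SOPHE itself).

    Histogram the events on a coarse 2-D grid, then hill-climb each
    occupied bin to its highest 8-neighbour; events whose bins reach
    the same local maximum form one cluster."""
    hist, xedges, yedges = np.histogram2d(data[:, 0], data[:, 1], bins=bins)

    def climb(i, j):
        # Step to the densest neighbouring bin until a local maximum.
        while True:
            best = (i, j)
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < bins and 0 <= nj < bins and hist[ni, nj] > hist[best]:
                        best = (ni, nj)
            if best == (i, j):
                return best
            i, j = best

    # Map each event to its bin, then to the peak that bin climbs to.
    xi = np.clip(np.digitize(data[:, 0], xedges) - 1, 0, bins - 1)
    yi = np.clip(np.digitize(data[:, 1], yedges) - 1, 0, bins - 1)
    return np.array([climb(i, j) for i, j in zip(xi, yi)])  # (n, 2) peak per event
```

The bin width plays the role of the resolution threshold: coarser bins merge nearby modes, finer bins split them, which is why a user-chosen resolution parameter is needed in methods of this family.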
Although this is a relatively low-resolution method, it is very fast and shows excellent correspondence to manual gating of major populations. Even 33-dimensional CyTOF data are handled well, although the agreement with manual gating is not quite as good as with lower-dimensional data; the authors make the reasonable point that this may reflect the difficulty manual operators face in analyzing very high-dimensional data rather than any problem with the algorithm. The authors show that SOPHE can be used iteratively to analyze selected populations at different levels of resolution, and it may also prove very useful as a rapid pre-analysis step to define the major populations for normalization (registration) programs and for high-resolution clustering algorithms. Several computational algorithms for automated identification of cell populations in flow cytometry data have been reported 7. Once cell populations are identified in individual samples, however, it is often important to identify differences in cell populations between groups of samples that differ by some biological property of interest. Rebhahn and coworkers [this issue, page 59] report a novel augmentation of the SWIFT algorithm that collects cluster templates from multiple sample models and then allows these templates to compete for the assignment of individual cell events in different samples. They demonstrate that this approach sharpens the cell population differences between samples and allows them to identify a set of cell populations in human peripheral blood that change with age. Algorithms have also been shown to match or exceed the accuracy of expert analysts for the identification of cell populations 1. However, matching these cell types across large datasets with different subjects, timepoints, and experimental conditions has remained a challenge.
Hsiao and coworkers [this issue, page 71] describe a cross-sample cell population mapping algorithm based on the Friedman–Rafsky (FR) test. The FR statistic provides a non-parametric framework for comparing the high-dimensional empirical distributions of two cell populations. Using a wide range of synthetic and real datasets, the authors demonstrate that the FR statistic outperforms existing matching methods (including those based on Kullback–Leibler divergence) under various types of technical and biological variation, including proportion differences and position shifts. The stringency of the matching algorithm is controlled by a single threshold. In datasets with limited variation, the threshold can easily be configured from the distribution of the FR statistics; for more complex datasets (e.g., when some cell types are absent from some samples), the stringency must be carefully tuned based on the reference sample (which can be the union of all samples) and the biological and/or clinical aims of the study.

Kim and coworkers [this issue, page 89] developed an approach that compares two single-parameter histograms using difference curves and their simultaneous confidence bands. They first convert the number of events per channel in each data file into observation frequencies across 1,024 fluorescence intensity channels, using linear interpolation to estimate values for channels with zero events. They then take 100,000 bootstrap samples from the original listmode files to create a smoothed histogram with confidence bands. By merging the bootstrapped data sets and subtracting one data set from the other on a channel-by-channel basis, differences in relative frequency between any two conditions are constructed, and each difference is quantified by the area under the positive portion of the difference curve.
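A minimal sketch of the Friedman–Rafsky idea discussed above (not the authors' matching code; SciPy and Euclidean distances are assumed here): pool the two populations, build a minimum spanning tree over the pooled events, and count the edges that join events from different samples. When the two populations follow the same distribution, roughly 2mn/(m+n) of the tree's edges cross between samples; far fewer cross edges indicates the distributions differ.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist

def fr_cross_edges(x, y):
    """Count cross-sample edges in the MST of two pooled samples.

    x, y : (m, d) and (n, d) arrays of events.
    Returns (observed cross-sample edges, expected count under the
    null hypothesis that x and y share one distribution)."""
    pooled = np.vstack([x, y])
    label = np.r_[np.zeros(len(x)), np.ones(len(y))]
    # Dense distance matrix; scipy treats zero entries as absent edges,
    # so the positive off-diagonal distances define a complete graph.
    mst = minimum_spanning_tree(cdist(pooled, pooled))
    i, j = mst.nonzero()
    observed = int(np.sum(label[i] != label[j]))
    expected = 2.0 * len(x) * len(y) / len(pooled)
    return observed, expected
```

In the full FR test the observed count is standardized by its null variance to obtain the test statistic; this sketch keeps only the edge count, which already separates matched from mismatched populations in simple cases.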
These articles, together with those in the first Special Issue of this series and the wider state of the art 7, clearly demonstrate that automated analysis of flow cytometry data has reached a level of maturity at which computational methods can meet, and in many cases exceed, the performance of expert manual analysis. This comprehensive suite of tools can be assembled into pipelines that take primary data from the cytometer and ultimately generate reports summarizing experiments involving hundreds of files in easily interpretable visualizations that trace back to the primary data. While challenges remain, the algorithms available today can robustly address many users' needs, both for the identification of biomarkers of interest during discovery analysis and for the enumeration of specific cell populations with the precision and recall required for clinical trials and patient diagnosis.
