Abstract

We describe algorithms for discovering immunophenotypes from large collections of flow cytometry (FC) samples and using them to organize the samples into a hierarchy based on phenotypic similarity. The hierarchical organization is helpful for effective and robust cytometry data mining, including the creation of collections of cell populations characteristic of different classes of samples, robust classification, and anomaly detection. We summarize a set of samples belonging to a biological class or category with a statistically derived template for the class. Whereas individual samples are represented in terms of their cell populations (clusters), a template consists of generic meta-populations (each a group of homogeneous cell populations obtained from the samples in a class) that describe key phenotypes shared among all those samples. We organize an FC data collection in a hierarchical data structure that supports the identification of immunophenotypes relevant to clinical diagnosis. A robust template-based classification scheme is also developed, but our primary focus is on the discovery of phenotypic signatures and inter-sample relationships in an FC data collection. This collective analysis approach is more efficient and robust because templates describe phenotypic signatures common to cell populations in several samples while ignoring noise and small sample-specific variations. We have applied the template-based scheme to analyze several datasets, including one representing a healthy immune system and one of acute myeloid leukemia (AML) samples. The latter task is challenging because of the phenotypic heterogeneity of the several subtypes of AML. Nevertheless, we identified thirteen immunophenotypes corresponding to subtypes of AML and were able to distinguish acute promyelocytic leukemia (APL) samples with the markers provided. Clinically, this is helpful since APL has a different treatment regimen from other subtypes of AML.
Core algorithms used in our data analysis are available in the flowMatch package at www.bioconductor.org. It has been downloaded nearly 6,000 times since 2014.

Highlights

  • Feature selection is the problem of identifying a representative set of features from a large dataset to construct a classification model

  • Whereas individual samples are represented in terms of their cell populations, a template consists of generic meta-populations that describe key phenotypes shared among all those samples

  • We have described a set of algorithms for feature selection in a collection of flow cytometry samples by identifying immunophenotypes

Summary

INTRODUCTION

Feature selection is the problem of identifying a representative set of features from a large dataset to construct a classification model. Current fluorescence-based technology supports the measurement of up to twenty proteins simultaneously in each cell [6], whereas atomic mass cytometry systems such as CyTOF [7] can measure more than forty markers per cell. When thousands of such high-dimensional samples are produced in an experiment, researchers have no alternative but to automate the data analysis. We extend our prior work [24, 25] and that of other researchers by clearly defining the steps in template-based data analysis and developing a generic framework for robust classification and immunophenotyping. For this purpose, we have developed a scoring function that accounts for the diversity of the myeloid cell populations in the various subtypes of AML.
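As a rough illustration of the template idea (each sample represented by its cell-population clusters, each class summarized by meta-populations shared across its samples), the following minimal Python sketch may help. It is not the flowMatch implementation and does not use the mixed edge cover algorithm: it assumes each cluster is reduced to a centroid in marker space, groups centroids with a simple greedy agglomerative merge, and classifies a sample by its mean distance to the nearest meta-cluster. The function names, the `max_dist` threshold, and the toy two-marker data are all hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def build_template(samples, max_dist):
    """Group cluster centroids from all samples of a class into meta-clusters.

    Assumption: a plain greedy agglomerative merge stands in for the
    paper's template construction. The two closest groups are merged
    repeatedly until the closest pair is farther apart than max_dist.
    """
    groups = [[c] for sample in samples for c in sample]
    while len(groups) > 1:
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = dist(centroid(groups[i]), centroid(groups[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_dist:
            break
        groups[i] = groups[i] + groups[j]
        del groups[j]
    # Each meta-cluster is summarized by the mean of its member centroids.
    return [centroid(g) for g in groups]


def dissimilarity(sample, template):
    """Mean distance from each sample cluster to its nearest meta-cluster."""
    return sum(min(dist(c, m) for m in template) for c in sample) / len(sample)


def classify(sample, templates):
    """Return the label of the template with the smallest dissimilarity."""
    return min(templates, key=lambda label: dissimilarity(sample, templates[label]))


# Toy data (hypothetical): two markers, two clusters per sample.
healthy = [[(1.0, 1.0), (5.0, 5.0)], [(1.2, 0.8), (4.8, 5.2)]]
aml = [[(1.0, 1.2), (9.0, 2.0)], [(0.8, 1.0), (9.2, 2.2)]]

templates = {
    "healthy": build_template(healthy, max_dist=1.0),
    "aml": build_template(aml, max_dist=1.0),
}

test_sample = [(1.1, 1.0), (8.9, 2.1)]
predicted = classify(test_sample, templates)
print(predicted)
```

In this toy setup each template ends up with two meta-clusters, and sample-specific jitter in the centroids is averaged away, which is the intuition behind a template being more robust than any single sample.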

STEPS IN ANALYZING FC DATA
Removing Unintended Cells
Data Transformation and Variance Stabilization
Cell Population Identification
Registering Cell Populations across Samples
Overview of the Mixed Edge Cover Algorithm
Creating Templates from a Collection of Samples
Overview of the Template Construction
Comparisons among Different Algorithms for Creating Templates
Sample Classification Based on Templates
Classification Score of a Sample in the AML Dataset
The Healthy Dataset
Preprocessing and Spectral Unmixing
Variance Stabilization
Building Class Templates
Comparison with Alternative Approaches
The AML Dataset
Cell Populations in Healthy and AML Samples
Healthy and AML Templates
Identifying Meta-Clusters
Impact of Each Tube in the Classification
Classifying Test Samples
Findings
CONCLUSION