Abstract

Background: Recent biological discoveries have shown that clustering large datasets is essential for better understanding biology in many areas. Spectral clustering in particular has proven to be a powerful tool amenable to many applications. However, it cannot be directly applied to large datasets due to time and memory limitations. To address this issue, we have modified spectral clustering by adding an information-preserving sampling procedure and applying a post-processing stage. We call this entire algorithm SamSPECTRAL.

Results: We tested our algorithm on flow cytometry data as an example of large, multidimensional data containing potentially hundreds of thousands of data points (i.e., "events" in flow cytometry, typically corresponding to cells). Compared to two state-of-the-art model-based flow cytometry clustering methods, SamSPECTRAL demonstrates significant advantages in proper identification of populations with non-elliptical shapes, low-density populations close to dense ones, minor subpopulations of a major population, and rare populations.

Conclusions: This work is the first successful attempt to apply spectral methodology to flow cytometry data. An implementation of our algorithm as an R package is freely available through BioConductor.

Highlights

  • Recent biological discoveries have shown that clustering large datasets is essential for better understanding biology in many areas

  • We applied SamSPECTRAL to four different flow cytometry datasets to demonstrate its applicability to a broad spectrum of flow cytometry data, and compared its performance to two state-of-the-art model-based clustering methods optimized for flow cytometry data

  • Faithful sampling is based on potential theory. It reduces the size of the input to spectral clustering algorithms so that they can be applied efficiently to flow cytometry data in spite of its large size
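
To illustrate how a sampling step can shrink the input to a clustering algorithm while keeping track of local density, here is a minimal Python sketch. It is not the SamSPECTRAL implementation: it assumes a simple radius-based scheme in which a randomly chosen point becomes a representative and claims every still-unassigned point within distance h. The function name radius_sample, the parameter h, and the toy data are hypothetical and chosen only for illustration.

```python
import math
import random


def radius_sample(points, h, seed=0):
    """Pick representatives so that every point lies within distance h of one.

    Illustrative assumption, not the published procedure.
    Returns (reps, members): reps[k] is the index of the k-th representative,
    and members[k] lists the indices of the points it claimed (itself included).
    """
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)  # visit points in random order
    assigned = [False] * len(points)
    reps, members = [], []
    for i in order:
        if assigned[i]:
            continue
        # Point i becomes a representative and claims all unassigned
        # points within distance h (this always includes i itself).
        group = [
            j for j in range(len(points))
            if not assigned[j] and math.dist(points[i], points[j]) <= h
        ]
        for j in group:
            assigned[j] = True
        reps.append(i)
        members.append(group)
    return reps, members


# Toy data: two well-separated groups of 2-D "events".
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
reps, members = radius_sample(points, h=1.0)
print(len(reps))                  # 2 representatives for 5 points
print(sorted(len(m) for m in members))  # [2, 3]
```

The group sizes in members act as a record of how many original points each representative stands for, so a downstream clustering step run only on the representatives could still weight dense regions more heavily than sparse ones.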


Summary

Introduction

Recent biological discoveries have shown that clustering large datasets is essential for better understanding biology in many areas. Spectral clustering cannot be directly applied to large datasets due to time and memory limitations. To address this issue, we have modified spectral clustering by adding an information-preserving sampling procedure and applying a post-processing stage. A classical approach for analysing biological data is to first group individual data points based on some similarity criterion, a process known as clustering, and then compare the outcome of clustering with the biological hypotheses. An example of this approach is the analysis of flow cytometry data, where populations of cells that express specific intracellular or surface proteins are identified. Flow cytometry is a technique for measuring physical, chemical and biological characteristics of individual microscopic particles such as cells and chromosomes. It has many applications in molecular and cell biology for both clinical diagnosis and research purposes [1]. As thousands of cells can be analyzed per second, cytometers can generate large-

