Scalable Sequence Clustering for Large-Scale Immune Repertoire Analysis

Prem Bhusal, A K M Mubashwir Alam, Ning Jiang, Jun Xiao, Keke Chen

doi:10.1109/bigdata52589.2021.9671320

Abstract

Next-generation sequencing technology has enabled systems immunology researchers to conduct detailed immune repertoire analysis at the molecular level, which helps them understand the health of a patient's immune system. Recent studies have shown that the single-linkage clustering algorithm can give the best results for B cell clonality analysis, a critical type of immune repertoire sequencing (IR-Seq) analysis. Large sequence datasets (e.g., millions of sequences) are being collected to comprehensively understand how a specific person's immune system evolves over different stages of disease development. However, the classical single-linkage clustering algorithm does not scale well to such large sequence datasets. Surprisingly, no study has addressed this scalability issue for immunology research and development. We study three strategies for scaling up the single-linkage algorithm for sequence data: (1) an approximate single-linkage algorithm enhanced with non-Euclidean indexing methods; (2) a Spark-based single-linkage algorithm (SparkMST) that was originally designed for vector data and is modified here for sequence data; and (3) a new tree-based sequence summarization approach, SCT, that aims to reduce the data for single-linkage clustering while preserving clustering quality. We have implemented these approaches and experimented with real sequence datasets for B cell clonality analysis, with three main findings. (1) The index-enhanced hierarchical clustering algorithm (e.g., VPT-HC, which uses the Vantage-Point tree for indexing) preserves the clustering quality very well while significantly reducing the time complexity. (2) The SCT approach, used as a preprocessing step, can effectively reduce the data size for clustering. The combined pipeline, SCT followed by VPT-HC, is the fastest among the evaluated single-machine algorithms, although it also slightly affects the clustering quality. (3) The SparkMST parallel algorithm scales out nicely and gives exact single-linkage clustering results. However, SparkMST is tied to the single-linkage algorithm and cannot be extended to general hierarchical clustering algorithms. Although this study focuses on one specific application area, B cell clonality analysis, we believe the developed scalable techniques may be useful for other sequence data analysis problems.
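
For readers unfamiliar with the baseline, the following is a minimal single-machine sketch of single-linkage clustering on sequences, not the authors' implementation. It assumes Levenshtein (edit) distance as the sequence metric, which the abstract does not specify, and the names `edit_distance` and `single_linkage_clusters` are hypothetical. It processes pairs in increasing distance order with a union-find structure, in the style of Kruskal's minimum spanning tree algorithm, which yields the same flat clusters as cutting the single-linkage dendrogram at the given threshold. The all-pairs distance step is quadratic in the number of sequences, which is the scalability problem the abstract describes.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program.

    Assumption: the abstract does not name a distance; edit distance is
    used here only for illustration.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def single_linkage_clusters(seqs, threshold):
    """Flat clusters obtained by cutting single linkage at `threshold`.

    Pairs are scanned by increasing distance and merged with union-find
    (Kruskal-style). This is O(n^2) distance computations.
    """
    parent = list(range(len(seqs)))

    def find(x):
        # Path-halving find.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted(
        (edit_distance(seqs[i], seqs[j]), i, j)
        for i, j in combinations(range(len(seqs)), 2)
    )
    for dist, i, j in edges:
        if dist > threshold:
            break  # all remaining pairs are farther apart than the cut
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    clusters = {}
    for idx, seq in enumerate(seqs):
        clusters.setdefault(find(idx), []).append(seq)
    return sorted(clusters.values())


if __name__ == "__main__":
    # Toy strings, not real repertoire data.
    toy = ["CARDYW", "CARDFW", "CARDFF", "CTTGGA", "CTTGGC"]
    for cluster in single_linkage_clusters(toy, threshold=1):
        print(cluster)
```

In the toy input, `CARDYW` and `CARDFF` are two edits apart but still share a cluster at threshold 1 because `CARDFW` links them; this chaining behavior is characteristic of single linkage.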

