Optimized Threshold Inference for Partitioning of Clones From High-Throughput B Cell Repertoire Sequencing Data.

Nima Nouri, Steven H Kleinstein

doi:10.3389/fimmu.2018.01687

Open Access

Journal: Frontiers in Immunology
Publication Date: Jul 26, 2018
Citations: 21
License type: CC BY 4.0

Affiliation: Yale University

Abstract

During adaptive immune responses, activated B cells expand and undergo somatic hypermutation of their B cell receptor (BCR), forming a clone of diversified cells that can be related back to a common ancestor. Identification of B cell clones from high-throughput Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) data relies on computational analysis. Recently, we proposed an automated method to partition sequences into clonal groups based on single-linkage hierarchical clustering of the BCR junction region using a length-normalized Hamming distance metric. This method could identify clonal sequences with high confidence on several benchmark experimental and simulated data sets. However, determining the threshold at which to cut the hierarchy, a key step in the method, is computationally expensive for large-scale repertoire sequencing data sets. Moreover, the methodology was unable to provide estimates of accuracy for new data. Here, a new method is presented that addresses this computational bottleneck and also provides a study-specific estimation of performance, including sensitivity and specificity. The method uses a finite mixture model fitting procedure to learn the parameters of two univariate curves that fit the bimodal distribution of the distance vector between pairs of sequences. These distributions are used to estimate the performance of different threshold choices for partitioning sequences into clones. These performance estimates are validated using simulated and experimental data sets. With this method, clones can be identified from AIRR-seq data with sensitivity and specificity profiles that are user-defined based on the overall goals of the study.
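
The mixture-model idea in the abstract can be illustrated with a short Python sketch. It is not the authors' implementation. The abstract only specifies "two univariate curves", so the Gaussian form of both components, the plain EM fitting loop, the grid-search criterion, and every function name here (`fit_two_gaussians`, `estimate_performance`, `pick_threshold`) are assumptions made for illustration. The lower mode is taken to represent distances between clonally related sequences and the upper mode distances between unrelated ones.

```python
from __future__ import annotations

import math


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def fit_two_gaussians(distances, n_iter=200, min_sigma=1e-3):
    """EM for a two-component 1-D Gaussian mixture (assumed curve family).

    Returns (weight, mean, sigma) for the lower mode and for the upper mode.
    """
    xs = sorted(distances)
    if len(xs) < 4:
        raise ValueError("need at least four distances")
    half = len(xs) // 2
    # Crude initialization: means of the lower and upper halves of the data.
    mus = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    spread = max((xs[-1] - xs[0]) / 4.0, min_sigma)
    sigmas = [spread, spread]
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: responsibility of component 0 for every distance.
        resp0 = []
        for x in xs:
            p0 = weights[0] * normal_pdf(x, mus[0], sigmas[0])
            p1 = weights[1] * normal_pdf(x, mus[1], sigmas[1])
            total = p0 + p1
            resp0.append(p0 / total if total > 0 else 0.5)
        # M-step: re-estimate weight, mean and sigma of each component.
        for k in (0, 1):
            r = [g if k == 0 else 1.0 - g for g in resp0]
            mass = max(sum(r), 1e-12)
            mus[k] = sum(ri * x for ri, x in zip(r, xs)) / mass
            var = sum(ri * (x - mus[k]) ** 2 for ri, x in zip(r, xs)) / mass
            sigmas[k] = max(math.sqrt(var), min_sigma)
            weights[k] = mass / len(xs)

    lower, upper = sorted(zip(weights, mus, sigmas), key=lambda c: c[1])
    return lower, upper


def estimate_performance(threshold, lower, upper):
    """Return (sensitivity, specificity) for a cut at `threshold`.

    Sensitivity is the share of the lower curve at or below the cut;
    specificity is the share of the upper curve above it. The mixture
    weights cancel in both ratios.
    """
    sensitivity = normal_cdf(threshold, lower[1], lower[2])
    specificity = 1.0 - normal_cdf(threshold, upper[1], upper[2])
    return sensitivity, specificity


def pick_threshold(lower, upper, steps=200):
    """Grid-search between the two means for the cut that maximizes
    sensitivity + specificity (one possible user-defined criterion)."""
    lo, hi = lower[1], upper[1]
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda t: sum(estimate_performance(t, lower, upper)))
```

Under these assumptions, `fit_two_gaussians` is called once on the vector of distances, after which `estimate_performance` can be evaluated cheaply for any candidate threshold. A study that prioritizes specificity could replace the criterion in `pick_threshold` with one that requires a minimum specificity.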

Highlights

Next-generation sequencing technologies are increasingly being applied to carry out detailed profiling of B cell receptors (BCRs, also referred to as immunoglobulin (Ig) receptors).
Identification of B cell clones from these high-throughput AIRR-seq data relies on computational analysis.
We previously developed an automated approach for determining the distance threshold at which the clustering hierarchy is cut, and demonstrated that using this threshold with single-linkage clustering based on the length-normalized Hamming distance detects clones with high confidence on several benchmark data sets [4].

Summary

Introduction

Next-generation sequencing technologies are increasingly being applied to carry out detailed profiling of B cell receptors (BCRs, also referred to as immunoglobulin (Ig) receptors). The junction region is defined as the CDR3 plus the conserved flanking amino acid residues. Groups of sequences are hierarchically clustered based on the nucleotide similarity of their junction regions, and partitioned into clones by cutting the dendrogram at a fixed distance threshold. We previously developed an automated approach for determining this threshold, and demonstrated that using this threshold with single-linkage clustering based on the length-normalized Hamming distance (i.e., the absolute count of differences between two sequences divided by the length of the sequence) detects clones with high confidence on several benchmark data sets [4]. Here, we propose and validate a computationally efficient threshold inference algorithm for partitioning BCR sequences into clones that allows for study-specific performance estimation.
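
The distance and clustering steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the published tool. It assumes all junctions in a group have the same length, and the function names (`normalized_hamming`, `single_linkage_clones`) are hypothetical. It relies on the fact that cutting a single-linkage dendrogram at height t yields the connected components of the graph that links every pair of sequences at distance t or less.

```python
from __future__ import annotations

from itertools import combinations


def normalized_hamming(a: str, b: str) -> float:
    """Number of mismatched positions divided by the sequence length."""
    if len(a) != len(b):
        raise ValueError("junctions must have equal length")
    if not a:
        raise ValueError("junctions must be non-empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def single_linkage_clones(junctions: list[str], threshold: float) -> list[set[int]]:
    """Partition junction indices by cutting single linkage at `threshold`."""
    parent = list(range(len(junctions)))

    def find(i: int) -> int:
        # Follow parent links to the set representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge the sets of every pair whose distance is within the threshold.
    for i, j in combinations(range(len(junctions)), 2):
        if normalized_hamming(junctions[i], junctions[j]) <= threshold:
            parent[find(i)] = find(j)

    clones: dict[int, set[int]] = {}
    for i in range(len(junctions)):
        clones.setdefault(find(i), set()).add(i)
    return sorted(clones.values(), key=min)
```

For example, with the made-up junctions `["TGTGCAAGA", "TGTGCAAGG", "TGTGCTAGG", "TGTAAATTT"]` and a threshold of 0.15, the first three sequences fall into one clone even though the first and third differ at 2 of 9 positions. The second sequence is within 1/9 of both, which is the chaining behavior characteristic of single linkage. The all-pairs loop is quadratic in the group size, so it is suitable only for small groups.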

Methods

Results

Conclusion