Sumrep: A Summary Statistic Framework for Immune Receptor Repertoire Comparison and Model Validation.

Branden J Olson, Adrian J Shepherd, Anna Obraztsova, Jason A Vander Heiden, Frederick A Matsen, Pejvak Moghimi, Duncan Ralph, Chaim A Schramm, Mikhail Shugay, William Lees

doi:10.3389/fimmu.2019.02533

Abstract

The adaptive immune system generates an incredible diversity of antigen receptors for B and T cells to keep dangerous pathogens at bay. The DNA sequences coding for these receptors arise by a complex recombination process followed by a series of productivity-based filters, as well as affinity maturation for B cells, giving considerable diversity to the circulating pool of receptor sequences. Although sequencing these receptors holds considerable promise for medical and public health applications, the complex structure of the resulting adaptive immune receptor repertoire sequencing (AIRR-seq) datasets makes analysis difficult. In this paper we introduce sumrep, an R package that efficiently performs a wide variety of repertoire summaries and comparisons, and show how sumrep can be used to perform model validation. We find that summaries vary in their ability to differentiate between datasets, although many are able to distinguish between covariates such as donor, timepoint, and cell type for BCR and TCR repertoires. We show that deletion and insertion lengths resulting from V(D)J recombination tend to be more discriminative characterizations of a repertoire than summaries that describe the amino acid composition of the CDR3 region. We also find that state-of-the-art generative models excel at recapitulating gene usage and recombination statistics in a given experimental repertoire, but struggle to capture many physiochemical properties of real repertoires.

Highlights

B cells and T cells play critical roles in adaptive immunity through the cooperative identification of, and response to, antigens
Since l1 multinomial regression outputs a separate coefficient vector β for each response value, we aggregate by taking medians of each dataset-specific lasso ordering for each summary to get the final score
This yields a range of rankings to assess the variation in scores by summary and by inferential model
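
The median aggregation described in the highlights above can be sketched roughly as follows. This is a minimal illustration, not the paper's or sumrep's implementation: it assumes that each "dataset-specific lasso ordering" is a list of summaries ordered by when they enter the lasso path, and the function name `aggregate_lasso_orderings`, the dataset labels, and the summary names are all hypothetical toy values.

```python
from statistics import median


def aggregate_lasso_orderings(orderings):
    """Return the median rank of each summary across dataset-specific orderings.

    `orderings` maps a dataset label to a list of summary names, ordered from
    the first to the last summary to enter the lasso path (an assumed
    interpretation of "lasso ordering"). Rank 1 means "entered first".
    """
    ranks = {}
    for ordered_summaries in orderings.values():
        for rank, summary in enumerate(ordered_summaries, start=1):
            ranks.setdefault(summary, []).append(rank)
    return {summary: median(r) for summary, r in ranks.items()}


# Toy orderings, one per response value (dataset); invented for illustration.
orderings = {
    "donor_A": ["vd_insertion_length", "cdr3_length", "gravy_index"],
    "donor_B": ["vd_insertion_length", "gravy_index", "cdr3_length"],
    "donor_C": ["cdr3_length", "vd_insertion_length", "gravy_index"],
}

scores = aggregate_lasso_orderings(orderings)
# Lower median rank = more discriminative under this toy scoring.
ranking = sorted(scores, key=scores.get)
print(ranking)
```

Taking the median rather than the mean keeps a single unusual dataset-specific ordering from dominating a summary's final score.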

Summary

Introduction

B cells and T cells play critical roles in adaptive immunity through the cooperative identification of, and response to, antigens. The random rearrangement process of the genes that construct B cell receptors (BCRs) and T cell receptors (TCRs) allows for the recognition of a highly diverse set of antigen epitopes. Although immune receptor repertoires are accessible for scientific research and medical applications through high-throughput sequencing, it is not necessarily straightforward to gain insight from and to compare these datasets. Left unprocessed, each dataset is simply a list of DNA sequences. Processing them can be a highly involved task, and so it is common to compare only the gene usage frequencies and CDR3 length distributions of repertoires [7, 8], leaving the full richness of the CDR3 sequence and potentially interesting aspects of the germline-encoded regions unanalyzed.
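
As a rough illustration of the conventional comparison mentioned above, the following sketch computes the CDR3 length distribution of two repertoires and a divergence between them. It is a minimal example under stated assumptions: the Jensen-Shannon divergence is chosen here only as one reasonable way to compare two discrete distributions, the helper names `length_distribution` and `js_divergence` are hypothetical, and the sequences are invented toy data rather than real AIRR-seq output.

```python
from collections import Counter
from math import log2


def length_distribution(cdr3_seqs):
    """Empirical distribution of CDR3 lengths as {length: probability}."""
    counts = Counter(len(seq) for seq in cdr3_seqs)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Ranges from 0 (identical) to 1 (disjoint support).
    """
    div = 0.0
    for x in set(p) | set(q):
        px, qx = p.get(x, 0.0), q.get(x, 0.0)
        mx = 0.5 * (px + qx)  # mixture probability at x
        if px > 0:
            div += 0.5 * px * log2(px / mx)
        if qx > 0:
            div += 0.5 * qx * log2(qx / mx)
    return div


# Toy CDR3 amino acid sequences; invented for illustration only.
repertoire_a = ["CARDYW", "CARGYW", "CARDGYW"]
repertoire_b = ["CARDYW", "CARDGGYW", "CARDGGYW"]

dist_a = length_distribution(repertoire_a)
dist_b = length_distribution(repertoire_b)
print(js_divergence(dist_a, dist_b))
```

A comparison like this reduces each repertoire to a single length histogram, which is exactly why it leaves the CDR3 sequence content itself unexamined.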

Methods

Results

Conclusion