Abstract

Article Figures and data Abstract Editor's evaluation Introduction Results Discussion Materials and methods Data availability References Decision letter Author response Article and author information Metrics Abstract T-cell receptors (TCRs) encode clinically valuable information that reflects prior antigen exposure and potential future response. However, despite advances in deep repertoire sequencing, enormous TCR diversity complicates the use of TCR clonotypes as clinical biomarkers. We propose a new framework that leverages experimentally inferred antigen-associated TCRs to form meta-clonotypes – groups of biochemically similar TCRs – that can be used to robustly quantify functionally similar TCRs in bulk repertoires across individuals. We apply the framework to TCR data from COVID-19 patients, generating 1831 public TCR meta-clonotypes from the SARS-CoV-2 antigen-associated TCRs that have strong evidence of restriction to patients with a specific human leukocyte antigen (HLA) genotype. Applied to independent cohorts, meta-clonotypes targeting these specific epitopes were more frequently detected in bulk repertoires compared to exact amino acid matches, and 59.7% (1093/1831) were more abundant among COVID-19 patients that expressed the putative restricting HLA allele (false discovery rate [FDR]<0.01), demonstrating the potential utility of meta-clonotypes as antigen-specific features for biomarker development. To enable further applications, we developed an open-source software package, tcrdist3, that implements this framework and facilitates flexible workflows for distance-based TCR repertoire analysis. Editor's evaluation This paper introduces and validates a novel concept which will be of great interest to all those interested in T cell immunity and especially the T cell receptor repertoire. The concept builds on the idea that TCRs to the same antigen often share sequence similarities, which they quantify using a bespoke tool tcrdist3. Using this tool they develop the idea of a meta-clone, a set of TCRs sharing biochemical similarities and potentially recognising the same antigen. In this paper they further show that such clonotypes may show increased sharing between HLA-related individuals, and explore the use of such clonotypes in characterising antigen-specific immune response across cohorts of individuals. https://doi.org/10.7554/eLife.68605.sa0 Decision letter Reviews on Sciety eLife's review process Introduction An individual’s unique repertoire of T-cell receptors (TCRs) is shaped by antigen exposure and is a critical component of immunological memory (Emerson et al., 2017; Welsh and Selin, 2002). With the advancement of immune repertoire profiling, TCR repertoires are a largely untapped source of biomarkers that could potentially be used to predict immune responses to a wide range of exposures including viral infections (Wolf et al., 2018), tumor neoantigens (Ahmadzadeh et al., 2019; Chiou et al., 2021; Kato et al., 2018), or environmental allergens (Cao et al., 2020). However, the extreme diversity characterizing TCR repertoires, both within and between individuals, presents major hurdles to biomarker development. Using peptide—major histocompatibility complex (pMHC) tetramer sorting to focus on TCRs recognizing individual epitopes, which depends on knowing the peptide antigen and its MHC restriction, typically reveals that many distinct TCRs are able to recognize even a single pMHC (Coles et al., 2020; Meysman et al., 2019). This complicates detection of population-wide signatures of antigen exposure. Modeling (Elhanati et al., 2018) and empirical evidence (Soto et al., 2019) suggest that only 10–15% of single-chain TCRs are public or shared by multiple individuals. Furthermore, only a fraction of the repertoire can be sampled, making it difficult to reproducibly detect relevant TCR clonotypes from an individual, let alone reliably detect public clonotypes in a population; in practice, the problem can be exacerbated by heterogeneous repertoire sequencing depth, which affects the precision with which the frequency of rare TCRs can be estimated. Thus, individual T-cell clonotypes are currently suboptimal and underpowered for population-level investigations of TCR specificity, which limits their application in the development of TCR-based clinical biomarkers. In this study, we describe a framework for engineering ‘meta-clonotypes’: groupings of TCRs sharing biochemically similar complementarity determining regions (CDRs), which enable population-level biomarker development (Figure 1). Previously, we introduced TCRdist, a biochemically informed distance metric that enabled grouping of paired αβ TCRs by antigen specificity based on their sequence similarity (Dash et al., 2017). TCRdist is correlated with edit distance, but it can vary considerably among TCRs with identical edit distances (Figure 2). While other tools exist to identify statistically anomalous groups of TCRs within a single sample that may be indicative of a polyclonal response to antigenic selection (Glanville et al., 2017; Huang et al., 2020; Pogorelyy et al., 2019; Pogorelyy and Shugay, 2019; Ritvo et al., 2018; Shugay et al., 2015), the meta-clonotype framework has been developed for a different task: leveraging receptor–antigen associations determined from in vitro experiments to create public, antigen-associated meta-clonotypes from otherwise private TCRs. This application is made possible by a new open-source Python3 software package tcrdist3 that brings flexibility to distance-based repertoire analysis, allowing customization of the distance metric, and at-scale computation with sparse data representations and parallelized, byte-compiled code. Figure 1 Download asset Open asset T-cell receptor (TCR) meta-clonotypes. (A) Defining meta-clonotypes from antigen-associated TCRs. Sets of antigen-associated TCRs were used together with synthetic background repertoires to engineer TCR meta-clonotypes that define biochemically similar TCRs based on a centroid TCR and a TCRdist radius. For each antigen-specific clonotype, we used tcrdist3 to evaluate the proportion of TCRs spanned at different TCRdist radii within (i) its antigen-associated TCR set (black) and (ii) a synthetic control V- and J-gene-matched background set (purple). A synthetic background was generated using 100,000 Optimized Likelihood estimate of Immunoglobulin Amino acid sequences (OLGA)-generated TCRs and 100,000 TCRs subsampled from umbilical cord blood; OLGA-generated TCRs were sampled to match the V–J gene frequency in each MIRA receptor set, with weighting to account for the sampling bias (see Methods for details). The objective was to select the largest radius that includes no more than an estimated proportion of 1E−6 TCRs in the background. The subset of antigen-associated TCRs spanned by the selected radius were then used to develop an additional meta-clonotype motif constraint based on conserved residues in the complementarity determining region (CDR)3 (see Methods for details). An example logo plot shows the CDR3 β-chain motif formed from TCRs – activated by a SARS-CoV-2 peptide (MIRA55 ORF1ab amino acids 1316:1330, ALRKVPTDNYITTY) – within a TCRdist radius 16 of this meta-clonotype’s centroid TCR. (B) Quantifying meta-clonotype conformant TCRs in bulk repertoires. The definition of each TCR meta-clonotype can be used to quantify the frequency of similar TCRs in bulk repertoires. EXACT sequences match the meta-clonotype centroid at the amino acid level, RADIUS-conformant sequences diverge from the centroid by no more than the radius distance, and RADIUS + MOTIF conformant sequences is the subset of radius-conformant TCRs with a CDR3 sequences matching the meta-clonotype’s CDR3 motif. (C) Population-level analysis of TCR meta-clonotype frequency. The frequency of meta-clonotype conformant sequences in multiple bulk repertoires allows comparison across a population. In this study, to test whether meta-clonotypes carry important antigen-specific signals above and beyond individual clonotypes, we searched for meta-clonotype conformant TCRs in COVID-19 patients with repertoires collected 0–30 days after diagnosis. We found stronger associations with predicted HLA restrictions based on counts of meta-clonotype conforming TCRs compared to associations using counts of exact clonotypes. Figure 2 Download asset Open asset TCRdist compared to edit distance. (A) Correspondence between edit distance (x-axis) and TCRdist (y-axis) for MIRA55 T-cell receptors (TCRs) with matching TRBV genes. The grayscale colormap shows the percentage of TCRs with a given TCRdist score within each edit distance category. (B) Examples of complementarity determining region (CDR)3s with TCRdist varying between 6 and 24 units among sequences with edit distance 2 (2 substitutions) from a centroid with matching TRBV genes. TCR distances range based on differential penalties assigned to specific residue substitutions. The framework is based on TCR sequences that have been experimentally enriched for antigen recognition, most commonly by sorting T cells labeled by peptide–MHC multimers or by activation-induced markers upon stimulation (we refer to these as ‘antigen-associated’ TCRs). Each meta-clonotype is defined by an antigen-associated centroid TCR and a TCRdist radius chosen so that the expected frequency of antigen-naive receptors within the radius is low. A CDR3 ‘motif’ is constructed from the subset of antigen-associated TCRs within the radius to further refine the meta-clonotype’s specificity. Together the centroid receptor, radius, and the CDR3 motif can be used to search for conformant TCRs in large bulk-sequenced repertoires and quantify their frequency (Figure 1). As intended, we find that TCR centroids, which are often private, gain publicity as meta-clonotypes. The expanded publicity of meta-clonotypes provides an opportunity to develop population-level biomarkers that may depend on antigen-specific features of the TCR repertoire. Shifting the focus of repertoire analysis from clonotypes to meta-clonotypes increases statistical power; grouping similar clonotypes reduces the sparsity of finite repertoire samples and increases the precision with which antigen-specific cell abundance can be estimated. To demonstrate one potential application of meta-clonotypes and to characterize their ability to estimate the frequency of similar antigen-specific T cells in bulk-sequenced TCR repertoires, we apply the meta-clonotype framework to a large publicly available dataset of SARS-CoV-2 antigen-associated TCRs. The dataset comes from a recent study that sought to elucidate the role of cellular immune responses in acute SARS-CoV-2 infection and examined the TCR repertoires of patients diagnosed with COVID-19 disease. Researchers used an assay based on antigen stimulation and flow cytometric sorting of activated CD8+ T cells to identify SARS-CoV-2 peptide-associated TCR β-chains; the assay is called ‘multiplex identification of TCR antigen specificity’ or MIRA (Klinger et al., 2015) and the output is a set of predicted antigen-associated TCR sequences. Data from these experiments were released publicly in July 2020 by Adaptive Biotechnologies and Microsoft as part of ‘immuneRACE’ and their efforts to stimulate science on COVID-19 (Nolan et al., 2020; Snyder et al., 2020). The MIRA antigen stimulation assays identified 253 sets of 6 or more TCR β-chains associated with CD8+ T cells activated by exposure to SARS-CoV-2 peptides, with TCR sets analyzed ranging in size from 6 to 16,607 TCRs (Supplementary file 1b); we refer to these sets as MIRA0 through MIRA252 in rank order by their size. The deposited immuneRACE datasets also included bulk TCR β-chain repertoires from 694 patients within 0–30 days of COVID-19 diagnosis. Our analysis of these data demonstrates how it is possible to define public meta-clonotypes from sets of private antigen-associated TCRs and directly evaluates their ability to carry population-level antigen-specific signals in comparison with individual clonotypes. Results Experimental enrichment of antigen-specific T cells allows discovery of TCRs with biochemically similar neighbors Searching for identical TCRs within a repertoire – arising either from clonal expansion or convergent nucleotide encoding of amino acids in the CDR3 – is a common strategy for identifying functionally important receptors. However, in the absence of experimental enrichment procedures, observing T cells with the same amino acid TCR sequence in a bulk sample is rare. For example, in 10,000 β-chain TCRs from an umbilical cord blood sample, less than 1 % of TCR amino acid sequences were observed more than once, inclusive of possible clonal expansions (Figure 3A). By contrast, a valuable feature of antigen-associated TCRs is the presence of multiple receptors with identical or highly similar amino acid sequences (Figure 3A). For instance, 45% of amino acid TCR sequences were observed more than once (excluding clonal expansions) in a set of influenza M1(GILGFVFTL)-A*02:01 peptide–MHC tetramer-sorted subrepertoires from 15 subjects (Dash et al., 2017). Enrichment was evident compared to cord blood for additional peptide–MHC tetramer-sorted subrepertoires obtained from VDJdb (Shugay et al., 2018), though the proportion of TCRs with an identical or similar TCR in each set was heterogeneous. Figure 3 Download asset Open asset Experimental enrichment of antigen-associated T-cell receptors (TCRs) increases neighbor density. (A) TCR repertoire subsets obtained by single-cell sorting with peptide–major histocompatibility complex (MHC) tetramers (green), MIRA peptide stimulation enrichment (MIRA55, MIRA48; purple), or random subsampling of umbilical cord blood (1000 or 10,000 TCRs; blue). Biochemical distances were computed among all pairs of TCRs in each subset using the TCRdist metric. Neighborhoods were formed around each TCR using a variable radius (x-axis) and the percent of TCRs in the set with at least one other TCR within its neighborhood was computed; notably the line represents a summary of TCRs in each set and is therefore more precise for larger TCR sets. A radius of zero indicates the proportion of TCRs that have at least one TCR with an identical amino acid sequence (solid square). Dash BMLF (Epstein–Barr Virus), M1 (Influenza), and pp65 (Cytomegalovirus) refer to epitopes from Dash et al., 2017. ELAGIGILTV (Human Mart-1 antigen) and LLLGIFILV (HM1.24 antigen in multiple myeloma) downloaded from VDJdb (Shugay et al., 2018), which were submitted by Andrew Sewell et al. (B) Analysis of MIRA sets for which the participants contributing the TCRs were significantly enriched with a specific class I HLA allele Supplementary file 1c. Colors are assigned based on the vertical ranking of the lines along the right y-axis and match the order in the color legend. We investigated the degree to which the MIRA assay employed by Nolan et al., 2020 identified TCRs with identical or similar amino acid sequences. In general, across sets of MIRA-identified β-chain TCRs, each associated with a different antigen, the proportion of amino acid sequences observed more than once was generally lower than in the tetramer-enriched repertoires and varied considerably across the sets; some MIRA sets resembled tetramer-sorted subrepertoires (Figure 3B; see MIRA133), while others were more similar to unenriched repertoires (Figure 3B; see MIRA90). The increased diversity in MIRA-enriched TCR sets versus tetramer-enriched TCR sets may, in part, be explained by: (1) peptides being presented by the full complement of the native host’s MHC molecules compared to a single defined peptide–MHC complex, (2) recruitment of lower affinity receptors, (3) antigen specificity conferred primarily by the alpha rather than the sequenced beta chain, or (4) nonspecific ‘bystander’ activation in the MIRA stimulation assay. From an experimental standpoint, MIRA offers the benefit of being able to identify TCRs associated with an antigen before a specific pMHC has been identified; however, the resultant diversity in antigen-associated TCRs recovered by MIRA poses a challenge for identifying relevant TCR motifs associated with multiple possible TCR:pMHC interactions. TCR biochemical neighborhood density is heterogeneous among set of antigen-associated TCRs We next investigated the proportion of unique TCRs with at least one biochemically similar neighbor among TCRs with the same putative antigen specificity. We and others have shown that a single peptide–MHC epitope is often recognized by many distinct TCRs with closely related amino acid sequences Dash et al., 2017; in fact, the detection of such clusters in bulk-sequenced repertoires is the basis of several existing tools: GLIPH (Glanville et al., 2017; Huang et al., 2020), ALICE (Pogorelyy et al., 2019), TCRNET (Ritvo et al., 2018), and RepAn (Yohannes et al., 2021). Therefore, to better understand sets of antigen-associated TCRs, like the SARS-CoV-2 MIRA data, we evaluated the neighborhood surrounding each TCR, defined as the set of similar TCRs whose sequence divergence is within a specified radius. The radius was measured using TCRdist, a position weighted, multi-CDR distance metric. Briefly, differences in the amino acid sequences of the CDRs are totaled based on the number of gaps (−4) and their BLOSUM62 substitution penalties (ranging from 0 to −4) with a default threefold weighting on CDR3 substitutions (see Methods for details of tcrdist3 reimplementation of TCRdist); a one amino acid mismatch in the CDR3 results in a maximal distance of 12 TCRdist units (tdus). As the radius about a TCR centroid expands, the number of TCRs it encompasses naturally increases. The increase was greater among the sets of antigen-associated TCRs compared to the ‘background’ repertoires that were not experimentally enriched for antigen-specific T cells (Figure 3). To better understand the relationship between the TCR distance radius and the density of proximal TCRs, we constructed empirical cumulative distribution functions (ECDFs) for each unique TCR (Figure 4). The ECDF for each unique TCR (each represented by one line in Figure 4) shows the proportion of all TCRs within the indicated radius; those with sparse neighborhoods appear as lines that remain low and do not increase along the y-axis even as the search radius expands (lines are hidden by the x-axis). The proportion of these TCRs with sparse or empty neighborhoods (ECDF proportion <0.001) is indicated by the height of the gray area plotted below the ECDF (Figure 4). We observed the highest density neighborhoods within repertoires that were sorted based on binding to a single peptide–MHC tetramer. For instance, with the influenza M1(GILGFVFTL)-A*02:01 tetramer-enriched repertoire from 15 subjects, we observed that many TCRs were concentrated in dense neighborhoods, which included as much as 30 % of the other influenza M1-recognizing TCRs within a radius of 12 tdus (Figure 4A). Notably there were also many TCRs with empty or sparse neighborhoods using a radius of 12 tdus (111/247, 44%) or 24 tdus (83/247, 34%). Based on previous work (Dash et al., 2017), we assume that the majority of these tetramer-sorted CD8+ T cells with few proximal neighbors do indeed bind the influenza M1:A*02:01 tetramer. This suggests that TCRs within sparse neighborhoods represent uncommon modes of antigen recognition and highlights the broad heterogeneity of neighborhood densities even among TCRs recognizing a single peptide–MHC. Figure 4 Download asset Open asset T-cell receptor (TCR) neighborhoods have higher density among TCRs that have been experimentally enriched for antigen-specific T cells compare to unenriched repertoires. TCR β-chains from (A) a peptide–major histocompatibility complex (MHC) tetramer-enriched subrepertoire (n = 247), (B) a MIRA peptide stimulation-enriched subrepertoire (n = 497), or (C) an umbilical cord blood unenriched repertoire (n = 9966), and (D) synthetically generated sequences using Optimized Likelihood estimate of Immunoglobulin Amino acid sequences (OLGA; n = 10,000; Sethna et al., 2019). Within each subrepertoire, an empirical cumulative distribution function (ECDF) was estimated for each TCR (one line) acting as the centroid of a neighborhood over a range of distance radii (x-axis). Each ECDF shows the proportion of TCRs within the set with a distance to the centroid less than the indicated radius. ECDF color corresponds to the length of the complementarity determining region (CDR)3-β loop. ECDF curves were randomly shifted by <1 unit along the x-axis to reduce overplotting. Vertical ECDF lines starting at 10−4 indicate no similar TCRs at or below that radius. Percentage of TCRs with an ECDF proportion <10−3 (bottom panels), indicates the percentage of TCRs without, or with very few biochemically similar neighbors at the given radius. Neighbor densities for individual TCRs within the MIRA sets were highly heterogeneous. Densities for an illustrative MIRA set are shown in Figure 5 (MIRA55:ORF1ab; 1316:1330 [amino acid]; peptide ALRKVPTDNYITTY). Within this antigen-associated repertoire, at 24 tdus 8.9 % (44/497) of TCR neighborhoods included >10% of the other antigen-activated CD8+ TCRs (Figure 5A). As expected, TCR neighborhoods in the umbilical cord blood repertoire were sparser (Figure 5B); the densest neighborhood included only 0.13 % of the repertoire at 24 tdus. We also noted that TCRs with sparse or empty neighborhoods tended to have longer CDR3 loops (Figure 4C) and lower generation of probability (pgen; Figure 5B). This is consistent with mathematical modeling that shows that TCRs with longer CDR3 loops have a lower pgen during genomic recombination of the TCR locus (Marcou et al., 2018; Murugan et al., 2012; Sethna et al., 2019). Absent strong selection for antigen recognition, longer TCRs with lower generation probabilities are thus likely to have less dense biochemical neighborhoods. Together, these observations suggest that biochemical neighborhood density is highly heterogeneous among TCRs and that it may depend on mechanisms of antigen recognition as well as receptor V(D)J recombination biases (Thomas and Crawford, 2019). Figure 5 Download asset Open asset Radius-defined neighborhood densities within an antigen-associated and a synthetic background repertoire. (A) Each T-cell receptor (TCR) (one line, n = 497) in the MIRA55 antigen-associated set acts as the centroid of a neighborhood and an empirical cumulative distribution function (ECDF) is estimated over a range of distance radii (x-axis). Each ECDF shows the proportion of TCRs within the MIRA set having a distance to the centroid less than the indicated radius. The ECDF line color corresponds to the TCR probability of generation (pgen) estimated using Optimized Likelihood estimate of Immunoglobulin Amino acid sequences (OLGA; Sethna et al., 2019). The ECDF curves are randomly shifted by <1 unit along the x-axis to reduce overplotting. The bottom panel shows the percentage of TCRs with an ECDF proportion <10−3. (B) Estimated ECDF for each MIRA55 TCR based on the proportion of TCRs in a synthetic background repertoire that are within the indicated radius (x-axis). A synthetic background was generated using 100,000 OLGA-generated TCRs and 100,000 TCRs subsampled from umbilical cord blood; OLGA-generated TCRs were sampled to match the V–J gene frequency in the MIRA 55 receptor set, with weighting to account for the sampling bias (see Methods for details). (C) Antigen-associated ECDF (y-axis) of one example TCR’s neighborhood (red line) plotted against ECDF within the synthetic background (x-axis). Example TCR neighborhood is the same indicated by the red line in (A) and (B). The dashed gray line indicates neighborhoods that are equally dense with TCRs from the antigen-associated and background subrepertoires. Annotations indicate the meta-clonotype radius for each data point in TCRdist units. Meta-clonotype radius can be tuned to balance sensitivity and specificity The utility of a TCR-based biomarker depends on the antigen specificity of the TCRs. Therefore, a limitation of distance-based clustering is the presence of similar TCR sequences that lack the ability to recognize the target antigen. To be useful, a meta-clonotype definition should be broad enough to capture multiple biochemically similar TCRs with shared antigen recognition, but not excessively broad as to include a high proportion of nonspecific TCRs. Statistically, we think of a meta-clonotype definition as a way to balance sensitivity and specificity, respectively, the ability to include antigen-recognizing TCRs and exclude nonspecific TCRs. Because the density of neighborhoods around TCRs are heterogeneous, we hypothesized that the optimal radius defining a meta-clonotype may differ for each TCR. To find the ideal radius we proposed comparing the relative density of a radius-defined neighborhood within a set of antigen-associated TCRs (Figure 5A) to the density of the radius-defined neighborhood within a background TCR repertoire (Figure 5B,C); here, the background repertoire can be any set of TCRs from antigen-naive repertoires. Though ideally this background comparator would explicitly exclude antigen-specific TCRs, we can use a nonselected repertoire as a background because the frequency of antigen-specific T cells in a large background drawn from antigen-naive donors is assumed to be low. Also, a nonselected background is a relevant comparator because it provides an estimate of the number of false detections we expect when each meta-clonotype is ultimately used to search for and quantify putatively antigen-specific sequences in bulk repertoires. An ideal radius would define a meta-clonotype with a high density of conformant sequences within a set of antigen-associated TCRs and a low density among a set of background TCRs. We demonstrate an algorithm for selecting an optimal radius for each TCR in the MIRA55:ORF1ab dataset, which includes TCRs from 15 COVID-19 diagnosed subjects (see Methods for details about MIRA and the immuneRACE dataset). First, an ECDF is constructed for each centroid TCR showing the relationship between the meta-clonotype radius and its sensitivity: its inclusion of similar antigen-recognizing TCRs, approximated by the proportion of TCRs in the antigen-associated MIRA set that are within the centroid’s radius (Figure 5A). Next, an ECDF is constructed for each TCR showing the relationship between the meta-clonotype radius and its specificity: its exclusion of TCRs with divergent antigen recognition, approximated by the proportion of TCRs in a background repertoire within the centroid’s radius (Figure 5B). The objective is to select the largest radius that includes no more than one in one-million background TCRs; while this threshold is arbitrary, practically it means that in a deeply sampled repertoire we expect to observe only a few TCRs within the radius and that deviation from this may indicate antigenic selection. Typically, to accurately estimate the frequency of a rare event, one would prefer to observe many such events and use the average; here, this would require having a background set of many millions of TCRs that would be used to evaluate every potential radius for each TCR centroid, presenting a substantial computational hurdle. We noted that, based on germline encoded CDR residues alone, much of the TCR background is too distant from a single TCR to be relevant and therefore realized that efficiency could be gained by focusing on TCRs that share the same TRBV and TRBJ genes. Therefore, for each set of antigen-associated TCRs identified using MIRA, we created a two part background. One part consisted of 100,000 synthetic TCRs whose TRBV- and TRBJ-gene frequencies matched those of the antigen-associated TCRs; TCRs were generated using the software OLGA with slight modification to allow V–J gene directed sequence generation (Marcou et al., 2018; Sethna et al., 2019). The other part consisted of 100,000 umbilical cord blood TCRs sampled from 8 subjects (Britanova et al., 2016). This composition balanced denser sampling of sequences near the candidate meta-clonotype centroids with broad sampling of TCRs from an antigen-naive repertoire. The dense sampling of TCRs with similar V–J combinations to the antigen-associated TCRs allowed for estimation of the overall frequency of meta-clonotype neighbors in the background well below 1 in 200,000. Conceptually, this is achievable because we oversampled the TCRs that were more likely to be within the meta-clonotype radius, therefore greatly increasing the statistical efficiency and precision with which we could estimate the overall frequency of meta-clonotype neighbors in the background. This idea is an adaptation of methods that are commonly used to adjust survey results when the sampling has known biases (Gelman, 2007). It is helpful to demonstrate the concept with an example: suppose we generate background TCRs (i.e., OLGA-generated) from one V–J gene combination, if we find that 5/50,000 (i.e., 10–4) TCRs are within a TCR’s radius, but that they are sampled from a V–J gene combination with 1% (i.e., 10–2) prevalence, the estimated frequency in the full background would be 1 in 1 million (10−4 × 10−2 = 10−6). Across all V–J gene defined strata in the synthetic background, the adjusted frequency PBG can be estimated as a weighted average, with V–J gene strata frequencies from the full background as the weights (wi): PBG=∑i=1NVJYiniwi where Yi is the number of TCRs within the radius of the centroid in the ith V–J gene defined strata and ni is the number of t

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call