Population-level genomic analysis of immunoglobulin loci variation in rhesus macaques reveals extensive germline diversity.
- Research Article
4
- 10.1101/2025.01.07.631319
- Mar 6, 2025
- bioRxiv : the preprint server for biology
Rhesus macaques (RMs) are a vital model for studying human disease and invaluable to pre-clinical vaccine research, particularly for the study of broadly neutralizing antibody responses. Such studies require robust genetic resources for antibody-encoding genes within the immunoglobulin (IG) loci. The complexity of the IG loci has historically made them challenging to characterize accurately. To address this, we developed novel experimental and computational methodologies to generate the largest collection to date of integrated antibody repertoire and long-read genomic sequencing data in 106 Indian origin RMs. We created a comprehensive resource of IG heavy and light chain variable (V), diversity (D), and joining (J) alleles, as well as leader, intronic, and recombination signal sequences (RSSs), including the curation of 1474 novel alleles, unveiling tremendous diversity, and expanding existing IG allele sets by 60%. This publicly available, continually updated resource (https://vdjbase.org/reference_book/Rhesus_Macaque) provides the foundation for advancing RM immunogenomics, vaccine discovery, and translational research.
- Research Article
1
- 10.1101/gr.278775.123
- Oct 21, 2024
- Genome research
Allelic variability in the adaptive immune receptor loci, which harbor the gene segments that encode B cell and T cell receptors (BCR/TCR), is of critical importance for immune responses to pathogens and vaccines. Adaptive immune receptor repertoire sequencing (AIRR-seq) has become widespread in immunology research, making it the most readily available source of information about allelic diversity in immunoglobulin (IG) and T cell receptor (TR) loci. Here, we present a novel algorithm for extrasensitive and specific variable (V) and joining (J) gene allele inference, allowing the reconstruction of individual high-quality gene segment libraries. The approach can be applied for inferring allelic variants from peripheral blood lymphocyte BCR and TCR repertoire sequencing data, including hypermutated isotype-switched BCR sequences, thus allowing high-throughput novel allele discovery from a wide variety of existing data sets. The developed algorithm is a part of the MiXCR software. We demonstrate the accuracy of this approach using AIRR-seq paired with long-read genomic sequencing data, comparing it to a widely used algorithm, TIgGER. We applied the algorithm to a large set of IG heavy chain (IGH) AIRR-seq data from 450 donors of ancestrally diverse population groups, and to the largest reported full-length TCR alpha and beta chain (TRA and TRB) AIRR-seq data set, representing 134 individuals. This allowed us to assess the genetic diversity within the IGH, TRA, and TRB loci in different populations and to establish a database of alleles of V and J genes inferred from AIRR-seq data and their population frequencies with free public access through the VDJ.online database.
- Research Article
21
- 10.3389/fimmu.2018.01687
- Jul 26, 2018
- Frontiers in Immunology
During adaptive immune responses, activated B cells expand and undergo somatic hypermutation of their B cell receptor (BCR), forming a clone of diversified cells that can be related back to a common ancestor. Identification of B cell clones from high-throughput Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) data relies on computational analysis. Recently, we proposed an automated method to partition sequences into clonal groups based on single-linkage hierarchical clustering of the BCR junction region with length-normalized Hamming distance metric. This method could identify clonal sequences with high confidence on several benchmark experimental and simulated data sets. However, determining the threshold to cut the hierarchy, a key step in the method, is computationally expensive for large-scale repertoire sequencing data sets. Moreover, the methodology was unable to provide estimates of accuracy for new data. Here, a new method is presented that addresses this computational bottleneck and also provides a study-specific estimation of performance, including sensitivity and specificity. The method uses a finite mixture model fitting procedure for learning the parameters of two univariate curves which fit the bimodal distribution of the distance vector between pairs of sequences. These distributions are used to estimate the performance of different threshold choices for partitioning sequences into clones. These performance estimates are validated using simulated and experimental data sets. With this method, clones can be identified from AIRR-seq data with sensitivity and specificity profiles that are user-defined based on the overall goals of the study.
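The clonal grouping step described in the abstract above (single-linkage clustering of junction sequences under a length-normalized Hamming distance) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the example junctions, and the fixed threshold of 0.15 are assumptions for illustration only, and the abstract's mixture-model procedure for choosing the threshold is not reproduced here.

```python
from itertools import combinations


def normalized_hamming(a: str, b: str) -> float:
    """Hamming distance divided by sequence length (equal-length inputs only)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def single_linkage_clones(junctions, threshold):
    """Group junction indices into clones.

    Cutting a single-linkage hierarchy at `threshold` is equivalent to taking
    connected components of the graph whose edges join pairs with distance
    <= threshold, so a small union-find structure is enough.
    Junctions of different lengths are never linked (an assumption here).
    """
    parent = list(range(len(junctions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(junctions)), 2):
        if len(junctions[i]) != len(junctions[j]):
            continue
        if normalized_hamming(junctions[i], junctions[j]) <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(junctions)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical junction sequences, for illustration only.
example_junctions = ["TGTGCAAGA", "TGTGCAAGG", "TGTGCTAGG", "TGTAAACCC", "TGTGCA"]
example_clones = single_linkage_clones(example_junctions, threshold=0.15)
```

In this toy input, sequences 0 and 2 are two mismatches apart (above the threshold) but still end up in the same clone because each is one mismatch from sequence 1, which is the chaining behavior characteristic of single linkage.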
- Dissertation
- 10.14232/phd.10113
- Sep 5, 2019
Introduction: The human cytomegalovirus (HCMV) is a ubiquitous herpesvirus and has a complex transcriptome. Polycistronism and alternative splicing make forming accurate transcript models particularly challenging. Long-read sequencing is a powerful novel tool that is able to distinguish between isoforms and discern a complex transcriptome. In order to gain a better insight into the transcriptional repertoire of the virus, we have sequenced the lytic HCMV transcriptome on multiple third-generation sequencing platforms. Our main objectives were to determine exon-connectivity, and to annotate the lytic transcriptome of the virus. In order to utilize the power of long-read sequencing, we have developed a pipeline that is suited for the analysis of long-read RNA sequencing data and is able to compare results obtained from different sequencing platforms. We also aimed to characterize the performance of each sequencing platform and library preparation method based on their ability to sequence full-length genuine transcripts. Materials and Methods: Two biologically independent samples were sequenced. The first sample was subjected to cDNA sequencing on the Pacific Biosciences (PacBio) RSII and Sequel platforms as well as cDNA and dRNA sequencing on the Oxford Nanopore Technologies (ONT) MinION platform. The second sample was used for cap-selected cDNA sequencing on the MinION platform. The data were analysed using a custom pipeline utilizing the biopython and the pysam modules, and the bedtools software. Custom scripts were written to generate read statistics, characterize transcripts and to compare results. Results: Over 80,000 cDNA reads were obtained from the two PacBio platforms and over 1,000,000 cDNA reads from the MinION platform. The direct RNA sequencing yielded 36,195 reads. The direct RNA sequencing reads were used to validate the cDNA sequencing results. We have created a pipeline for the analysis of long-read RNA sequencing data which accepts mapped sequencing reads produced by any long-read sequencing platform, and outputs a transcriptome annotation based on the sequenced reads. 440 isoforms were detected in our dataset. 377 of them were novel isoforms. The novel transcripts include TSS-, TES- or alternatively spliced isoforms of known genes, antisense transcripts and a novel intergenic transcript in the short repeat region. Many of the transcript isoforms only differed from each other in a few nucleotides; however, interestingly, most isoforms differed from each other in the combination of ORFs that they contained. Discussion: Our results have more than doubled the number of annotated HCMV transcripts. Cross-platform validation gives these novel features high confidence. Using long-read RNA sequencing data we were able to draw a more detailed map of the HCMV transcriptome, which is instrumental both for the analysis of the viral gene expression and for understanding the molecular mechanisms of infection. Long-read RNA sequencing has discovered countless new isoforms in all the organisms for which it has been used. The biological function of most of these isoforms is currently unknown. However, our results show that many of the isoforms have distinct coding potentials, meaning that they code for different peptides or express upstream ORFs which may play a regulatory role during translation. With the headway of long-read sequencing technologies, the importance of bioinformatics tools that can analyse such data is increasing. We developed a pipeline which can rapidly process long-read RNA sequencing data from different platforms and create a transcriptome annotation which can be utilized by users with no bioinformatics background.
- Research Article
- 10.1101/2025.10.22.683890
- Oct 23, 2025
- bioRxiv
Antibodies, or immunoglobulins (IG), are central to the vertebrate adaptive immune system, yet the genomic architecture of IG loci remains poorly characterized in many nonhuman primates. In this study, we present the first comprehensive genomic analysis of the immunoglobulin (IG) loci (IGH, IGL, and IGK) in two critically endangered orangutan species, Pongo abelii (Sumatran orangutan) and Pongo pygmaeus (Bornean orangutan), across multiple genome assemblies. Using an IMGT-standardized biocuration framework combined with read-level structural validation, we identified previously undocumented haplotype-specific variation, including multigene duplications, asymmetric gene absence, and species-specific expansions of variable gene families. Recombination signal sequence (RSS) and switch region analyses revealed conserved regulatory motifs with potential implications for V(D)J recombination and class-switch recombination. These findings underscore the complexity and evolutionary adaptability of IG loci in great apes and highlight the value of orangutans as key references for understanding immune system evolution in the Hominidae lineage.
- Research Article
73
- 10.1093/nar/gkq391
- May 16, 2010
- Nucleic Acids Research
Recombination signal sequences (RSSs) flanking V, D and J gene segments are recognized and cut by the VDJ recombinase during development of B and T lymphocytes. All RSSs are composed of a conserved heptamer (seven nucleotides), followed by a spacer (containing either 12 ± 1 or 23 ± 1 poorly conserved nucleotides) and a conserved nonamer. Errors in V(D)J recombination, including cleavage of cryptic RSSs outside the immunoglobulin and T cell receptor loci, are associated with oncogenic translocations observed in some lymphoid malignancies. We present in this paper the RSSsite web server, which is available from the address http://www.itb.cnr.it/rss. RSSsite consists of a web-accessible database, RSSdb, for the identification of pre-computed potential RSSs, and of the related search tool, DnaGrab, which allows the scoring of potential RSSs in user-supplied sequences. This latter algorithm makes use of probability models, which can be recast as Bayesian networks, taking into account correlations between groups of positions of a sequence, developed starting from specific reference sets of RSSs. In validation laboratory experiments, we selected 33 predicted cryptic RSSs (cRSSs) from 11 chromosomal regions outside the immunoglobulin and TCR loci for functional testing.
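The RSS layout described in the abstract above (heptamer, spacer of 12 ± 1 or 23 ± 1 nucleotides, nonamer) lends itself to a small scanning sketch in Python. This is not the DnaGrab algorithm, which the abstract says relies on probability models; the sketch below substitutes a plain mismatch count against the commonly cited consensus heptamer CACAGTG and nonamer ACAAAAACC, which are not stated in the abstract. The function names, the mismatch cutoff, and the example sequence are assumptions.

```python
HEPTAMER = "CACAGTG"      # commonly cited consensus heptamer (not from the abstract)
NONAMER = "ACAAAAACC"     # commonly cited consensus nonamer (not from the abstract)
SPACER_LENGTHS = (11, 12, 13, 22, 23, 24)  # 12 +/- 1 and 23 +/- 1


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_candidate_rss(seq: str, max_mismatches: int = 2):
    """Return (heptamer_start, spacer_length, total_mismatches) for each window
    whose heptamer plus nonamer differ from the consensus by at most
    `max_mismatches` positions. Only the forward strand is scanned."""
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - len(HEPTAMER) + 1):
        heptamer = seq[start:start + len(HEPTAMER)]
        for spacer in SPACER_LENGTHS:
            nonamer_start = start + len(HEPTAMER) + spacer
            nonamer = seq[nonamer_start:nonamer_start + len(NONAMER)]
            if len(nonamer) < len(NONAMER):
                continue  # window runs past the end of the sequence
            total = mismatches(heptamer, HEPTAMER) + mismatches(nonamer, NONAMER)
            if total <= max_mismatches:
                hits.append((start, spacer, total))
    return hits


# Hypothetical sequence: 2 nt flank, consensus heptamer, 12 nt spacer,
# consensus nonamer, 2 nt flank.
example_seq = "TT" + "CACAGTG" + "GTCTGATCGTCA" + "ACAAAAACC" + "TT"
exact_hits = find_candidate_rss(example_seq, max_mismatches=0)
```

A mismatch count treats all positions as equally important and independent, which is exactly the simplification that position-correlated probability models are meant to avoid, so this sketch is best read as a baseline rather than a scoring method.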
- Research Article
- 10.4049/jimmunol.206.supp.107.14
- May 1, 2021
- The Journal of Immunology
In mouse, the immunoglobulin (IG) loci span several Mb of the genome and contain hundreds of repeated, highly homologous sets of variable (V), diversity (D), and joining (J) genes that recombine in B cells to produce an individual’s expressed Ab repertoire. Similar to what has been reported in humans, recent data has shown significant levels of haplotype variation in the IG loci between commonly used inbred mouse strains, challenging the assumption that the IG loci are conserved across all strains of mice. We are using this intra-strain diversity to develop new models for studying the role of IG germline variation on Ab repertoire dynamics and function, questions that remain difficult to address in outbred human populations and pre-existing animal models. Given the diversity and complexity within the IG loci, the development of effective mouse models first requires the characterization of intra-strain differences and the construction of high-quality reference assemblies for the IG loci in several representative strains. To address this problem, we are using the Pacific Biosciences SMRT sequencing to sequence BAC clones spanning the IGH, IGK, and IGL loci in NOD/ShiLtJ and BALB/cByJ strains. We are also gaining additional insight on intra-strain diversity by profiling the expressed IGM, IGK, and IGL repertoires of 18 commonly used laboratory mouse strains. We have used this data to guide the construction of congenic lines on the C57BL/6 background, carrying divergent IGH or IGK loci from BALB/cByJ or NOD/ShiLtJ, respectively. Together, our data show significant germline diversity in our new IG assemblies, as well as divergent Ab repertoires in common lab mouse strains.
- Research Article
- 10.1186/s13059-025-03718-z
- Aug 20, 2025
- Genome Biology
Tandem repeat copy number variations (TR-CNVs), structural variations (SVs), and short indels have been responsible for many diseases and traits, but no tools exist to distinguish and detect these variants. In this study, we developed a computational tool, TRsv, to distinguish and detect TR-CNVs, SVs, and short indels using long reads. In evaluation with simulated and real datasets, TRsv outperformed existing tools for detection of TR-CNVs and indels and performed equally well for detection of SVs. We demonstrated genome-wide detection of TR-CNVs, including variants associated with gene expression, disease, and quantitative traits, using 160 long-read whole-genome sequencing datasets and TRsv.
- Research Article
2
- 10.1002/ajp.70033
- Apr 1, 2025
- American journal of primatology
Rhesus and pigtail macaques are closely related and have similar social structures, yet differences in their behavior, socio-ecology, and personality have been observed, although not systematically documented. Given these differences, it is important to assess pigtail macaque cognition independently, rather than relying on rhesus macaque findings as a proxy. To gain a better understanding of pigtail macaque cognition, we used a battery of three cognitive tasks. Rhesus macaques were tested on the same tasks to validate our methods and to allow for comparison. Across just three tasks, we found significant differences between the two closely related species. In the three cups task, which tests short-term memory, both pigtail and rhesus macaques performed significantly better when they had to recall the location of a hidden food reward after a 0 s delay compared to a 15 s delay. However, in the 15 s delay condition, only rhesus macaques performed above chance levels, whereas pigtail macaques did not. In the reversal learning task, which tested rule learning and cognitive flexibility, we found species differences in learning performance. For the quantity discrimination task, which tests numerosity, we found that both rhesus and pigtail macaques were more accurate at discriminating "easy" ratios of foods (e.g., 1 vs. 5 or 2 vs. 6) than the "hard" ratios (e.g., 2 vs. 3 or 4 vs. 5). However, pigtail macaques were more accurate than rhesus macaques in the hard ratio trials. These findings contribute to a novel understanding of cognition in pigtail macaques while also increasing research rigor in translational research.
- Research Article
26
- 10.1007/s00251-002-0468-2
- Jun 14, 2002
- Immunogenetics
In order to facilitate molecular analysis of antibody responses in Rhesus monkeys (Macaca mulatta), we used PCR techniques to clone and sequence the germline IGHD gene repertoire and the IGHD7-IGHJ6 locus in its entirety. We identified 30 distinct Rhesus DH genes belonging to seven subgroups and their recombination signal sequences that together share an average of 91% identity with their human counterparts, six potentially functional IGHJ genes and their recombination signal sequences that together share 93% identity with their human counterparts, as well as a novel IGHJ gene, IGHJ5 beta, which is a duplicated variant of IGHJ5. The presence, on average, of one additional IGHD gene in Rhesus IGHD subgroups when compared with human and one additional IGHJ gene suggests Rhesus has undergone at least two independent duplications beyond those that mark the human IGHD/IGHJ locus. Amino acid sequence composition is highly conserved between Rhesus and human, with IGHD insertions and deletions limited to three-nucleotide multiples, which serve to preserve enrichment for tyrosine, glycine, and serine residues in IGHD reading frame 1. The high degree of conservation between human and Rhesus IGHD and IGHJ genes supports the hypothesis that the germline repertoire encodes evolutionarily preferred antibody sequence as a result of selection for function.
- Research Article
3
- 10.1186/s12864-024-10632-4
- Jul 19, 2024
- BMC Genomics
At the 3’ end of the C2 gene in the mammalian TRB locus, a distinct reverse TRBV30 gene (named TRBV31 in mice) has been conserved throughout evolution. In the fully annotated TRB locus of 14 mammals (including six orders), we observed noteworthy variations in the localization and quality of the reverse V30 genes and Recombination Signal Sequences (RSSs) in the gene trees of 13 mammals. Conversely, the forward V29 genes and RSSs were generally consistent with the species tree of their corresponding species. This finding suggested that the evolution of the reverse V30 gene was not synchronous and likely played a crucial role in regulating adaptive immune responses. To further investigate this possibility, we utilized single-cell TCR sequencing (scTCR-seq) and high-throughput sequencing (HTS) to analyze TCRβ CDR3 repertoires from both central and peripheral tissues of Primates (Homo sapiens and Macaca mulatta), Rodentia (Mus musculus: BALB/c, C57BL/6, and Kunming mice), Artiodactyla (Bos taurus and Bubalus bubalis), and Chiroptera (Rhinolophus affinis and Hipposideros armiger). Our investigation revealed several novel observations: (1) The reverse V30 gene exhibits classical rearrangement patterns adhering to the ‘12/23 rule’ and the ‘D-J rearrangement preceding the V-(D-J) rearrangement’. This results in the formation of rearranged V30-D2J2, V30-D1J1, and V30-D1J2. However, we also identified ‘special rearrangement patterns’ wherein V30-D rearrangement precedes D-J rearrangement, giving rise to rearranged V30-D2-J1 and forward Vx-D2-J. (2) Compared to the ‘deletional rearrangement’ (looping out) of forward V1-V29 genes, the reverse V30 gene exhibits preferential utilization with ‘inversional rearrangement’. This may be attributed to the shorter distance between the V30 gene and D gene and the ‘inversional rearrangement’ modes.
In summary, in the mammalian TRB locus, the reverse V30 gene has been uniquely preserved throughout evolution and preferentially utilized in V(D)J recombination, potentially serving a significant role in adaptive immunity. These results will pave the way for novel and specialized research into the mechanisms, efficiency, and function of V(D)J recombination in mammals.
- Front Matter
4
- 10.1093/bioinformatics/btz804
- Oct 30, 2019
- Bioinformatics
Summary: An effective immune system is characterized by a diverse immune repertoire. There is a strong demand for accurate and quantitative methods to assess the diversity of the immune repertoire for various (pre-)clinical applications, including the diagnosis and prognosis of primary immune deficiencies, or to assess the response to therapy. Current strategies for immune diversity assessment generally comprise the visual inspection of the length distribution of rearranged T- and B-cell receptors. Visual inspections, however, are prone to subjective assessments and thus lead to biases. Here, we introduce ImSpectR, a unified approach to quantify immunodiversity using either spectratype, repertoire sequencing or single cell RNA sequencing data. ImSpectR scores various types of deviations from the expected length distribution and integrates these into one measure, allowing for robust quantitative comparisons of immune diversity across individuals or conditions. Availability and implementation: The R package is available for download on GitHub at https://github.com/martijn-cordes/ImSpectR. Supplementary information: Supplementary data are available at Bioinformatics online.
- Research Article
5
- 10.1093/bioinformatics/btad775
- Jan 2, 2024
- Bioinformatics
Motivation: De novo variants (DNVs) are variants that are present in offspring but not in their parents. DNVs are important both for examining mutation rates and for identifying disease-related variation. While efforts have been made to call DNVs, calling DNVs from parent–child sequenced trio data is still challenging. We developed Hare And Tortoise (HAT) as an automated DNV detection workflow for highly accurate short-read and long-read sequencing data. Reliable detection of DNVs is important for human genomics and HAT addresses this need. Results: HAT is a computational workflow that begins with aligned read data (i.e. CRAM or BAM) from a parent–child sequenced trio and outputs DNVs. HAT detects high-quality DNVs from Illumina short-read whole-exome sequencing, Illumina short-read whole-genome sequencing, and highly accurate PacBio HiFi long-read whole-genome sequencing data. The quality of these DNVs is high based on a series of quality metrics including number of DNVs per individual, percent of DNVs at CpG sites, and percent of DNVs phased to the paternal chromosome of origin. Availability and implementation: https://github.com/TNTurnerLab/HAT
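The definition of a de novo variant given in the abstract above (present in the offspring but in neither parent) can be expressed as a short genotype filter. This Python sketch only encodes that definition; it is not the HAT workflow, which starts from aligned reads and applies its own calling and quality filtering. The function names, the genotype encoding (tuples of allele indices, 0 for reference), and the example sites are assumptions.

```python
def is_candidate_dnv(child, mother, father):
    """Return True if the child carries an allele seen in neither parent.

    Genotypes are tuples of allele indices, e.g. (0, 1) for a heterozygous
    site. A missing genotype (None) in any trio member yields False, since
    the site cannot be evaluated.
    """
    if child is None or mother is None or father is None:
        return False
    parental_alleles = set(mother) | set(father)
    return any(allele not in parental_alleles for allele in child)


def candidate_dnv_sites(trio_genotypes):
    """Filter a mapping of site -> (child, mother, father) genotypes."""
    return [
        site
        for site, (child, mother, father) in trio_genotypes.items()
        if is_candidate_dnv(child, mother, father)
    ]


# Hypothetical trio genotypes keyed by (chromosome, position).
example_trio = {
    ("chr1", 1000): ((0, 1), (0, 0), (0, 0)),  # alt allele only in child
    ("chr1", 2000): ((0, 1), (0, 1), (0, 0)),  # inherited from mother
    ("chr2", 3000): ((1, 1), (0, 1), (0, 1)),  # both alleles inherited
    ("chr2", 4000): ((0, 1), None, (0, 0)),    # mother genotype missing
}
example_candidates = candidate_dnv_sites(example_trio)
```

A filter this simple would flag every parental genotyping error as a de novo event, which is why real workflows add read-level evidence and quality metrics before reporting DNVs.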
- Research Article
- 10.1158/1538-7445.am2025-5056
- Apr 21, 2025
- Cancer Research
Somatic SVs significantly contribute to cancer development and progression. Characterizing these variations has traditionally been difficult due to the limitations of short-read sequencing technologies and the diverse types and lengths of SVs. The advent of long-read sequencing has enabled a more comprehensive analysis of germline SVs, highlighting its potential applications in cancer genomics. The melanoma cell line COLO829, along with its normal counterpart COLO829BL, is a standard reference for somatic SV detection. Valle-Inclan et al. recently validated 68 somatic SVs combining short-read, long-read, and linked-read sequencing data. However, the sensitivity may be compromised as only a limited number of alignment-based callers were employed, such as NanoSV, Sniffles (for Oxford Nanopore Technologies (ONT)), and pbsv (for PacBio HiFi). As more alignment-based SV callers are being developed, it is crucial to integrate a broader range of tools to construct a comprehensive call set of somatic SVs. Moreover, the performance of assembly-based methods in somatic SV detection within cancer remains largely unexplored. In this study, we aimed to establish a comprehensive spectrum of somatic SVs in the COLO829 melanoma cell line by employing both alignment-based and assembly-based methods with long-read whole-genome sequencing data from PacBio and ONT. After alignment with Minimap2, SVs were identified and integrated from 7 alignment-based callers: cuteSV, NanoVar, DeBreak, Sniffles2, Svision-Pro, NanoSV, and pbsv. In addition, genome assemblies were constructed via Hifiasm and Verkko, incorporating HiFi reads with Hi-C and ultra-long ONT integration, followed by SV detection using four assembly-based callers: Dipcall, PAV, SVIM-asm, and cuteSV. A dedicated pipeline was developed to identify and filter somatic SVs from germline events, which considers mapping quality, sequencing depth of SV regions and their flanking regions, and 50% reciprocal overlap.
Compared to the truth set established by Valle-Inclan et al., we discovered 19 novel somatic SVs, including 10 deletions, 4 insertions, 3 duplications, and 2 inversions. By comparison, assembly-based methods detected all the deletions, insertions and duplications found by the alignment-based methods. However, they missed 20 inversions and 6 translocations. Our findings highlight the superior sensitivity of alignment-based techniques in somatic SV detection and underscore the complementary nature of assembly-based approaches. This work enhances the benchmark for somatic SV characterization in cancer genomics and confirms the utility of integrated long-read sequencing analyses. Citation Format: Zishan Peng, Zechen Chong. Full spectrum of somatic structural variations (SVs) detection in COLO829 with long-read sequencing [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2025; Part 1 (Regular Abstracts); 2025 Apr 25-30; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2025;85(8_Suppl_1):Abstract nr 5056.
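The 50% reciprocal overlap criterion mentioned in the abstract above can be sketched in a few lines of Python. The abstract does not describe how the authors' pipeline applies it, so the usage here (treating a tumor SV as germline when it reciprocally overlaps a call from the matched normal by at least 50%) is an assumption, as are the function names, the half-open coordinate convention, and the example intervals.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Smaller of the two overlap fractions between half-open intervals
    [a_start, a_end) and [b_start, b_end); 0.0 if they do not overlap."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def matches_any(sv, other_svs, min_fraction=0.5):
    """True if `sv` reciprocally overlaps any interval in `other_svs`
    by at least `min_fraction` (both given as (start, end) tuples)."""
    return any(
        reciprocal_overlap(sv[0], sv[1], other[0], other[1]) >= min_fraction
        for other in other_svs
    )


# Hypothetical same-chromosome, same-type SV calls, for illustration only.
tumor_svs = [(100, 200), (1000, 1500), (5000, 5100)]
normal_svs = [(150, 250), (1400, 3000)]

# Calls in the tumor with no sufficiently overlapping call in the normal.
somatic_candidates = [sv for sv in tumor_svs if not matches_any(sv, normal_svs)]
```

Requiring the overlap fraction to hold for both intervals prevents a short call nested inside a much larger one from being counted as the same event, which a one-sided overlap test would allow.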
- Research Article
9
- 10.1186/s13073-024-01412-6
- Nov 25, 2024
- Genome Medicine
Background: Multidrug-resistant organisms (MDRO) pose a significant threat to public health worldwide. The ability to identify antimicrobial resistance determinants, to assess changes in molecular types, and to detect transmission are essential for surveillance and infection prevention of MDRO. Molecular characterization based on long-read sequencing has emerged as a promising alternative to short-read sequencing. The aim of this study was to characterize MDRO for surveillance and transmission studies based on long-read sequencing only. Methods: Genomic DNA of 356 MDRO was automatically extracted using the Maxwell-RSC48. The MDRO included 106 Klebsiella pneumoniae isolates, 85 Escherichia coli, 15 Enterobacter cloacae complex, 10 Citrobacter freundii, 34 Pseudomonas aeruginosa, 16 Acinetobacter baumannii, and 69 methicillin-resistant Staphylococcus aureus (MRSA), of which 24 were from an outbreak. MDRO were sequenced using both short-read (Illumina NextSeq 550) and long-read (Nanopore Rapid Barcoding Kit-24-V14, R10.4.1) whole-genome sequencing (WGS). Basecalling was performed for two distinct models using Dorado-0.3.2 duplex mode. Long-read data was assembled using Flye, Canu, Miniasm, Unicycler, Necat, Raven, and Redbean assemblers. Long-read WGS data with > 40 × coverage was used for multi-locus sequence typing (MLST), whole-genome MLST (wgMLST), whole-genome single-nucleotide polymorphisms (wgSNP), in silico multiple locus variable-number of tandem repeat analysis (iMLVA) for MRSA, and identification of resistance genes (ABRicate). Results: Comparison of wgMLST profiles based on long-read and short-read WGS data revealed > 95% of wgMLST profiles within the species-specific cluster cut-off, except for P. aeruginosa. The wgMLST profiles obtained by long-read and short-read WGS differed by only one to nine wgMLST alleles or SNPs for K. pneumoniae, E. coli, E. cloacae complex, C. freundii, A. baumannii complex, and MRSA. For P. aeruginosa, differences were up to 27 wgMLST alleles between long-read and short-read wgMLST and 0–10 SNPs. MLST sequence types and iMLVA types were concordant between long-read and short-read WGS data and conventional MLVA typing. Antimicrobial resistance genes were detected in long-read sequencing data with high sensitivity/specificity (92–100%/99–100%). Long-read sequencing enabled analysis of an MRSA outbreak. Conclusions: We demonstrate that molecular characterization of automatically extracted DNA followed by long-read sequencing is as accurate as short-read sequencing and suitable for typing and outbreak analysis as part of genomic surveillance of MDRO. However, the analysis of P. aeruginosa requires further improvement, which may be obtained by other basecalling algorithms. The low implementation costs and rapid library preparation for long-read sequencing of MDRO extend its applicability to resource-constrained settings and low-income countries worldwide.