
A review of the pangenome: how it affects our understanding of genomic variation, selection and breeding in domestic animals?

Abstract

As large-scale genomic studies have progressed, it has become clear that a single reference genome cannot represent genetic diversity at the species level. Domestic animals tend to have complex routes of origin and migration, which suggests that some population-specific sequences may be missing from the current reference genomes. By contrast, the pangenome is the collection of all DNA sequences of a species: it contains the sequences shared by all individuals (core genome) and also captures sequence information unique to individuals (variable genome). Progress in pangenome research in humans, plants and domestic animals has shown that pangenomic studies can recover these missing genetic components and identify large structural variants (SVs). Many individual-specific sequences have been shown to be related to biological adaptability, phenotype and important economic traits. The maturation of technologies and methods such as third-generation sequencing, telomere-to-telomere genomes, graph genomes and reference-free assembly will further promote the development of pangenome research. In the future, pangenomes combined with long-read data and multi-omics will help to resolve large SVs and their relationship with the main economic traits of interest in domesticated animals, providing better insights into animal domestication, evolution and breeding. In this review, we mainly discuss how pangenome analysis reveals genetic variation in domestic animals (sheep, cattle, pigs, chickens), how this variation affects phenotypes, and how it can contribute to the understanding of species diversity. We also discuss potential issues and future perspectives of pangenome research in livestock and poultry.

Similar Papers
  • Research Article
  • Citations: 32
  • 10.1186/s13073-024-01412-6
Genomic surveillance of multidrug-resistant organisms based on long-read sequencing
  • Nov 25, 2024
  • Genome Medicine
  • Fabian Landman + 59 more

Background: Multidrug-resistant organisms (MDRO) pose a significant threat to public health worldwide. The ability to identify antimicrobial resistance determinants, to assess changes in molecular types, and to detect transmission are essential for surveillance and infection prevention of MDRO. Molecular characterization based on long-read sequencing has emerged as a promising alternative to short-read sequencing. The aim of this study was to characterize MDRO for surveillance and transmission studies based on long-read sequencing only. Methods: Genomic DNA of 356 MDRO was automatically extracted using the Maxwell-RSC48. The MDRO included 106 Klebsiella pneumoniae isolates, 85 Escherichia coli, 15 Enterobacter cloacae complex, 10 Citrobacter freundii, 34 Pseudomonas aeruginosa, 16 Acinetobacter baumannii, and 69 methicillin-resistant Staphylococcus aureus (MRSA), of which 24 were from an outbreak. MDRO were sequenced using both short-read (Illumina NextSeq 550) and long-read (Nanopore Rapid Barcoding Kit-24-V14, R10.4.1) whole-genome sequencing (WGS). Basecalling was performed for two distinct models using Dorado-0.3.2 duplex mode. Long-read data was assembled using Flye, Canu, Miniasm, Unicycler, Necat, Raven, and Redbean assemblers. Long-read WGS data with > 40 × coverage was used for multi-locus sequence typing (MLST), whole-genome MLST (wgMLST), whole-genome single-nucleotide polymorphisms (wgSNP), in silico multiple locus variable-number of tandem repeat analysis (iMLVA) for MRSA, and identification of resistance genes (ABRicate). Results: Comparison of wgMLST profiles based on long-read and short-read WGS data revealed > 95% of wgMLST profiles within the species-specific cluster cut-off, except for P. aeruginosa. The wgMLST profiles obtained by long-read and short-read WGS differed by only one to nine wgMLST alleles or SNPs for K. pneumoniae, E. coli, E. cloacae complex, C. freundii, A. baumannii complex, and MRSA. For P. aeruginosa, differences were up to 27 wgMLST alleles between long-read and short-read wgMLST and 0–10 SNPs. MLST sequence types and iMLVA types were concordant between long-read and short-read WGS data and conventional MLVA typing. Antimicrobial resistance genes were detected in long-read sequencing data with high sensitivity/specificity (92–100%/99–100%). Long-read sequencing enabled analysis of an MRSA outbreak. Conclusions: We demonstrate that molecular characterization of automatically extracted DNA followed by long-read sequencing is as accurate as short-read sequencing and suitable for typing and outbreak analysis as part of genomic surveillance of MDRO. However, the analysis of P. aeruginosa requires further improvement, which may be obtained by other basecalling algorithms. The low implementation costs and rapid library preparation for long-read sequencing of MDRO extend its applicability to resource-constrained settings and low-income countries worldwide.

  • Research Article
  • Citations: 7
  • 10.1111/jbg.12846
Genome-wide detection of structural variation in some sheep breeds using whole-genome long-read sequencing data.
  • Jan 21, 2024
  • Journal of Animal Breeding and Genetics
  • Guoyan Qiao + 5 more

Genomic structural variants (SVs) constitute a significant proportion of genetic variation in the genome. The rapid development of long-read sequencing has facilitated the detection of long-fragment SVs. No published study has yet detected SVs in sheep using long-read data. We applied a long-read mapping approach to detect SVs and characterized a total of 30,771 insertions, deletions, inversions and translocations. We identified 716, 916, 842 and 303 specific SVs in Southdown sheep, Alpine Merino sheep, Qilian White Tibetan sheep and Oula sheep, respectively. We annotated these SVs and found that the SV-related genes were primarily enriched in well-established pathways involved in the regulation of the immune system, growth and development, and environmental adaptability. To validate the accuracy of the third-generation detection, we also detected and annotated SVs based on NGS resequencing data. Moreover, five candidate SVs were verified using PCR in 50 sheep. Our study is the first to use a long-read sequencing approach to construct a novel structural variation map in sheep. We have completed a preliminary exploration of the potential effects of SVs on sheep.

  • Research Article
  • Citations: 169
  • 10.1186/s12864-015-1479-3
Assessing structural variation in a personal genome-towards a human reference diploid genome.
  • Apr 11, 2015
  • BMC Genomics
  • Adam C English + 30 more

Background: Characterizing large genomic variants is essential to expanding the research and clinical applications of genome sequencing. While multiple data types and methods are available to detect these structural variants (SVs), they remain less characterized than smaller variants because of SV diversity, complexity, and size. These challenges are exacerbated by the experimental and computational demands of SV analysis. Here, we characterize the SV content of a personal genome with Parliament, a publicly available consensus SV-calling infrastructure that merges multiple data types and SV detection methods. Results: We demonstrate Parliament’s efficacy via integrated analyses of data from whole-genome array comparative genomic hybridization, short-read next-generation sequencing, long-read (Pacific BioSciences RSII), long-insert (Illumina Nextera), and whole-genome architecture (BioNano Irys) data from the personal genome of a single subject (HS1011). From this genome, Parliament identified 31,007 genomic loci between 100 bp and 1 Mbp that are inconsistent with the hg19 reference assembly. Of these loci, 9,777 are supported as putative SVs by hybrid local assembly, long-read PacBio data, or multi-source heuristics. These SVs span 59 Mbp of the reference genome (1.8%) and include 3,801 events identified only with long-read data. The HS1011 data and complete Parliament infrastructure, including a BAM-to-SV workflow, are available on the cloud-based service DNAnexus. Conclusions: HS1011 SV analysis reveals the limits and advantages of multiple sequencing technologies, specifically the impact of long-read SV discovery. With the full Parliament infrastructure, the HS1011 data constitute a public resource for novel SV discovery, software calibration, and personal genome structural variation analysis. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-015-1479-3) contains supplementary material, which is available to authorized users.

  • Research Article
  • Citations: 3
  • 10.1093/gbe/evaf173
Integrative Genotyping and Analysis of Canine Structural Variation Using Long-read and Short-read Data
  • Sep 13, 2025
  • Genome Biology and Evolution
  • Peter Z Schall + 1 more

Structural variation makes an important contribution to canine evolution and phenotypic differences. Although recent advances in long-read sequencing have enabled the generation of multiple canine genome assemblies, most prior analyses of structural variation have relied on short-read sequencing. To offer a more complete assessment of structural variation in canines, we performed an integrative analysis of structural variants present in 12 canine samples with available long-read and short-read sequencing data along with genome assemblies. Use of long reads permits the discovery of heterozygous variation that is absent in existing haploid assembly representations while offering a marked increase in the ability to identify insertion variants relative to short-read approaches. Examination of the size spectrum of structural variants shows that dimorphic LINE-1 and SINE variants account for over 45% of all deletions and identifies 1,410 LINE-1s with intact open reading frames that show presence–absence dimorphism. Using a graph-based approach, we genotype newly discovered structural variants in an existing collection of 1,879 resequenced dogs and wolves, generating a variant catalog containing a 56.5% increase in the number of deletions and 705% increase in the number of insertions previously found in the analyzed samples. Examination of allele frequencies across admixture components present across breed clades identified 283 structural variants evolving with a signature of selection.

  • Research Article
  • Citations: 1
  • 10.1038/s41598-025-28283-0
Dark and camouflaged genomic regions remain challenging in CHM13
  • Jan 12, 2026
  • Scientific Reports
  • Mark E Wadsworth + 5 more

Comprehensive genomic analysis is essential for advancing our understanding of human genetics and disease. However, short-read sequencing technologies are inherently limited in their ability to resolve highly repetitive, structurally complex, and low-mappability genomic regions, previously coined as “dark” regions. Long-read sequencing technologies, such as PacBio and Oxford Nanopore Technologies (ONT), offer improved resolution of these regions, yet they are not perfect. With the advent of the new Telomere-to-Telomere (T2T) CHM13 reference genome, exploring its effect on dark regions is prudent. In this study, we systematically analyze dark regions across four human genome references—HG19, HG38 (with and without alternate contigs), and CHM13—using both short- and long-read sequencing data. We found that dark regions increase as the reference becomes more complete, especially dark-by-MAPQ regions, but that long-read sequencing significantly reduces the number of dark regions in the genome, particularly within gene bodies. However, we identify potential alignment challenges in long-read data, such as centromeric regions. These findings highlight the importance of both reference genome selection and sequencing technology choice in achieving a truly comprehensive genomic analysis. Supplementary Information: The online version contains supplementary material available at 10.1038/s41598-025-28283-0.

  • Research Article
  • Citations: 1
  • 10.1101/2025.05.23.655776
Sequencing the gaps: dark genomic regions persist in CHM13 despite long-read advances
  • May 28, 2025
  • bioRxiv
  • Mark E Wadsworth + 5 more

Comprehensive genomic analysis is essential for advancing our understanding of human genetics and disease. However, short-read sequencing technologies are inherently limited in their ability to resolve highly repetitive, structurally complex, and low-mappability genomic regions, previously coined as “dark” regions. Long-read sequencing technologies, such as PacBio and Oxford Nanopore Technologies (ONT), offer improved resolution of these regions, yet they are not perfect. With the advent of the new Telomere-to-Telomere (T2T) CHM13 reference genome, exploring its effect on dark regions is prudent. In this study, we systematically analyze dark regions across four human genome references—HG19, HG38 (with and without alternate contigs), and CHM13—using both short- and long-read sequencing data. We found that dark regions increase as the reference becomes more complete, especially dark-by-MAPQ regions, but that long-read sequencing significantly reduces the number of dark regions in the genome, particularly within gene bodies. However, we identify potential alignment challenges in long-read data, such as centromeric regions. These findings highlight the importance of both reference genome selection and sequencing technology choice in achieving a truly comprehensive genomic analysis.

  • Research Article
  • Citations: 1
  • 10.3389/fgene.2024.1435087
HapKled: a haplotype-aware structural variant calling approach for Oxford nanopore sequencing data.
  • Jul 9, 2024
  • Frontiers in genetics
  • Zhendong Zhang + 5 more

Introduction: Structural Variants (SVs) are a type of variation that can significantly influence phenotypes and cause diseases. Thus, the accurate detection of SVs is a vital part of modern genetic analysis. The advent of long-read sequencing technology ushers in a new era of more accurate and comprehensive SV calling, and many tools have been developed to call SVs using long-read data. Haplotype-tagging is a procedure that can tag haplotype information on reads and can thus potentially improve the SV detection; nevertheless, few methods make use of this information. In this article, we introduce HapKled, a new SV detection tool that can accurately detect SVs from Oxford Nanopore Technologies (ONT) long-read alignment data. Methods: HapKled utilizes haplotype information underlying alignment data by conducting haplotype-tagging using Whatshap on the reads to improve the detection performance, with three unique calling mechanics including altering clustering conditions according to haplotype information of signatures, determination of similar SVs based on haplotype information, and slack filtering conditions based on haplotype quality. Results: In our evaluations, HapKled outperformed state-of-the-art tools and can deliver better SV detection results on both simulated and real sequencing data. The code and experiments of HapKled can be obtained from https://github.com/CoREse/HapKled. Discussion: With the superb SV detection performance that HapKled can deliver, HapKled could be useful in bioinformatics research, clinical diagnosis, and medical research and development.

  • Research Article
  • 10.18805/ijar.b-1166
Advances of MSTN Genetic Markers in Domesticated Animals
  • Aug 31, 2020
  • Indian Journal of Animal Research
  • Cheng-Li Liu + 8 more

The myostatin (MSTN) gene is a negative regulator of animal muscle growth and development. This gene not only inhibits muscle cell growth and reduces fat accumulation but also exerts a significant effect on back fat thickness, birth weight and carcass traits. MSTN gene mutation, an important factor that influences economic traits, directly affects the growth and development of animals and consequently the quality of animal products. This paper reviews the structural and functional characteristics of the MSTN gene. The genetic variation of the MSTN gene is then compared among four domestic animals (cattle, sheep, goat and pig) and its correlation with important economic traits is analysed. The mechanism and structural characteristics of MSTN gene mutants are further discussed. This paper explains the application of MSTN gene research in breeding and its importance to the advancement of animal husbandry.

  • Front Matter
  • Citations: 9
  • 10.1111/mec.16884
Long-read sequencing in ecology and evolution: Understanding how complex genetic and epigenetic variants shape biodiversity.
  • Mar 1, 2023
  • Molecular Ecology
  • Dan G Bock + 3 more

Ten years ago, the journal Molecular Ecology published a “road map” paper that reviewed past achievements in the discipline of molecular ecology, identified research challenges and charted a way forward (Andrew et al., 2013). That paper was motivated by a symposium organized during the First Joint Congress on Evolutionary Biology (Ottawa, July 6–10, 2012). In addition, it occurred on the heels of a major inflection point in molecular ecology and in life sciences more broadly: the development and uptake of “next”- or “second”-generation sequencing technologies, which deliver short DNA reads (typically shorter than 400 bp) at very high throughput (e.g., several billion reads per run; Goodwin et al., 2016). As such, Andrew et al. (2013) emphasized the promise of second-generation sequencing for diverse subdisciplines of molecular ecology such as phylogeography, landscape genomics, molecular adaptation and speciation. Representing more than just a technical advancement, second-generation sequencing was predicted to stimulate rapid conceptual breakthroughs in the field, especially in nonmodel species (Stapley et al., 2010; Tautz et al., 2010). As illustrated by any recent issue in the Molecular Ecology journal, these predictions were accurate.

  • Preprint Article
  • 10.1158/0008-5472.c.6514322.v1
Data from Gene Fusion Detection and Characterization in Long-Read Cancer Transcriptome Sequencing Data with FusionSeeker
  • Mar 31, 2023
  • Yu Chen + 6 more

Gene fusions are prevalent in a wide array of cancer types with different frequencies. Long-read transcriptome sequencing technologies, such as PacBio, Iso-Seq, and Nanopore direct RNA sequencing, provide full-length transcript sequencing reads, which could facilitate detection of gene fusions. In this work, we developed a method, FusionSeeker, to comprehensively characterize gene fusions in long-read cancer transcriptome data and reconstruct accurate fused transcripts from raw reads. FusionSeeker identified gene fusions in both exonic and intronic regions, allowing comprehensive characterization of gene fusions in cancer transcriptomes. Fused transcript sequences were reconstructed with FusionSeeker by correcting sequencing errors in the raw reads through partial order alignment algorithm. Using these accurate transcript sequences, FusionSeeker refined gene fusion breakpoint positions and predicted breakpoints at single bp resolution. Overall, FusionSeeker will enable users to discover gene fusions accurately using long-read data, which can facilitate downstream functional analysis as well as improved cancer diagnosis and treatment. Significance: FusionSeeker is a new method to discover gene fusions and reconstruct fused transcript sequences in long-read cancer transcriptome sequencing data to help identify novel gene fusions important for tumorigenesis and progression.

  • Preprint Article
  • 10.1158/0008-5472.c.6514322
Data from Gene Fusion Detection and Characterization in Long-Read Cancer Transcriptome Sequencing Data with FusionSeeker
  • Mar 31, 2023
  • Yu Chen + 6 more

Gene fusions are prevalent in a wide array of cancer types with different frequencies. Long-read transcriptome sequencing technologies, such as PacBio, Iso-Seq, and Nanopore direct RNA sequencing, provide full-length transcript sequencing reads, which could facilitate detection of gene fusions. In this work, we developed a method, FusionSeeker, to comprehensively characterize gene fusions in long-read cancer transcriptome data and reconstruct accurate fused transcripts from raw reads. FusionSeeker identified gene fusions in both exonic and intronic regions, allowing comprehensive characterization of gene fusions in cancer transcriptomes. Fused transcript sequences were reconstructed with FusionSeeker by correcting sequencing errors in the raw reads through partial order alignment algorithm. Using these accurate transcript sequences, FusionSeeker refined gene fusion breakpoint positions and predicted breakpoints at single bp resolution. Overall, FusionSeeker will enable users to discover gene fusions accurately using long-read data, which can facilitate downstream functional analysis as well as improved cancer diagnosis and treatment. Significance: FusionSeeker is a new method to discover gene fusions and reconstruct fused transcript sequences in long-read cancer transcriptome sequencing data to help identify novel gene fusions important for tumorigenesis and progression.

  • Abstract
  • 10.1016/j.gim.2022.01.180
EP144: Long-read genome sequencing secondary processing pipelines provide variant call accuracy that exceeds current clinical standards for short-read genome sequencing
  • Mar 1, 2022
  • Genetics in Medicine
  • James Holt + 6 more


  • Research Article
  • Citations: 75
  • 10.1038/s41467-019-12174-w
Long-read assembly of the Chinese rhesus macaque genome and identification of ape-specific structural variants
  • Sep 17, 2019
  • Nature Communications
  • Yaoxi He + 12 more

We present a high-quality de novo genome assembly (rheMacS) of the Chinese rhesus macaque (Macaca mulatta) using long-read sequencing and multiplatform scaffolding approaches. Compared to the current Indian rhesus macaque reference genome (rheMac8), rheMacS increases sequence contiguity 75-fold, closing 21,940 of the remaining assembly gaps (60.8 Mbp). We improve gene annotation by generating more than two million full-length transcripts from ten different tissues by long-read RNA sequencing. We sequence resolve 53,916 structural variants (96% novel) and identify 17,000 ape-specific structural variants (ASSVs) based on comparison to ape genomes. Many ASSVs map within ChIP-seq predicted enhancer regions where apes and macaque show diverged enhancer activity and gene expression. We further characterize a subset that may contribute to ape- or great-ape-specific phenotypic traits, including taillessness, brain volume expansion, improved manual dexterity, and large body size. The rheMacS genome assembly serves as an ideal reference for future biomedical and evolutionary studies.

  • Research Article
  • Citations: 3
  • 10.1093/nargab/lqae152
VILOCA: sequencing quality-aware viral haplotype reconstruction and mutation calling for short-read and long-read data.
  • Sep 28, 2024
  • NAR genomics and bioinformatics
  • Lara Fuhrmann + 3 more

RNA viruses exist as large heterogeneous populations within their host. The structure and diversity of virus populations affect disease progression and treatment outcomes. Next-generation sequencing allows detailed viral population analysis, but inferring diversity from error-prone reads is challenging. Here, we present VILOCA (VIral LOcal haplotype reconstruction and mutation CAlling for short and long read data), a method for mutation calling and reconstruction of local haplotypes from short- and long-read viral sequencing data. Local haplotypes refer to genomic regions that have approximately the length of the input reads. VILOCA recovers local haplotypes by using a Dirichlet process mixture model to cluster reads around their unobserved haplotypes and leveraging quality scores of the sequencing reads. We assessed the performance of VILOCA in terms of mutation calling and haplotype reconstruction accuracy on simulated and experimental Illumina, PacBio and Oxford Nanopore data. On simulated and experimental Illumina data, VILOCA performed better than or similarly to existing methods. On the simulated long-read data, VILOCA is able to recover on average [Formula: see text] of the ground truth mutations with perfect precision compared to only [Formula: see text] recall and [Formula: see text] precision of the second-best method. In summary, VILOCA provides significantly improved accuracy in mutation and haplotype calling, especially for long-read sequencing data, and therefore facilitates the comprehensive characterization of heterogeneous within-host viral populations.

  • Research Article
  • 10.1038/s41598-025-06096-5
Feasibility of long-read sequencing to identify molecular alterations in an Indonesian cohort of locally advanced to advanced nasopharyngeal cancer
  • Jul 1, 2025
  • Scientific Reports
  • Handoko + 8 more

Nasopharyngeal carcinoma (NPC) is prevalent in Southeast Asia, particularly in Indonesia. Despite advances in treatment, patients with advanced NPC face poor outcomes. Examining the NPC mutational landscape is crucial for understanding its biology and enabling potential new therapeutic strategies. This study aimed to characterize the landscape of single nucleotide variants (SNVs), structural variants (SVs), copy number variations (CNVs), and short tandem repeats (STRs) in locally advanced to advanced NPC within an Indonesian cohort using long-read sequencing. Six fresh-frozen nasopharyngeal biopsy samples were collected from the NPC biobank. DNA was extracted and sequenced using Oxford Nanopore’s PromethION 2 Solo long-read sequencer. Structural and small variants were identified and annotated. The SNVs, SVs, and CNVs were categorized based on predicted effects, and key findings were validated using external RNA-seq data. Genes with copy number loss were checked against the Tumour Suppressor Gene database (TSGene v2.0). Genetic findings were correlated with patient clinical histories. Approximately 4.4 to 5.1 million SNVs were identified per sample, with 0.023% categorized as high consequence. Notable tumour suppressor genes, such as LIMD1 and CNDP2, were frequently mutated. Around 30,000 to 41,599 SVs were detected per sample. High-consequence tumour suppressor gene SVs were identified in the EPHA3, CASP8, DMBT1, ZFHX3, and IRF5 genes. Common copy number losses of tumour suppressor genes were observed in RNH1, H19, CDKN1C, and others, suggesting their role in NPC carcinogenesis. Copy number gains were found in potential oncogenes such as Y RNA, LTO1, and FADD. Pathogenic short tandem repeats (STRs) in PABPN1 and RFC1 were identified in three samples, presenting a novel association with NPC. The NPC sample that exhibited significant genomic instability had the shortest survival, potentially linked to multiple defective DNA repair genes.
This study utilized long-read sequencing to identify a complex spectrum of genetic alterations, including numerous SVs and potentially pathogenic STRs, in Indonesian NPC. Extensive DNA repair gene defects, primarily complex SVs detectable by long reads, were observed and were quite possibly associated with poor survival. These findings underscore the potential of long-read sequencing for uncovering clinically relevant mutations in NPC.
