Leveraging large family analyses for more accurate de novo mutation detection.
De novo mutations (DNMs), which arise in the offspring and are absent in the parents, are increasingly studied in farm animals with the advent of whole-genome sequencing (WGS). Variant calling after genome sequencing is a crucial step in modern genomics, and its accuracy directly influences subsequent genetic analyses, which are vital not only in breeding and human healthcare but also in functional genomic research. Yet analyses restricted to trios neglect the information shared across the wider family and lead to an inaccurate determination of DNMs. Here, we show that inheritance-based analysis of WGS data from a larger family is an effective way to identify DNMs in offspring accurately, and we present the first such study in rabbits.
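As a rough illustration of why relatives beyond the trio help, the sketch below classifies one candidate site from called genotypes. It is not the authors' pipeline: the function names, the category labels, and the encoding of genotypes as allele-index tuples (0 = reference, 1 = alternate) are assumptions made for this example, and every listed individual is assumed to have a genotype call.

```python
HOM_REF = (0, 0)


def is_trio_candidate(child, sire, dam):
    """Naive trio rule: heterozygous offspring, both parents homozygous reference."""
    return sire == HOM_REF and dam == HOM_REF and sorted(child) == [0, 1]


def classify_candidate(genotypes, child, sire, dam, non_descendants=(), progeny=()):
    """Classify one site using relatives outside the trio (hypothetical labels).

    genotypes       -- dict mapping individual ID to an allele-index tuple
    non_descendants -- relatives that cannot have inherited the allele from
                       the focal offspring (e.g. siblings, grandparents)
    progeny         -- offspring of the focal individual
    """
    if not is_trio_candidate(genotypes[child], genotypes[sire], genotypes[dam]):
        return "not_candidate"
    # The alternate allele in a non-descendant suggests a missed parental
    # genotype or an artifact (parental germline mosaicism is another option).
    if any(1 in genotypes[r] for r in non_descendants):
        return "shared_with_relatives"
    # Transmission to the next generation supports a genuine germline DNM.
    if any(1 in genotypes[p] for p in progeny):
        return "supported_by_transmission"
    return "trio_only"
```

A call that stays `trio_only` rests on three genotypes alone, whereas the other two outcomes use exactly the shared family information that a trio-only design discards.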
- Research Article
- 10.1093/bib/bbaf543
- Nov 1, 2025
- Briefings in bioinformatics
De novo mutations (DNMs) are genetic alterations that occur for the first time in an offspring. DNMs have been found to be a significant cause of severe developmental disorders. With the widespread use of next-generation sequencing (NGS) technologies, accurate detection of DNMs is crucial. Several bioinformatics tools have been developed to call DNMs from NGS data, but no study to date has systematically compared these tools. We used both real whole genome sequencing (WGS) data from a trio from the 1000 Genomes Project (1000G) and an in-house simulated trio dataset to evaluate five DNM calling tools: DeNovoGear, TrioDeNovo, PhaseByTransmission, VarScan 2, and DeNovoCNN. For DNMs called in the real dataset, we observed 8.4% concordance of variants between all tools, while 83.8% of DNM variants were identified by only one caller. For the simulated trio WGS dataset spiked with 100 DNMs, the concordance rate was also low at 3.9%. DeNovoGear achieved the highest F1 score on the real 1000G dataset, while DeNovoCNN had the highest F1 score on the simulated data. Our study provides valuable recommendations for the selection and application of DNM callers on WGS trio data.
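The two headline metrics in this abstract, concordance across callers and the F1 score against a truth set, can be sketched as follows. This is a minimal illustration rather than the study's evaluation code: it assumes each call is reduced to a hashable key such as `(chrom, pos, ref, alt)`, and the function names are hypothetical.

```python
from collections import Counter


def caller_concordance(calls_by_caller):
    """Return (fraction of variants called by every caller,
    fraction called by exactly one caller) over the union of all calls."""
    counts = Counter(v for calls in calls_by_caller.values() for v in set(calls))
    total = len(counts)
    if total == 0:
        return 0.0, 0.0
    n_callers = len(calls_by_caller)
    shared_by_all = sum(1 for c in counts.values() if c == n_callers)
    singletons = sum(1 for c in counts.values() if c == 1)
    return shared_by_all / total, singletons / total


def f1_score(called, truth):
    """Harmonic mean of precision and recall for one caller against a truth set."""
    called, truth = set(called), set(truth)
    true_positives = len(called & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(called)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)
```

For a simulated dataset the truth set would be the spiked-in DNMs; for real data it has to come from an independent validation set, which is one reason the two rankings can differ.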
- Research Article
27
- 10.1161/circgenetics.113.000085
- Jul 14, 2013
- Circulation: Cardiovascular Genetics
Rapid advances in DNA sequencing technologies have made it increasingly cost-effective to obtain accurate and timely large-scale genomic sequence data on individuals (short read massively parallel or next generation [next-gen]). A next-gen molecular diagnostic approach that has seen rapid deployment in the clinic over the last year is exome sequencing. Whole exome sequencing covers all protein-coding genes in the genome (≈1.1% of genome), and an exome test for a single patient generates ≈6 gigabases (10^9 bp) of DNA sequence data. A key challenge facing routine use of next-gen data in patient diagnosis and management is data interpretation. What sequence variant findings are relevant to diagnosis (pathogenic mutations)? What sequence variant findings are relevant to clinical care but not necessarily to patient diagnosis (clinically actionable incidental data)? What sequence information should be stored, and where can it be stored? This review provides a tutorial on current approaches to answering these questions. A recent landmark study showed that application of next-gen sequencing to a large cohort of idiopathic dilated cardiomyopathy patients found ≈27% of patients to show mutations of the titin gene, the most complex gene in the genome (363 exons). We use titin in cardiomyopathy as an exemplar for explaining next-gen sequencing approaches and data interpretation. Decreasing sequencing costs and broad dissemination of next-generation (next-gen) equipment and expertise are increasing availability of massively parallel sequencing of patient DNA samples (short read massively parallel or next-gen sequencing) [1,2]. Most rapidly expanding is exome sequencing, where all protein-coding sequences (exons) are selected from total genomic DNA and selectively sequenced [3]. Alternative approaches to next-gen sequencing include targeted sequencing (TS) and whole genome (complete genome) sequencing.
Currently, marketed targeted Sanger sequencing panels using traditional individual exon-by-exon sequencing remain expensive and time consuming, and massively parallel next-gen approaches are beginning to supplant …
- Research Article
12
- 10.1007/s40484-016-0067-0
- May 1, 2016
- Quantitative Biology
Genome sequencing has improved fundamentally since next‐generation sequencing (NGS) emerged in the 2000s. The newer technologies combine massively‐parallel short‐read DNA sequencing with genome alignment and assembly methods to interrogate genomes rapidly and on an unprecedented scale, making large‐scale whole genome sequencing (WGS) accessible and practical for researchers. Whole genome sequencing is now increasingly used to study the genetics of diseases, investigate causative relations with cancers, perform genome‐level comparative analyses, reconstruct human population history, and inform clinical practice. In this review, we first describe a typical whole genome sequencing pipeline, including lab template preparation, sequencing, genome assembly and quality control, variant calling, and annotation. We compare whole genome and whole exome sequencing (WES), and explore a wide range of applications of whole genome sequencing for both Mendelian and complex diseases in medical genetics. We highlight the impact of whole genome sequencing on cancer studies, regulatory variant analysis, predictive medicine and precision medicine, and discuss the challenges of whole genome sequencing.
- Research Article
11
- 10.1186/s12919-016-0040-y
- Oct 1, 2016
- BMC Proceedings
With the rapidly decreasing cost of the next-generation sequencing technology, a large number of whole genome sequences have been generated, enabling researchers to survey rare variants in the protein-coding and regulatory regions of the genome. However, it remains a daunting task to identify functional variants associated with complex diseases from whole genome sequencing (WGS) data because of the millions of candidate variants and yet moderate sample size. We propose to incorporate the Encyclopedia of DNA Elements (ENCODE) information in the association analysis of WGS data to boost the statistical power. We use the RegulomeDB and PolyPhen2 scores as external weights in existing rare variants association tests. We demonstrate the proposed framework using the WGS data and blood pressure phenotype from the San Antonio Family Studies provided by the Genetic Analysis Workshop 19. We identified a genome-wide significant locus in gene SNUPN on chromosome 15 that harbors a rare nonsynonymous variant, which was not detected by benchmark methods that did not incorporate biological information, including the T5 burden test and sequence kernel association test.
- Research Article
26
- 10.1136/jmedgenet-2014-102656
- Jan 16, 2015
- Journal of Medical Genetics
Objectives: Recently, several studies documented that de novo mutations (DNMs) play important roles in the aetiology of sporadic diseases. Next-generation sequencing (NGS) enables variant calling at single-base resolution on a genome-wide...
- Research Article
46
- 10.1038/s42003-021-02777-9
- Nov 5, 2021
- Communications Biology
There is currently a dearth of accessible whole genome sequencing (WGS) data for individuals residing in the Americas with Sub-Saharan African ancestry. We generated whole genome sequencing data at intermediate (15×) coverage for 2,294 individuals with large amounts of Sub-Saharan African ancestry, predominantly Atlantic African admixed with varying amounts of European and American ancestry. We performed extensive comparisons of variant callers, phasing algorithms, and variant filtration on these data to construct a high quality imputation panel containing data from 2,269 unrelated individuals. With the exception of the TOPMed imputation server (which notably cannot be downloaded), our panel substantially outperformed other available panels when imputing African American individuals. The raw sequencing data, variant calls and imputation panel for this cohort are all freely available via dbGaP and should prove an invaluable resource for further study of admixed African genetics.
- Supplementary Content
9
- 10.4172/2469-9853.1000154
- Feb 13, 2019
- KTH Publication Database DiVA (KTH Royal Institute of Technology)
Whole exome sequencing (WES) has been extensively used in genomic research. As sequencing costs decline it is being replaced by whole genome sequencing (WGS) in large-scale genomic studies, but more comparative information on WES and WGS datasets would be valuable. Thus, we have extensively compared variant calls obtained from WGS and WES of matched germline DNA samples from 96 lung cancer patients. WGS provided more homogeneous coverage with higher genotyping quality, and identified more variants, than WES, regardless of exome coverage depth. It also called more reference variants, reflecting its power to call rare variants, and more heterozygous variants that met applied quality criteria, indicating that WGS is less prone to allelic drop outs. However, increasing WES coverage reduced the discrepancy between the WES and WGS results. We believe that as sequencing costs further decline WGS will become the method of choice even for research confined to the exome.
- Research Article
- 10.1097/01.ogx.0001179580.14616.46
- Jan 1, 2026
- Obstetrical & Gynecological Survey
Germline mutations are heritable; they occur before the formation of a fertilized egg and are found in all cells. They can be detected through somatic tissue sampling, and de novo mutations (DNMs) are well-studied. The majority of known DNMs originate in paternal cells, but some include maternal contributions as well. Certain kinds of DNMs prevent a fertilized egg from developing to term, and these are much less well-characterized. Some also lack sequence variants in certain genes and some are never observed in a homozygous form; these are also not well studied. Recombination failure can cause aneuploidies (trisomies or monosomies), and an estimated half of pregnancy losses are explained by this phenomenon. Early pregnancy loss is understudied, and there are few therapeutic interventions. This study, the Copenhagen Pregnancy Loss (COPL) study, was designed to contribute to the understanding of pregnancy loss through trios of patients (mother, father, and fetus) with clinically diagnosed pregnancy loss, attempting to document sequence diversity and interplay between meiotic recombination and point mutations. This study included 664 cases of early pregnancy loss with 1439 fetal samples (multiple were collected from each loss, where possible). In 467 of the 664 cases, there was at least 1 fetal sample and 1 sample from both parents. A total of 59 losses indicated a higher-than-expected kinship with the mother, and 11 indicated a higher kinship with the father. Whole-genome sequencing (WGS) was used to assess aneuploidies, and detected them in 206 cases. Of these, monosomy X and trisomy 16 were the most common. In addition, 19 large de novo copy number variants (CNVs) were detected in 14 loss cases, none of which were near a common fragile site. Of these 14 cases, 11 were euploid losses and 6 contained aneuploidies. Failure at meiosis I results in the presence of both homologous chromosomes from the same parent. 
An estimated 27.2% of paternal and 32.3% of maternal triploidies occur at recombination hotspots, supporting the idea of meiosis failure. A total of 15,086 DNMs were pinpointed as paternal and 5967 as maternal, consistent with previous literature supporting a high paternal contribution to DNMs. Consistent with this, paternal triploidies showed a proportionally higher paternal fraction of phased mutations. DNMs shown in maternal triploidies indicated a lower paternal fraction than euploid fetuses. In addition, there was no correlation between sister/homologous state differences for high-AB DNMs in paternal triploidies. When searching for pathogenic single-site variants (SSVs) in the DNMs, 26 genotypes were found that were pathogenic or likely pathogenic; a total of 23 were DNMs and 3 were biallelic predicted loss-of-function variants (pLoF). The frequency of pathogenic SSVs in early pregnancy loss was higher compared with controls (odds ratio [OR] 2.98, P = 5.7×10^−6), and this effect remained after correction for parental age. These results showed probable genetic causes for pregnancy loss in 254 of 467 cases, including aneuploidies, triploidies, pathogenic SSVs, and de novo CNVs. Most of the genetic causes of loss originated on maternal chromosomes, and fetuses with triploidies had significantly more DNMs than fetuses that were euploid. These results indicate significant sequence diversity in early pregnancy loss, with additional diversity likely present but unidentified in the stages between implantation and clinically recognized pregnancy. Future research should focus on potential explanations for early pregnancy loss that cannot be explained by genetic causes, as well as on potential interventions for these cases.
- Research Article
21
- 10.1038/s41598-019-52614-7
- Nov 6, 2019
- Scientific reports
The success of next-generation sequencing depends on the accuracy of variant calls. Few objective protocols exist for QC following variant calling from whole genome sequencing (WGS) data. After applying QC filtering based on Genome Analysis Tool Kit (GATK) best practices, we used genotype discordance of eight samples that were sequenced twice each to evaluate the proportion of potentially inaccurate variant calls. We designed a QC pipeline involving hard filters to improve replicate genotype concordance, which indicates improved accuracy of genotype calls. Our pipeline analyzes the efficacy of each filtering step. We initially applied this strategy to well-characterized variants from the ClinVar database, and subsequently to the full WGS dataset. The genome-wide biallelic pipeline removed 82.11% of discordant and 14.89% of concordant genotypes, and improved the concordance rate from 98.53% to 99.69%. The variant-level read depth filter most improved the genome-wide biallelic concordance rate. We also adapted this pipeline for triallelic sites, given the increasing proportion of multiallelic sites as sample sizes increase. For triallelic sites containing only SNVs, the concordance rate improved from 97.68% to 99.80%. Our QC pipeline removes many potentially false positive calls that pass in GATK, and may inform future WGS studies prior to variant effect analysis.
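The replicate-based evaluation described here comes down to comparing genotypes at sites called in both sequencing runs of the same sample, before and after a filter. The sketch below shows that calculation under stated assumptions: genotypes are allele-index tuples keyed by `(chrom, pos)`, a single read depth is available per site, and both function names and the threshold are hypothetical rather than taken from the published pipeline.

```python
def concordance_rate(rep1, rep2):
    """Fraction of sites called in both replicates whose genotypes agree."""
    shared = rep1.keys() & rep2.keys()
    if not shared:
        return float("nan")
    matches = sum(1 for site in shared if rep1[site] == rep2[site])
    return matches / len(shared)


def filter_by_depth(genotypes, depth, min_depth):
    """Keep only sites whose read depth meets a hard threshold.

    Sites with no recorded depth are treated as depth 0 and dropped.
    """
    return {
        site: gt
        for site, gt in genotypes.items()
        if depth.get(site, 0) >= min_depth
    }
```

Comparing `concordance_rate` before and after `filter_by_depth`, along with how many concordant sites the filter discards, gives a per-step view of the trade-off between removing discordant and concordant genotypes.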
- Research Article
27
- 10.1093/nar/gkac511
- Jun 17, 2022
- Nucleic Acids Research
De novo mutations (DNMs) are an important cause of genetic disorders. The accurate identification of DNMs from sequencing data is therefore fundamental to rare disease research and diagnostics. Unfortunately, identifying reliable DNMs remains a major challenge due to sequence errors, uneven coverage, and mapping artifacts. Here, we developed a deep convolutional neural network (CNN) DNM caller (DeNovoCNN), that encodes the alignment of sequence reads for a trio as 160 × 164 resolution images. DeNovoCNN was trained on DNMs of 5616 whole exome sequencing (WES) trios achieving total 96.74% recall and 96.55% precision on the test dataset. We find that DeNovoCNN has increased recall/sensitivity and precision compared to existing DNM calling approaches (GATK, DeNovoGear, DeepTrio, Samtools) based on the Genome in a Bottle reference dataset and independent WES and WGS trios. Validations of DNMs based on Sanger and PacBio HiFi sequencing confirm that DeNovoCNN outperforms existing methods. Most importantly, our results suggest that DeNovoCNN is likely robust against different exome sequencing and analyses approaches, thereby allowing the application on other datasets. DeNovoCNN is freely available as a Docker container and can be run on existing alignment (BAM/CRAM) and variant calling (VCF) files from WES and WGS without a need for variant recalling.
- Research Article
14
- 10.1007/s10815-020-01921-4
- Aug 11, 2020
- Journal of Assisted Reproduction and Genetics
To explore a new preimplantation genetic testing (PGT) method for de novo mutations (DNMs) combined with chromosomal balanced translocations by whole-genome sequencing (WGS) using the MGISEQ-2000 sequencer. Two families, one with maternal Olmsted syndrome caused by DNM (c.1246C>T) in TRPV3 and a paternal Robertsonian translocation and one with paternal Marfan syndrome caused by DNM (c.4952_4955delAATG) in FBN1 and a maternal reciprocal translocation, underwent PGT for monogenetic disease (PGT-M), chromosomal aneuploidy, and structural rearrangement. WGS of embryos and family members was performed. Bioinformatics analysis based on gradient sequencing depth was performed, and parent-embryo haplotyping was conducted for DNM diagnosis. Sanger sequencing, karyotyping, and chromosomal microarray analysis were performed using an amniotic fluid sample to confirm the PGT results. After 1 PGT cycle, WGS of 2 embryos from the Olmsted syndrome family revealed euploid embryos without DNMs; after 2 cycles, the 11 embryos from the Marfan syndrome family showed only 1 normal embryo without DNM, copy number variations (CNVs), or aneuploidy. Moreover, 1 blastocyst from the Marfan syndrome family was transferred back to the uterus; the amniocentesis test results were confirmed by PGT and a healthy infant was born. WGS based on parent-embryo haplotypes was an effective strategy for PGT of DNMs combined with a chromosomal balanced translocation. Our results indicate this is a reliable and effective diagnostic method that is useful for clinical application in PGT of patients with DNMs.
- Research Article
1
- 10.1186/preaccept-1179619571327140
- Jan 1, 2014
- Genome Medicine
INDELs, especially those disrupting protein-coding regions of the genome, have been strongly associated with human diseases. However, there are still many errors with INDEL variant calling, driven by library preparation, sequencing biases, and algorithm artifacts. We characterized whole genome sequencing (WGS), whole exome sequencing (WES), and PCR-free sequencing data from the same samples to investigate the sources of INDEL errors. We also developed a classification scheme based on the coverage and composition to rank high and low quality INDEL calls. We performed a large-scale validation experiment on 600 loci, and find high-quality INDELs to have a substantially lower error rate than low-quality INDELs (7% vs. 51%). Simulation and experimental data show that assembly based callers are significantly more sensitive and robust for detecting large INDELs (>5 bp) than alignment based callers, consistent with published data. The concordance of INDEL detection between WGS and WES is low (53%), and WGS data uniquely identifies 10.8-fold more high-quality INDELs. The validation rate for WGS-specific INDELs is also much higher than that for WES-specific INDELs (84% vs. 57%), and WES misses many large INDELs. In addition, the concordance for INDEL detection between standard WGS and PCR-free sequencing is 71%, and standard WGS data uniquely identifies 6.3-fold more low-quality INDELs. Furthermore, accurate detection with Scalpel of heterozygous INDELs requires 1.2-fold higher coverage than that for homozygous INDELs. Lastly, homopolymer A/T INDELs are a major source of low-quality INDEL calls, and they are highly enriched in the WES data. Overall, we show that accuracy of INDEL detection with WGS is much greater than WES even in the targeted region. We calculated that 60X WGS depth of coverage from the HiSeq platform is needed to recover 95% of INDELs detected by Scalpel. 
While this is higher than current sequencing practice, the deeper coverage may save total project costs because of the greater accuracy and sensitivity. Finally, we investigate sources of INDEL errors (for example, capture deficiency, PCR amplification, homopolymers) with various data that will serve as a guideline to effectively reduce INDEL errors in genome sequencing.
- Research Article
1
- 10.1002/alz.065290
- Jun 1, 2023
- Alzheimer's & Dementia
Background: The Genome Center for Alzheimer’s Disease (GCAD) coordinates the integration and meta‐analysis of all available Alzheimer’s disease (AD) relevant whole genome sequencing (WGS) data with the goal of identifying AD risk or protective genetic variants and eventual therapeutic targets. The WGS datasets are generated via the collaboration of scientists from the Alzheimer’s Disease Sequencing Project (ADSP) and GCAD. With the vision to minimize data heterogeneity, introduced by different sequencing protocols and machines, GCAD processes all samples using identical pipelines and performs quality assurance (QA) checks. Methods: Raw sequencing data (FASTQs or BAMs) were aligned to GRCh38/hg38 by BWA, and variant calling and joint genotyping were done by GATK. Furthermore, Smoove, Manta and Strelka were applied to generate structural variant (SV) calls per sample. QA checks including sex, contamination and genotype concordance as well as the ADSP QC protocol were performed to evaluate the quality of samples and variants. To facilitate the access and usage of the big joint‐genotyped VCF files, we introduced a compact version for storing variant info and sample genotypes only. Results: We dropped 235 (1.3%) samples of poor coverage (<20x) or that failed QA checks, and we flagged 173 (1.0%) samples that were of borderline quality. As a result, the dataset (ADSP Release 3, 2021) includes 16,905 genomes from 17 diverse cohorts with 3 major ethnicities: 10,651 Non‐Hispanic Whites, 3,212 Hispanics and 2,874 African Americans. Data are deeply sequenced (average genome coverage: >30x). All samples’ CRAMs, gVCFs from GATK, and VCFs from the three SV callers were deposited into NIAGADS Data Sharing Service (DSS) (https://dss.niagads.org/) for public distribution. In addition, joint‐genotype VCFs are available in both compact and QC versions.
This joint‐genotype VCF contains >206M bi‐allelic single‐nucleotide variants, 16M bi‐allelic indels and 28M multi‐allelic variants, with 96% of variants remaining after stringent QC. Conclusion: The ADSP and GCAD generate high quality genotype calls and SV calls. Currently the project is processing ∼37,000 WGS samples sequenced primarily through the ADSP Follow‐Up Study, which will contain a more ancestrally diverse set of populations. We anticipate this 2022 release will continue to benefit the research community studying AD genetics.
- Research Article
17
- 10.3389/fimmu.2022.906328
- Jun 30, 2022
- Frontiers in Immunology
Background: Knowledge of the genetic variation underlying Primary Immune Deficiency (PID) is increasing. Reanalysis of genome-wide sequencing data from undiagnosed patients with suspected PID may improve the diagnostic rate. Methods: We included patients monitored at the Department of Infectious Diseases or the Child and Adolescent Department, Rigshospitalet, Denmark, for a suspected PID, who had been analysed previously using a targeted PID gene panel (457 PID-related genes) on whole exome- (WES) or whole genome sequencing (WGS) data. A literature review was performed to extend the PID gene panel used for reanalysis of single nucleotide variation (SNV) and small indels. Structural variant (SV) calling was added on WGS data. Results: Genetic data from 94 patients (86 adults) including 36 WES and 58 WGS was reanalysed a median of 23 months after the initial analysis. The extended gene panel included 208 additional PID-related genes. Genetic reanalysis led to a small increase in the proportion of patients with new suspicious PID related variants of uncertain significance (VUS). The proportion of patients with a causal genetic diagnosis was constant. In total, five patients (5%, including three WES and two WGS) had a new suspicious PID VUS identified due to reanalysis. Among these, two patients had a variant added due to the expansion of the PID gene panel, and three patients had a variant reclassified to a VUS in a gene included in the initial PID gene panel. The total proportion of patients with PID related VUS, likely pathogenic, and pathogenic variants increased from 43 (46%) to 47 (50%), as one patient had a VUS detected in both initial- and reanalysis.
In addition, we detected new suspicious SNVs and SVs of uncertain significance in PID candidate genes with unknown inheritance and/or as heterozygous variants in genes with autosomal recessive inheritance in 8 patients. Conclusion: These data indicate a possible diagnostic gain of reassessing WES/WGS data from patients with suspected PID. Reasons for the possible gain included improved knowledge of genotype-phenotype correlation, expanding the gene panel, and adding SV analyses. Future studies of genotype-phenotype correlations may provide additional knowledge on the impact of the new suspicious VUSs.
- Research Article
54
- 10.1038/srep02161
- Jul 8, 2013
- Scientific Reports
The recent development of massively parallel sequencing technology has allowed the creation of comprehensive catalogs of genetic variation. However, due to the relatively high sequencing error rate for short read sequence data, sophisticated analysis methods are required to obtain high-quality variant calls. Here, we developed a probabilistic multinomial method for the detection of single nucleotide variants (SNVs) as well as short insertions and deletions (indels) in whole genome sequencing (WGS) and whole exome sequencing (WES) data for single sample calling. Evaluation with DNA genotyping arrays revealed a concordance rate of 99.98% for WGS calls and 99.99% for WES calls. Sanger sequencing of the discordant calls determined the false positive and false negative rates for the WGS (0.0068% and 0.17%) and WES (0.0036% and 0.0084%) datasets. Furthermore, short indels were identified with high accuracy (WGS: 94.7%, WES: 97.3%). We believe our method can contribute to the greater understanding of human diseases.