NGSTroubleFinder: a tool for detection and quantification of contamination and kinship across human NGS data.

Abstract

Quality control constitutes a critical component of any next-generation sequencing (NGS) pipeline; however, most existing pipelines emphasize technical quality assessment (e.g. read quality, alignment metrics, duplication rates) while overlooking other equally important dimensions, such as sample identity verification, contamination detection, kinship analysis, and metadata concordance. Detecting issues like cross-sample contamination and sample swaps is essential to control data integrity. Here, we present NGSTroubleFinder, a novel tool to detect cross-sample contamination in human whole-genome and whole-transcriptome sequencing data, sample swaps, and mismatches between the reported and the inferred genetic and transcriptomic sexes. It can be run directly on BAM/CRAM files without requiring additional variant-calling steps and offers an integrated pipeline for ensuring quality control on NGS data, generated particularly within the context of clinical studies or research projects involving family members. It produces a detailed report that combines the results of its multiple analyses, including kinship, sex prediction, and contamination metrics. The tool reports extensive information on the samples, both in textual and HTML formats, including key plots for easy interpretation of the results. NGSTroubleFinder is written in Python and incorporates a custom-built parallelized pileup engine written in C, and it can be easily installed with pip. The tool source code and the models are freely available on GitHub (https://github.com/STALICLA-RnD/NGSTroubleFinder), and a containerized version is available on Docker Hub (https://hub.docker.com/r/staliclarnd/ngstroublefinder).
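One common way to see cross-sample contamination in pileup data is through allele fractions at known variant sites. The sketch below illustrates that general idea only. It is not NGSTroubleFinder's algorithm, and the function name, thresholds, and example counts are all hypothetical. In a clean diploid sample, alternate-allele fractions cluster near 0, 0.5, and 1, so a large share of sites falling between those clusters can hint at a second genome in the mix.

```python
from typing import Iterable, Tuple


def off_band_fraction(
    counts: Iterable[Tuple[int, int]],
    min_depth: int = 20,   # hypothetical depth cutoff
    low: float = 0.05,     # hypothetical band limits
    high: float = 0.25,
) -> float:
    """Share of well-covered sites whose alt-allele fraction lies in
    (low, high) or (1 - high, 1 - low), i.e. away from 0, 0.5 and 1."""
    usable = 0
    off_band = 0
    for ref, alt in counts:
        depth = ref + alt
        if depth < min_depth:
            continue  # skip sites with too few reads to judge
        usable += 1
        af = alt / depth
        if low < af < high or (1 - high) < af < (1 - low):
            off_band += 1
    return off_band / usable if usable else 0.0


# Hypothetical (ref, alt) read counts at four known SNP sites.
clean = [(50, 0), (24, 26), (0, 48), (30, 28)]
mixed = [(45, 5), (24, 26), (6, 44), (30, 28)]

clean_score = off_band_fraction(clean)
mixed_score = off_band_fraction(mixed)
print(clean_score, mixed_score)
```

A real tool would need population allele frequencies, error modelling, and calibrated thresholds; this sketch only shows why pileup counts alone, without a separate variant-calling step, can carry a contamination signal.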

Similar Papers
  • Research Article
  • 10.1158/1538-7755.disp13-c04
Abstract C04: Comparative analysis of bioinformatic tools for the detection of viral DNA sequences in tumor cells
  • Nov 1, 2014
  • Cancer Epidemiology, Biomarkers & Prevention
  • Barbara Swanson + 3 more

Cancer health disparities exist among different ethnicities and races. Various factors such as lifestyle, environmental exposure, genetics, and epigenetics are thought to play a role in the existence of these disparities. Viruses, which outnumber cells in the human body by 100-fold, have a major impact on human health. Viruses have been estimated to cause 15-20% of human cancers, and we expect that some of these oncogenic viruses may have the potential to impact health disparities. Next generation sequencing (NGS) technologies are being used to investigate novel virus-cancer associations and interactions, and several bioinformatics tools for the detection and analysis of virus sequences in human NGS data have recently become available. These tools have not been validated, however, for use in human cancers with extremely low levels of viral sequences. In this study, we have compared the sensitivity and specificity of READSCAN to a manually constructed analysis workflow in a common, commercial NGS web application. READSCAN is a freely available software application that utilizes automated “digital subtraction” to eliminate host reads and identify specific virus sequences in NGS data. The functionality of this tool was compared to a workflow constructed within the web interface for the commercially available NGS software Partek. This constructed workflow first subtracted human sequences and then aligned the remaining sequences against select viral genomes. The sensitivity of the various tools was compared using NGS data obtained from human cytomegalovirus (HCMV) infected cells. Specificity of the tools was determined by analyzing the same data set against the Epstein-Barr virus (EBV) and human papillomavirus type 16 (HPV-16) genomes. Partek and READSCAN detected 69.66% and 60.54% of the input reads as HCMV sequences, respectively. Under these conditions, neither bioinformatics tool detected EBV- or HPV-16-specific reads in the HCMV-infected cell data. 
This indicated that both Partek and READSCAN were capable of readily and specifically detecting large amounts of virus reads in NGS data from infected cells. To test the ability of these tools to detect virus-specific reads in NGS data from tumors, data obtained from the HPV-16-positive cervical squamous cell carcinoma cell line SiHa were analyzed. Both Partek and READSCAN detected six reads out of 969,798 total reads as HPV-16 sequences; EBV or HCMV sequences were not detected. Our results are in agreement with previously published observations for this SiHa cell line, in which five HPV-16-specific reads were detected. While we are still in the process of testing additional bioinformatics tools and configurations, our results attest that the open-source bioinformatics tool READSCAN and the commercially available Partek are comparable in performance but differ in utility. Partek is faster, has a user-friendly interface, and does not require knowledge of Linux commands. READSCAN lacks these features, but it is free and its digital subtraction algorithm is more clearly described in the literature. Next steps include utilizing a wider range of digital subtraction software, evaluating a broader range of NGS data from demographically diverse origins, and investigating discordance as it is found between alternative tools and configuration parameters. Ultimately, an ensemble of tools and configuration contexts is expected to yield critical information on the role of viruses in cancer and associated health disparities.
Citation Format: Barbara Swanson, Scott Harrison, Dukka KC, Perpetua Muganda. Comparative analysis of bioinformatic tools for the detection of viral DNA sequences in tumor cells. [abstract]. In: Proceedings of the Sixth AACR Conference: The Science of Cancer Health Disparities; Dec 6–9, 2013; Atlanta, GA. Philadelphia (PA): AACR; Cancer Epidemiol Biomarkers Prev 2014;23(11 Suppl):Abstract nr C04. 
doi:10.1158/1538-7755.DISP13-C04

  • Research Article
  • 10.1158/1538-7445.am2015-4309
Abstract 4309: Description of a scientific treatment approach of mast cell leukemia, an aggressive orphan hematologic disorder: strategy based on next-generation sequencing data
  • Aug 1, 2015
  • Cancer Research
  • Jeonghwan Youk + 7 more

Background: Mast cell leukemia (MCL) is the most aggressive form of systemic mastocytosis. Owing to its rarity, neither its pathogenesis nor a standard treatment has been established for this orphan disease. Hence, we tried to treat a patient with MCL based on the exome and transcriptome sequencing results of the patient's own DNA and RNA. Brief Case History and Results: In October 2013, an 18-year-old Korean woman was diagnosed with MCL after presenting to our hospital with left knee and ankle pain and inguinal lymphadenopathy. C-KIT overexpression was observed by immunohistochemistry. Whole-exome sequencing (WES) revealed neither a notable single-nucleotide variant (SNV) nor a copy number change. Interestingly, whole transcriptome sequencing (WTS) revealed a KIT S476I mutation, the functional significance of which is not known. Fusion analysis of the WTS data raised the possibility of a RARα-B2M fusion; however, it was not validated by PCR sequencing. When RNA expression analysis was performed using WTS data, upregulation of the PIK3/AKT pathway, which is downstream of KIT (BAD phosphorylation), and mTOR was observed. Regarding treatment, she did not achieve complete remission after cytarabine and idarubicin chemotherapy. Based on our WES and WTS results, we first tried all-trans retinoic acid targeting RARα, which failed to demonstrate efficacy. She then received dasatinib targeting KIT, which produced a transient response lasting 2 weeks. She is now receiving everolimus, which targets the mTOR pathway, and further treatment with a PI3K inhibitor is planned in case of disease progression. Conclusions: We present a case of an orphan disease in which we applied a targeted approach based on the patient's WES and WTS data. Final treatment outcomes will be reported shortly, and the utility of this kind of approach remains to be validated. 
Citation Format: Jeonghwan Youk, Youngil Koh, Ji-Won Kim, Dae-Yoon Kim, Woo June Jung, Kwang-Sung Ahn, Sung-Soo Yoon, Hye Lim Jung. Description of a scientific treatment approach of mast cell leukemia, an aggressive orphan hematologic disorder: strategy based on next-generation sequencing data. [abstract]. In: Proceedings of the 106th Annual Meeting of the American Association for Cancer Research; 2015 Apr 18-22; Philadelphia, PA. Philadelphia (PA): AACR; Cancer Res 2015;75(15 Suppl):Abstract nr 4309. doi:10.1158/1538-7445.AM2015-4309

  • Research Article
  • Cited by 60
  • 10.1186/1756-0500-7-864
Comparison of insertion/deletion calling algorithms on human next-generation sequencing data
  • Jan 1, 2014
  • BMC Research Notes
  • Dalia H Ghoneim + 3 more

Background: Insertions/deletions (indels) are the second most common type of genomic variant and the most common type of structural variant. Identification of indels in next generation sequencing data is a challenge, and algorithms commonly used for indel detection have not been compared on a research cohort of human subject genomic data. Guidelines for the optimal detection of biologically significant indels are limited. We analyzed three sets of human next generation sequencing data (48 samples of a 200 gene target exon sequencing, 45 samples of whole exome sequencing, and 2 samples of whole genome sequencing) using three algorithms for indel detection (Pindel, Genome Analysis Tool Kit's UnifiedGenotyper and HaplotypeCaller). Results: We observed variation in indel calls across the three algorithms. The intersection of the three tools comprised only 5.70% of targeted exon, 19.52% of whole exome, and 14.25% of whole genome indel calls. The majority of the discordant indels were of lower read depth and likely to be false positives. When software parameters were kept consistent across the three targets, HaplotypeCaller produced the most reliable results. Pindel results did not validate well without adjustments to parameters to account for varied read depth and number of samples per run. Adjustments to Pindel's M (minimum support for event) parameter improved both concordance and validation rates. Pindel was able to identify large deletions that surpassed the length capabilities of the GATK algorithms. Conclusions: Despite the observed variability in indel identification, we discerned strengths among the individual algorithms on specific data sets. This allowed us to suggest best practices for indel calling. Pindel's low validation rate of indel calls made in targeted exon sequencing suggests that HaplotypeCaller is better suited for short indels and multi-sample runs in targets with very high read depth. 
Pindel allows for optimization of minimum support for events and is best used for detection of larger indels at lower read depths. Electronic supplementary material: The online version of this article (doi:10.1186/1756-0500-7-864) contains supplementary material, which is available to authorized users.
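The three-way intersection percentages quoted above can be pictured as a set computation. The sketch below assumes that each call is keyed by (chromosome, position, ref, alt) and that the denominator is the union of all calls, which the abstract does not state. The function name and the example calls are hypothetical.

```python
from typing import Set, Tuple

Call = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def three_way_concordance(a: Set[Call], b: Set[Call], c: Set[Call]) -> float:
    """Fraction of all distinct calls that every one of the three callers made."""
    union = a | b | c
    if not union:
        return 0.0
    return len(a & b & c) / len(union)


# Hypothetical call sets standing in for three indel callers.
caller_1 = {("chr1", 100, "A", "AT"), ("chr1", 200, "GC", "G"), ("chr2", 50, "T", "TAA")}
caller_2 = {("chr1", 100, "A", "AT"), ("chr1", 200, "GC", "G")}
caller_3 = {("chr1", 100, "A", "AT"), ("chr3", 10, "CAG", "C")}

shared = three_way_concordance(caller_1, caller_2, caller_3)
print(f"{shared:.2%}")
```

In practice the same indel can be written in more than one way, so calls usually need to be normalized (for example, left-aligned) before an exact-match comparison like this one is meaningful.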

  • Preprint Article
  • 10.7490/f1000research.1110812.1
Comparing algorithms to genotype short tandem repeats in next-generation sequencing data
  • Oct 15, 2015
  • Harriet Dashnow + 1 more

Short tandem repeats (STRs) are short (2-6bp) DNA sequences repeated in tandem, which make up approximately 3% of the human genome. These loci are prone to frequent mutations and high polymorphism. Dozens of neurological and developmental disorders have been attributed to STR expansions. STRs have also been implicated in a range of functions such as DNA replication and repair, chromatin organisation and regulation of gene expression. Traditionally, STR variation has been measured using capillary gel electrophoresis. This process is time-consuming and expensive, and so has tended to limit STR analysis to a handful of loci. Next-generation sequencing has the potential to address these problems. However, determining STR lengths using next-generation sequencing data is difficult. For example, many callers are limited by sequencing read lengths, and polymerase slippage during PCR amplification introduces stutter noise. Recently, a small number of software tools have been developed to genotype STRs in next-generation sequencing data. We have performed a general comparison of the tools published to date, identifying their application domains, assumptions and limitations. We have assessed the performance of some of the most popular STR genotyping tools on human next-generation sequencing data. When comparing STR callers we have observed drastic differences in which STR loci are identified as variant. Surprisingly, even for variant loci reported in common between tools, there is markedly low concordance between the specific genotype calls. Finally, we draw together our findings to comment on the considerations when choosing and running an STR genotyping tool, with an emphasis on applications to human disease.

  • Preprint Article
  • 10.7490/f1000research.1110901.1
Comparing algorithms to genotype short tandem repeats in next-generation sequencing data
  • Oct 29, 2015
  • F1000Research
  • Harriet Dashnow + 1 more

Short tandem repeats (STRs) are short (2-6bp) DNA sequences repeated in tandem, which make up approximately 3% of the human genome. These loci are prone to frequent mutations and high polymorphism. Dozens of neurological and developmental disorders have been attributed to STR expansions. STRs have also been implicated in a range of functions such as DNA replication and repair, chromatin organisation and regulation of gene expression. Traditionally, STR variation has been measured using capillary gel electrophoresis. This process is time-consuming and expensive, and so has tended to limit STR analysis to a handful of loci. Next-generation sequencing has the potential to address these problems. However, determining STR lengths using next-generation sequencing data is difficult. For example, many callers are limited by sequencing read lengths, and polymerase slippage during PCR amplification introduces stutter noise. Recently, a small number of software tools have been developed to genotype STRs in next-generation sequencing data. We have performed a general comparison of the tools published to date, identifying their application domains, assumptions and limitations. We have assessed the performance of some of the most popular STR genotyping tools on human next-generation sequencing data. When comparing STR callers we have observed drastic differences in which STR loci are identified as variant. Surprisingly, even for variant loci reported in common between tools, there is markedly low concordance between the specific genotype calls. Finally, we draw together our findings to comment on the considerations when choosing and running an STR genotyping tool, with an emphasis on applications to human disease.

  • Research Article
  • Cited by 6
  • 10.5045/br.2016.51.1.17
A scientific treatment approach for acute mast cell leukemia: using a strategy based on next-generation sequencing data
  • Mar 1, 2016
  • Blood research
  • Jeonghwan Youk + 11 more

Background: Mast cell leukemia (MCL) is the most aggressive form of systemic mastocytosis disorders. Owing to its rarity, neither pathogenesis nor standard treatment is established for this orphan disease. Hence, we tried to treat a patient with MCL based on the exome and transcriptome sequencing results of the patient's own DNA and RNA. Methods: First, tumor DNA and RNA were extracted from bone marrow at the time of diagnosis. Germline DNA was extracted from the patient's saliva 45 days after induction chemotherapy and used as a control. Then, we performed whole-exome sequencing (WES) using the DNA and whole transcriptome sequencing (WTS) using the RNA. Single nucleotide variants (SNVs) were called using MuTect and GATK. Samtools, FusionMap, and Gene Set Enrichment Analysis were utilized to analyze WTS results. Results: WES and WTS results revealed a KIT S476I mutation. Fusion analysis was performed using WTS data, which suggested a possible RARα-B2M fusion. When RNA expression analysis was performed using WTS data, upregulation of the PIK3/AKT pathway, downstream of KIT and mTOR, was observed. Based on our WES and WTS results, we first administered all-trans retinoic acid, then dasatinib, and finally, an mTOR inhibitor. Conclusion: We present a case of an orphan disease in which we applied a targeted approach based on the patient's WES and WTS data. Even though our treatment was not successful, our approach warrants further validation.

  • Research Article
  • Cited by 5
  • 10.3390/life12101583
Human Retrotransposons and Effective Computational Detection Methods for Next-Generation Sequencing Data.
  • Oct 12, 2022
  • Life
  • Haeun Lee + 3 more

Transposable elements (TEs) are classified into two classes according to their mobilization mechanism. Whereas DNA transposons move by a “cut and paste” mechanism, retrotransposons mobilize via a “copy and paste” mechanism. They have been an essential research topic because some of the active elements, such as Long interspersed element 1 (LINE-1), Alu, and SVA elements, have contributed to the genetic diversity of primates beyond humans. In addition, they can cause genetic disorders by altering gene expression and generating structural variations (SVs). The development and rapid technological advances in next-generation sequencing (NGS) have led to new perspectives on detecting retrotransposon-mediated SVs, especially insertions. Moreover, various computational methods have been developed based on NGS data to precisely detect the insertions and deletions in the human genome. Therefore, this review discusses recently studied and utilized NGS technologies and the effective computational approaches for discovering retrotransposons with them. The final part covers a diverse range of computational methods for detecting retrotransposon insertions with human NGS data. This review will give researchers insight into TEs, how to investigate them, and how they connect to their own research interests.

  • Research Article
  • 10.1158/1538-7445.sabcs21-p2-01-15
Abstract P2-01-15: Developing highly sensitive high NGS data efficient ctDNA detection assays for breast cancer surveillance
  • Feb 15, 2022
  • Cancer Research
  • Aihua Fu + 8 more

Introduction: A growing body of data has established the importance of monitoring dynamic changes in circulating tumor DNA (ctDNA) to identify early signs of therapeutic responses, allowing for timely management of treatment to achieve more effective personalized therapy. Higher assay accuracy and consistency, and lower assay cost will support more clinical validation trials and benefit more cancer patients with non-invasive ctDNA NGS tests that can simultaneously map multiple genomic alterations at an affordable price. Method: The NVIGEN X - Precision Cancer Profiling test is a next generation sequencing (NGS) based circulating tumor DNA detection assay using the hybridization capture approach with customized gene panels. Our ctDNA NGS assay was developed with the use of high performance magnetic nanobeads, which enhance assay workflow at key steps including cfDNA extraction, NGS library preparation, and target enrichment. Experiments with individual plasma samples and DNA mutant fragments spiked in plasma samples were carried out to establish the assay performance such as sensitivity, specificity, consistency and data efficiency. NGS data QC metrics of the NVIGEN assay were compared with other assays in peer reviewed publications. Results: We developed a focused 32-gene panel that covers 144 kb of gene regions of clinical significance for breast cancer treatment monitoring and guidance, such as AKT1, ERBB2, PIK3CA, EGFR, ESR1, BRCA1/2, and CD274. Our results demonstrated the capability of the NVIGEN X ctDNA NGS assay to detect rare copies (8 cp) of gene mutation at 0.07% MAF from DNA mutant fragments spiked into plasma samples. The NVIGEN X ctDNA NGS assays consistently presented 2-5% duplication rate, >80% on-target rate, <10% CV for key NGS data metrics, and on average required 1.36X paired reads per 1X unique coverage. 
Compared with the Roche Avenio assays (targeted, expanded and surveillance panels) as published in 2020, which on average required 9.36X paired reads per 1X unique coverage, the NVIGEN X - Precision Cancer Profiling assays demonstrated an 85% reduction in the NGS data needed to generate each unique coverage. Compared with the original Capp-seq data as published in the 2014 Nature Medicine paper, which required on average 13.78 or 27.56 paired reads per unique coverage, the NVIGEN X assay demonstrated a >90% reduction in the NGS data needed per unique coverage. Conclusion: The NVIGEN X - Precision Cancer Profiling assay provided high NGS assay performance with high sensitivity, specificity, and consistency, and significantly improved NGS data efficiency. This allows for dramatically reduced assay cost and will help support routine applications of ctDNA NGS tests to improve cancer patient treatment. Experiments of applying NVIGEN X assays for clinical research with patient samples are ongoing and will be presented.
Citation Format: Aihua Fu, Wenwu Cui, Minh V. Ton, Kevan Wang, Weiwei Gu, Tianhong Li, Heather A. Parsons, Minetta C. Liu, George W. Sledge. Developing highly sensitive high NGS data efficient ctDNA detection assays for breast cancer surveillance [abstract]. In: Proceedings of the 2021 San Antonio Breast Cancer Symposium; 2021 Dec 7-10; San Antonio, TX. Philadelphia (PA): AACR; Cancer Res 2022;82(4 Suppl):Abstract nr P2-01-15.
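The reduction percentages follow directly from the reads-per-coverage figures quoted in the abstract. The short check below assumes that "reduction" means one minus the ratio of the new requirement to the baseline; the helper name is hypothetical.

```python
def relative_reduction(new: float, baseline: float) -> float:
    """Fractional reduction of `new` relative to `baseline`."""
    return 1.0 - new / baseline


# Paired reads required per 1X unique coverage, as quoted in the abstract.
nvigen = 1.36
avenio = 9.36
capp_seq = (13.78, 27.56)

vs_avenio = relative_reduction(nvigen, avenio)
vs_capp_seq = [relative_reduction(nvigen, b) for b in capp_seq]

print(f"vs Avenio: {vs_avenio:.1%}")
print("vs Capp-seq:", ", ".join(f"{r:.1%}" for r in vs_capp_seq))
```

Under that definition, the first comparison rounds to the stated 85%, and both Capp-seq baselines give reductions above 90%.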

  • Book Chapter
  • 10.1007/978-3-030-17938-0_23
Integrated Detection of Copy Number Variation Based on the Assembly of NGS and 3GS Data
  • Jan 1, 2019
  • Feng Gao + 2 more

Copy number variations (CNVs) cover 5% to 10% of the genome and are an important pathogenic factor in human disease. The detection of large CNVs remains unreliable. Third-generation sequencing (3GS) reads are longer than next-generation sequencing (NGS) reads, which in theory makes long variants detectable. However, the low accuracy of 3GS data makes it difficult to apply in practice, and it largely serves as a supplement to NGS data. To address these problems, we developed a new mutation detection tool named AssCNV23. First, the tool corrects the 3GS data to reduce its high error rate, and then combines the results of a variety of mutation detection tools to improve the accuracy of the initial mutation set and to offset the detection bias of any single tool. At the same time, AssCNV23 uses the high-quality 3GS data to guide assembly of the NGS data and detects CNVs once sequences of sufficient length are obtained. Finally, to improve detection efficiency, the tool generates images containing sequencing depth information based on the read depth strategy and uses a convolutional neural network to detect CNVs. The experimental results show that AssCNV23 achieves a high level of breakpoint accuracy and performs well in identifying large variants. Compared with other tools, the deep learning model has advantages in accuracy and sensitivity, and it scores well on the Matthews correlation coefficient (MCC) in various experiments. This algorithm is relatively reliable.
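For readers unfamiliar with the MCC metric mentioned above, the sketch below computes it from a binary confusion matrix using the standard formula. The counts are hypothetical and are not taken from the AssCNV23 experiments. Returning 0 when the denominator is zero is a common convention assumed here.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: detected CNVs versus a truth set.
score = mcc(tp=90, tn=85, fp=15, fn=10)
print(round(score, 3))
```

MCC ranges from -1 to 1 and stays informative when true CNVs are much rarer than non-CNV regions, which is one reason it is often reported alongside accuracy and sensitivity.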

  • Research Article
  • Cited by 18
  • 10.1186/s12864-020-6455-x
Comparison of multiple algorithms to reliably detect structural variants in pears
  • Jan 20, 2020
  • BMC Genomics
  • Yueyuan Liu + 6 more

Background: Structural variations (SVs) have been reported to play an important role in genetic diversity and trait regulation. Many computer algorithms detecting SVs have recently been developed, but the use of multiple algorithms to detect high-confidence SVs has not been studied. The most suitable sequencing depth for detecting SVs in pear is also not known. Results: In this study, a pipeline to detect SVs using next-generation and long-read sequencing data was constructed. The performances of seven types of SV detection software using next-generation sequencing (NGS) data and two types of software using long-read sequencing data (SVIM and Sniffles), which are based on different algorithms, were compared. Of the nine software packages evaluated, SVIM identified the most SVs, and Sniffles detected SVs with the highest accuracy (> 90%). When the results from multiple SV detection tools were combined, the SVs identified by both MetaSV and IMR/DENOM, which use NGS data, were more accurate than those identified by both SVIM and Sniffles, with mean accuracies of 98.7 and 96.5%, respectively. The software packages using long-read sequencing data required fewer CPU cores and less memory and ran faster than those using NGS data. In addition, according to the performances of assembly-based algorithms using NGS data, we found that a sequencing depth of 50× is appropriate for detecting SVs in the pear genome. Conclusion: This study provides strong evidence that more than one SV detection software package, each based on a different algorithm, should be used to detect SVs with higher confidence, and that long-read sequencing data are better than NGS data for SV detection. The SV detection pipeline that we have established will facilitate the study of diversity in other crops.

  • Research Article
  • Cited by 12
  • 10.1016/j.jmoldx.2022.06.006
NGS4THAL, a One-Stop Molecular Diagnosis and Carrier Screening Tool for Thalassemia and Other Hemoglobinopathies by Next-Generation Sequencing
  • Jul 19, 2022
  • The Journal of molecular diagnostics : JMD
  • Yujie Cao + 15 more

Thalassemia is one of the most common genetic diseases and a major health threat worldwide. Accurate, efficient, and scalable analysis of next-generation sequencing (NGS) data is much needed for its molecular diagnosis and carrier screening. We developed NGS4THAL, a bioinformatics analysis pipeline analyzing NGS data to detect pathogenic variants for thalassemia and other hemoglobinopathies. NGS4THAL realigns ambiguously mapped NGS reads derived from the homologous Hb gene clusters for accurate detection of point mutations and small insertions/deletions. It uses a combination of complementary structural variant (SV) detection tools and an in-house database of control data containing specific SVs to achieve accurate detection of the complex SV types. Detected variants are matched with those in HbVar (A Database of Human Hemoglobin Variants and Thalassemia Mutations), allowing recognition of known pathogenic variants, including disease modifiers. Tested on simulation data, NGS4THAL achieved high sensitivity and specificity. For targeted NGS sequencing data from samples with laboratory-confirmed pathogenic Hb variants, it achieved 100% detection accuracy. Application of NGS4THAL on whole genome sequencing data from unrelated studies revealed thalassemia mutation carrier rates for Hong Kong Chinese and Northern Vietnamese that were consistent with previous reports. NGS4THAL is a highly accurate and efficient molecular diagnosis tool for thalassemia and other hemoglobinopathies based on tailored analysis of NGS data and may be scaled for population carrier screening.

  • Research Article
  • Cited by 2
  • 10.1186/s12859-021-04535-4
ImmunoDataAnalyzer: a bioinformatics pipeline for processing barcoded and UMI tagged immunological NGS data
  • Jan 6, 2022
  • BMC Bioinformatics
  • Julia Vetter + 8 more

Background: Next-generation sequencing (NGS) is nowadays the most used high-throughput technology for DNA sequencing. Among other applications, NGS enables the in-depth analysis of immune repertoires. Research in the field of T cell receptor (TCR) and immunoglobulin (IG) repertoires aids in understanding immunological diseases. A main objective is the analysis of the V(D)J recombination defining the structure and specificity of the immune repertoire. Accurate processing, evaluation and visualization of immune repertoire NGS data is important for better understanding immune responses and immunological behavior. Results: ImmunoDataAnalyzer (IMDA) is a pipeline we have developed for automating the analysis of immunological NGS data. IMDA unites the functionality from carefully selected immune repertoire analysis software tools and covers the whole spectrum from initial quality control up to the comparison of multiple immune repertoires. It provides methods for automated pre-processing of barcoded and UMI tagged immune repertoire NGS data, facilitates the assembly of clonotypes and calculates key figures for describing the immune repertoire. These include commonly used clonality and diversity measures, as well as indicators for V(D)J gene segment usage and between sample similarity. IMDA reports all relevant information in a compact summary containing visualizations, calculations, and sample details, all of which serve for a more detailed overview. IMDA further generates an output file including key figures for all samples, designed to serve as input for machine learning frameworks to find models for differentiating between specific traits of samples. Conclusions: IMDA constructs TCR and IG repertoire data from raw NGS reads and facilitates descriptive data analysis and comparison of immune repertoires. The IMDA workflow focuses on quality control and ease of use for non-computer scientists. 
The provided output directly facilitates the interpretation of input data and includes information about clonality, diversity, clonotype overlap as well as similarity, and V(D)J gene segment usage. IMDA further supports the detection of sample swaps and cross-sample contamination that potentially occurred during sample preparation. In summary, IMDA reduces the effort usually required for immune repertoire data analysis by providing an automated workflow for processing raw NGS data into immune repertoires and subsequent analysis. The implementation is open-source and available on https://bioinformatics.fh-hagenberg.at/immunoanalyzer/.

  • Research Article
  • Cited by 22
  • 10.1186/s12864-016-3449-9
INDELseek: detection of complex insertions and deletions from next-generation sequencing data
  • Jan 5, 2017
  • BMC Genomics
  • Chun Hang Au + 4 more

Background: Complex insertions and deletions (indels) from next-generation sequencing (NGS) data were prone to escape detection by currently available variant callers as shown by large-scale human genomics studies. Somatic and germline complex indels in key disease driver genes could be missed in NGS-based genomics studies. Results: INDELseek is an open-source complex indel caller designed for NGS data of random fragments and PCR amplicons. The key differentiating factor of INDELseek is that each NGS read alignment was examined as a whole instead of “pileup” of each reference position across multiple alignments. In benchmarking against the reference material NA12878 genome (n = 160 derived from high-confidence variant calls), GATK, SAMtools and INDELseek showed complex indel detection sensitivities of 0%, 0% and 100%, respectively. INDELseek also detected all known germline (BRCA1 and BRCA2) and somatic (CALR and JAK2) complex indels in human clinical samples (n = 8). Further experiments validated all 10 detected KIT complex indels in a discovery cohort of clinical samples. In silico semi-simulation showed sensitivities of 93.7–96.2% based on 8671 unique complex indels in >5000 genes from dbSNP and COSMIC. We also demonstrated the importance of complex indel detection in accurately annotating BRCA1, BRCA2 and TP53 mutations with gained or rescued protein-truncating effects. Conclusions: INDELseek is an accurate and versatile tool for complex indel detection in NGS data. It complements other variant callers in NGS-based genomics studies targeting a wide spectrum of genetic variations.

  • Research Article
  • Cited by 50
  • 10.1261/rna.066910.118
DI-tector: defective interfering viral genomes’ detector for next-generation sequencing data
  • Jul 16, 2018
  • RNA
  • Guillaume Beauclair + 5 more

Defective interfering (DI) genomes, or defective viral genomes (DVGs), are truncated viral genomes generated during replication of most viruses, including live viral vaccines. Among these, “panhandle” or copy-back (cb) and “hairpin” or snap-back (sb) DI genomes are generated during RNA virus replication. 5′ cb/sb DI genomes are highly relevant for viral pathogenesis since they harbor immunostimulatory properties that increase virus recognition by the innate immune system of the host. We have developed DI-tector, a user-friendly and freely available program that identifies and characterizes cb/sb genomes from next-generation sequencing (NGS) data. DI-tector confirmed the presence of 5′ cb genomes in cells infected with measles virus (MV). DI-tector also identified a novel 5′ cb genome, as well as a variety of 3′ cb/sb genomes whose existence had not previously been detected by conventional approaches in MV-infected cells. The presence of these novel cb/sb genomes was confirmed by RT-qPCR and RT-PCR, validating the ability of DI-tector to reveal the landscape of DI genome population in infected cell samples. Performance assessment using different experimental and simulated data sets revealed the robust specificity and sensitivity of DI-tector. We propose DI-tector as a universal tool for the unbiased detection of DI viral genomes, including 5′ cb/sb DI genomes, in NGS data.

  • Research Article
  • Cited by 4
  • 10.1186/s12864-024-10018-6
OTSUCNV: an adaptive segmentation and OTSU-based anomaly classification method for CNV detection using NGS data
  • Jan 30, 2024
  • BMC genomics
  • Kun Xie + 5 more

Copy-number variations (CNVs), which refer to deletions and duplications of chromosomal segments, represent a significant source of variation among individuals, contributing to human evolution and being implicated in various diseases ranging from mental illness and developmental disorders to cancer. Despite the development of several methods for detecting copy number variations based on next-generation sequencing (NGS) data, achieving robust detection performance for CNVs with arbitrary coverage and amplitude remains challenging due to the inherent complexity of sequencing samples. In this paper, we propose an alternative method called OTSUCNV for CNV detection on whole genome sequencing (WGS) data. This method utilizes a newly designed adaptive sequence segmentation algorithm and an OTSU-based CNV prediction algorithm, which does not rely on any distribution assumptions or involve complex outlier factor calculations. As a result, the effective detection of CNVs is achieved with lower computational complexity. The experimental results indicate that the proposed method demonstrates outstanding performance, and hence it may be used as an effective tool for CNV detection.
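The name OTSUCNV refers to Otsu's thresholding method, which picks the cut that maximizes between-class variance without assuming a particular distribution. The sketch below shows plain Otsu thresholding on a one-dimensional signal. It does not reproduce OTSUCNV's adaptive segmentation or its classifier, and the depth-ratio values are hypothetical.

```python
from typing import Sequence


def otsu_threshold(values: Sequence[float]) -> float:
    """Return the cut point that maximizes between-class variance
    when `values` are split into a low group and a high group."""
    xs = sorted(values)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    total = sum(xs)
    best_score = -1.0
    best_cut = xs[0]
    left_sum = 0.0
    for i in range(1, n):
        left_sum += xs[i - 1]          # sum of the i smallest values
        if xs[i] == xs[i - 1]:
            continue                   # no valid cut between equal values
        w0 = i / n                     # weight of the low group
        w1 = 1.0 - w0                  # weight of the high group
        m0 = left_sum / i              # mean of the low group
        m1 = (total - left_sum) / (n - i)  # mean of the high group
        score = w0 * w1 * (m0 - m1) ** 2   # between-class variance
        if score > best_score:
            best_score = score
            best_cut = (xs[i - 1] + xs[i]) / 2
    return best_cut


# Hypothetical per-segment depth ratios: five normal, three duplicated.
ratios = [0.98, 1.02, 1.00, 0.97, 1.03, 1.52, 1.48, 1.50]
cut = otsu_threshold(ratios)
flagged = [r for r in ratios if r > cut]
print(cut, flagged)
```

Because the threshold is derived from the data itself, this kind of rule needs no fitted distribution, which matches the abstract's description of a method free of distribution assumptions.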
