GCCVision: An integrated toolkit for calculating and visualizing parental genome contribution in breeding populations

TL;DR

GCCVision is an integrated Python-based toolkit that analyzes VCF files from biparental crosses to identify informative SNPs, calculate parental genome contributions, and generate customizable visualizations, thereby streamlining breeding decisions and accelerating crop improvement efforts.

Abstract

Summary: Tracking parental genome contributions in segregating populations is crucial for accelerating genetic gain in plant breeding. We introduce GCCVision (Genome Contribution Calculator and Visualizer), an integrated bioinformatics toolkit to simplify this process. GCCVision uses an efficient Python-based backend and a user-friendly web-based frontend to analyze Variant Call Format (VCF) files from biparental crosses. The software identifies informative single-nucleotide polymorphisms (SNPs), calculates parental contribution rates, and generates clear, customizable graphical genotype maps where chromosome segments are color-coded by parental origin. By providing clear visualizations of genomic composition, GCCVision assists breeders in selection decisions for backcrossing, F2 analysis, quality control of hybrid seeds, and other breeding programs. This streamlined workflow shortens breeding cycles and accelerates the development of improved crop varieties.
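The abstract does not spell out how informative SNPs are chosen or how contribution rates are computed, so the following is only a minimal sketch under stated assumptions, not GCCVision's actual implementation. It assumes that a SNP is "informative" when both parents are homozygous for different alleles, and that the contribution of parent P1 is the share of offspring alleles at those sites that match P1. The function names (`parse_gt`, `informative_alleles`, `p1_contribution`) and the example genotypes are hypothetical.

```python
from typing import List, Optional, Tuple


def parse_gt(gt: str) -> Optional[Tuple[int, int]]:
    """Parse a diploid VCF GT string ('0/1', '1|1'); return None if missing."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return None
    return int(alleles[0]), int(alleles[1])


def informative_alleles(p1_gt: str, p2_gt: str) -> Optional[Tuple[int, int]]:
    """Return (P1 allele, P2 allele) when both parents are homozygous
    for different alleles (assumed definition of 'informative'), else None."""
    p1, p2 = parse_gt(p1_gt), parse_gt(p2_gt)
    if p1 is None or p2 is None:
        return None
    if p1[0] != p1[1] or p2[0] != p2[1] or p1[0] == p2[0]:
        return None
    return p1[0], p2[0]


def p1_contribution(
    sites: List[Tuple[str, str, str]]
) -> Tuple[Optional[float], int]:
    """Given (P1 GT, P2 GT, offspring GT) per site, return the fraction of
    offspring alleles matching P1 over informative sites, and the number
    of sites actually used."""
    p1_count = 0
    sites_used = 0
    for p1_gt, p2_gt, child_gt in sites:
        parents = informative_alleles(p1_gt, p2_gt)
        child = parse_gt(child_gt)
        if parents is None or child is None:
            continue  # uninformative site or missing offspring call
        if any(allele not in parents for allele in child):
            continue  # offspring allele seen in neither parent; skip
        p1_count += sum(1 for allele in child if allele == parents[0])
        sites_used += 1
    if sites_used == 0:
        return None, 0
    return p1_count / (2 * sites_used), sites_used


# Hypothetical genotypes: (P1, P2, offspring) per SNP
example_sites = [
    ("0/0", "1/1", "0/0"),  # informative, offspring homozygous P1
    ("0/0", "1/1", "0/1"),  # informative, offspring heterozygous
    ("1/1", "0/0", "0/0"),  # informative, offspring homozygous P2
    ("0/0", "0/0", "0/0"),  # parents identical: not informative
    ("0/1", "1/1", "1/1"),  # P1 heterozygous: not informative
    ("0/0", "1/1", "./."),  # offspring call missing: skipped
]

fraction, used = p1_contribution(example_sites)
print(f"P1 contribution: {fraction:.2f} over {used} informative sites")
```

A real analysis would read genotypes from the VCF itself and would typically apply quality and depth filters before this step; those are left out here to keep the arithmetic visible.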

Similar Papers
  • Research Article
  • Citations: 79
  • 10.1534/g3.119.400129
Usefulness Criterion and Post-selection Parental Contributions in Multi-parental Crosses: Application to Polygenic Trait Introgression
  • Feb 28, 2019
  • G3: Genes|Genomes|Genetics
  • Antoine Allier + 4 more

Predicting the usefulness of crosses in terms of expected genetic gain and genetic diversity is of interest to secure performance in the progeny and to maintain long-term genetic gain in plant breeding. A wide range of crossing schemes are possible including large biparental crosses, backcrosses, four-way crosses, and synthetic populations. In silico progeny simulations together with genome-based prediction of quantitative traits can be used to guide mating decisions. However, the large number of multi-parental combinations can hinder the use of simulations in practice. Analytical solutions have been proposed recently to predict the distribution of a quantitative trait in the progeny of biparental crosses using information of recombination frequency and linkage disequilibrium between loci. Here, we extend this approach to obtain the progeny distribution of more complex crosses including two to four parents. Considering agronomic traits and parental genome contribution as jointly multivariate normally distributed traits, the usefulness criterion parental contribution (UCPC) enables to (i) evaluate the expected genetic gain for agronomic traits, and at the same time (ii) evaluate parental genome contributions to the selected fraction of progeny. We validate and illustrate UCPC in the context of multiple allele introgression from a donor into one or several elite recipients in maize (Zea mays L.). Recommendations regarding the interest of two-way, three-way, and backcrosses were derived depending on the donor performance. We believe that the computationally efficient UCPC approach can be useful for mate selection and allocation in many plant and animal breeding contexts.

  • Research Article
  • Citations: 3
  • 10.12688/f1000research.109080.1
Recommendations for the formatting of Variant Call Format (VCF) files to make plant genotyping data FAIR.
  • Feb 24, 2022
  • F1000Research
  • Sebastian Beier + 14 more

In this opinion article, we discuss the formatting of files from (plant) genotyping studies, in particular the formatting of (meta-) data in Variant Call Format (VCF) files. The flexibility of the VCF format specification facilitates its use as a generic interchange format across domains but can lead to inconsistency between files in the presentation of metadata. To enable fully autonomous machine actionable data flow, generic elements need to be further specified. We strongly support the merits of the FAIR principles and see the need to facilitate them also through technical implementation specifications. VCF files are an established standard for the exchange and publication of genotyping data. Other data formats are also used to capture variant call data (for example, the HapMap format and the gVCF format), but none currently have the reach of VCF. In VCF, only the sites of variation are described, whereas in gVCF, all positions are listed, and confidence values are also provided. For the sake of simplicity, we will only discuss VCF and our recommendations for its use. However, the part of the VCF standard relating to metadata (as opposed to the actual variant calls) defines a syntactic format but no vocabulary, unique identifier or recommended content. In practice, often only sparse (if any) descriptive metadata is included. When descriptive metadata is provided, proprietary metadata fields are frequently added that have not been agreed upon within the community which may limit long-term and comprehensive interoperability. To address this, we propose recommendations for supplying and encoding metadata, focusing on use cases from the plant sciences. We expect there to be overlap, but also divergence, with the needs of other domains.

  • Research Article
  • Citations: 6
  • 10.7717/peerj.11333
Re-Searcher: GUI-based bioinformatics tool for simplified genomics data mining of VCF files.
  • May 3, 2021
  • PeerJ
  • Daniyar Karabayev + 8 more

Background: High-throughput sequencing platforms generate a massive amount of high-dimensional genomic datasets that are available for analysis. Modern and user-friendly bioinformatics tools for analysis and interpretation of genomics data become essential during the analysis of sequencing data. Different standard data types and file formats have been developed to store and analyze sequence and genomics data. Variant Call Format (VCF) is the most widespread genomics file type and standard format containing genomic information and variants of sequenced samples. Results: Existing tools for processing VCF files don’t usually have an intuitive graphical interface, but instead have just a command-line interface that may be challenging to use for the broader biomedical community interested in genomics data analysis. re-Searcher solves this problem by pre-processing VCF files in chunks so as not to overload the computer's RAM. The tool can be used as a standalone, user-friendly, multiplatform GUI application as well as a web application (https://nla-lbsb.nu.edu.kz). The software, including source code as well as tested VCF files and additional information, is publicly available in the GitHub repository (https://github.com/LabBandSB/re-Searcher).

  • Research Article
  • Citations: 1
  • 10.1016/j.ijpara.2025.12.005
Genetic diversity of Plasmodium falciparum equilibrative nucleoside transporters PfENT1 and PfENT4: Implications for purine-based antimalarial drug development.
  • Dec 1, 2025
  • International journal for parasitology
  • Worlanyo Tashie + 3 more

  • Research Article
  • 10.1158/1538-7445.am2017-2587
Abstract 2587: VCF2CNA: a tool for efficiently detecting copy number alteration using VCF genotype data
  • Jul 1, 2017
  • Cancer Research
  • Daniel K Putnam + 5 more

Whole genome sequencing (WGS) is increasingly used in both research and clinical settings. The Variant Call Format (VCF) specification is a widely adopted file format for genetic variation data exchange partially due to its smaller file size compared to raw WGS BAMs. Each variant in a typical VCF file contains its chromosome position, reference/alternative alleles and corresponding allele counts. This makes it possible to identify copy number alterations (CNAs). To this end, we developed VCF2CNA (http://vcf2cna.stjude.org), a web interface tool for CNA analysis from VCF files. A user of VCF2CNA uploads a VCF file via the provided web interface. The entire analysis runs remotely with an average run time of 23 minutes. Results are emailed to the user as either a downloadable link or file attachments. VCF2CNA also accepts input in the Mutation Annotation Format (MAF) and the variant file format produced by the Bambino program. We analyzed 22 TCGA glioblastoma tumor/normal pairs by Illumina technology to evaluate VCF2CNA’s performance. It achieved high consistency (average F1-score: 0.952 ± 0.082) with CONSERTING, a tool that incorporated read-depth and SV data from raw BAMs for CNA detection. A segment-by-segment comparison between results from CONSERTING and VCF2CNA indicated that the latter was less sensitive to focal CNAs. This is expected because there is less information in the VCF input than in raw BAMs. Further analysis using samples with a “fractured genome” pattern revealed that VCF2CNA was more robust to library artifacts and produced relatively clean CNA profiles (on average 76.2-fold reduction compared to the number of segments reported by CONSERTING). Finally, we analyzed 137 pediatric neuroblastoma samples from the TARGET project, sequenced by Complete Genomics, Inc. (CGI) technology. MYCN amplification has been clinically validated in 33 samples. VCF2CNA identified high amplitude MYCN gains in 32 samples and the remaining sample carried a low-level broad gain covering MYCN. For comparison, CGI’s HMM-based method reported MYCN gains in only 15 out of the 33 samples. VCF2CNA further identified two additional MYCN amplifications among the remaining samples. Collectively, our analysis suggests that VCF2CNA is a platform-independent, efficient, robust and accurate tool for general WGS-based CNA analysis. It further complements CONSERTING, which produces more accurate results in focal CNAs at the cost of significantly higher computational burden. Citation Format: Daniel K. Putnam, Xiaotu Ma, Stephen V. Rice, Yu Liu, Jinghui Zhang, Xiang Chen. VCF2CNA: a tool for efficiently detecting copy number alteration using VCF genotype data [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2017; 2017 Apr 1-5; Washington, DC. Philadelphia (PA): AACR; Cancer Res 2017;77(13 Suppl):Abstract nr 2587. doi:10.1158/1538-7445.AM2017-2587

  • Research Article
  • Citations: 168
  • 10.1093/bioinformatics/btx145
SeqArray-a storage-efficient high-performance data format for WGS variant calls.
  • Mar 16, 2017
  • Bioinformatics (Oxford, England)
  • Xiuwen Zheng + 7 more

Whole-genome sequencing (WGS) data are being generated at an unprecedented rate. Analysis of WGS data requires a flexible data format to store the different types of DNA variation. Variant call format (VCF) is a general text-based format developed to store variant genotypes and their annotations. However, VCF files are large and data retrieval is relatively slow. Here we introduce a new WGS variant data format implemented in the R/Bioconductor package 'SeqArray' for storing variant calls in an array-oriented manner which provides the same capabilities as VCF, but with multiple high compression options and data access using high-performance parallel computing. Benchmarks using 1000 Genomes Phase 3 data show file sizes are 14.0 Gb (VCF), 12.3 Gb (BCF, binary VCF), 3.5 Gb (BGT) and 2.6 Gb (SeqArray) respectively. Reading genotypes in the SeqArray package is two to three times faster compared with the htslib C library using BCF files. For the allele frequency calculation, the implementation in the SeqArray package is over 5 times faster than PLINK v1.9 with VCF and BCF files, and over 16 times faster than vcftools. When used in conjunction with R/Bioconductor packages, the SeqArray package provides users a flexible, feature-rich, high-performance programming environment for analysis of WGS variant data. Availability: http://www.bioconductor.org/packages/SeqArray. Contact: zhengx@u.washington.edu. Supplementary data are available at Bioinformatics online.

  • Research Article
  • 10.1200/jco.2020.38.15_suppl.e14072
Evaluation of a regularly updated knowledge base for curation of somatic mutations detected in whole exomes of melanoma and lung, colorectal, and breast cancers.
  • May 20, 2020
  • Journal of Clinical Oncology
  • Stephanie J Yaung + 5 more

Abstract e14072. Background: Evolving medical guidelines and complex multi-variant data from next-generation sequencing (NGS) testing of cancer samples make routine clinical interpretation of somatic variants challenging. We assessed the ability of NAVIFY(R) Mutation Profiler*, a CE-IVD somatic variant interpretation tool, to yield accurate time- and geography-specific clinical content on 2511 samples from The Cancer Genome Atlas (TCGA) across six solid tumor types. Methods: Whole exomes from lung adenocarcinoma (n = 469), lung squamous cell carcinoma (n = 325), colon adenocarcinoma (n = 368), rectum adenocarcinoma (n = 149), breast invasive carcinoma (n = 806), and skin cutaneous melanoma (n = 394) cases were analyzed. We utilized TCGA data from the Multi-Center Mutation Calling in Multiple Cancers (MC3) project to obtain consensus calling results of single nucleotide variants and indels. The open-access Mutation Annotation Format (MAF) file (v0.2.8) that stores variant calls was lifted to human reference genome GRCh38 and converted to individual Variant Call Format (VCF) files per case. VCF files were uploaded to NAVIFY Mutation Profiler to interpret actionable mutations according to a highly curated and up-to-date knowledge base (Roche Content v2.13.0 released December 6, 2019). We further assessed the accuracy of interpreting co-occurrences of actionable mutations. Results: Over 1.24 million somatic mutations across 20,590 genes were assessed with NAVIFY Mutation Profiler, which reported tier classifications of variants based on consensus recommendations from AMP, ASCO, CAP, and ACMG. 86% of cases had variants of strong (Tier I-A or I-B) or potential (Tier II-C or II-D) clinical significance; 56% of these cases had Tier I classifications, supported by robust clinical evidence. Potentially actionable variant-variant interactions were found in 14% of cases. The tool also identified appropriate tier classifications by geographic region in accordance with local medical guidelines. Conclusions: To benchmark against other tools, we utilized available exome data from TCGA MC3 to assess NAVIFY Mutation Profiler. While this study likely underestimates the fraction of cases with actionable mutations, given that copy number alterations or rearrangements are also present in TCGA samples, we found a higher yield of potentially actionable annotation than other published methods. * This product has not been evaluated by the Food and Drug Administration and is not commercially available in the United States.

  • Research Article
  • Citations: 20
  • 10.1093/bioinformatics/btw748
Improved VCF normalization for accurate VCF comparison.
  • Dec 30, 2016
  • Bioinformatics
  • Arash Bayat + 3 more

The Variant Call Format (VCF) is widely used to store data about genetic variation. Variant calling workflows detect potential variants in large numbers of short sequence reads generated by DNA sequencing and report them in VCF format. To evaluate the accuracy of variant callers, it is critical to correctly compare their output against a reference VCF file containing a gold standard set of variants. However, comparing VCF files is a complicated task as an individual genomic variant can be represented in several different ways and is therefore not necessarily reported in a unique way by different software. We introduce a VCF normalization method called Best Alignment Normalisation (BAN) that results in more accurate VCF file comparison. BAN applies all the variations in a VCF file to the reference genome to create a sample genome, and then recalls the variants by aligning this sample genome back with the reference genome. Since the purpose of BAN is to get an accurate result at the time of VCF comparison, we define a better normalization method as the one resulting in less disagreement between the outputs of different VCF comparators. Availability: The BAN Linux bash script along with required software are publicly available on https://sites.google.com/site/banadf16. Contact: A.Bayat@unsw.edu.au. Supplementary data are available at Bioinformatics online.

  • Research Article
  • Citations: 18
  • 10.1534/genetics.106.065433
Variance of the Parental Genome Contribution to Inbred Lines Derived From Biparental Crosses
  • May 1, 2007
  • Genetics
  • Matthias Frisch + 1 more

The expectation of the parental genome contribution to inbred lines derived from biparental crosses or backcrosses is well known, but no theoretical results exist for its variance. Our objective was to derive the variance of the parental genome contribution to inbred lines developed by the single-seed descent or double haploid method from biparental crosses or backcrosses. We derived formulas and tabulated results for the variance of the parental genome contribution depending on the chromosome lengths and the mating scheme used for inbred line development. A normal approximation of the probability distribution function of the parental genome contribution fitted well the exact distribution obtained from computer simulations. We determined upper and lower quantiles of the parental genome contribution for model genomes of sugar beet, maize, and wheat using normal approximations. These can be employed to detect essentially derived varieties in the context of plant variety protection. Furthermore, we outlined the application of our results to predict the response to selection. Our results on the variance of the parental genome contribution can assist breeders and geneticists in the design of experiments or breeding programs by assessing the variation around the mean parental genome contribution for alternative crossing schemes.

  • Research Article
  • Citations: 11
  • 10.1534/genetics.106.057273
Marker-Based Prediction of the Parental Genome Contribution to Inbred Lines Derived From Biparental Crosses
  • Oct 1, 2006
  • Genetics
  • Matthias Frisch + 1 more

Molecular markers can be employed to predict the parental genome contribution to inbred lines. The proportion alpha of alleles originating from parent P1 at markers polymorphic between the parental lines P1 and P2 is commonly used as a predictor for the genome contribution of parent P1 to an offspring line. Our objectives were to develop a new marker-based predictor xi for the parental genome contribution, which takes into account not only the alleles at marker loci but also their map distance, and to compare the prediction precision of xi with that of alternative methods. We derived formulas for xi for inbreds derived from biparental crosses (F1 and backcrosses) with the single-seed descent or double-haploid method and presented an extension xi* possessing statistical optimum properties. In a simulation study, alpha showed a systematic overestimation of large parental genome contribution that was not observed for xi. The mean squared prediction error of xi was at least 50% smaller than that of alpha for linkage maps with unequal distances between adjacent markers. A data set from a study on plant variety protection in maize was used to illustrate the application of xi. We conclude that xi provides substantially greater prediction precision than the commonly used predictor alpha in a broad range of applications in genetics and breeding.
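To make the contrast in this abstract concrete, the sketch below computes the simple predictor alpha (the proportion of polymorphic markers carrying the P1 allele) and, next to it, a naive length-weighted share in which each marker stands for half the map interval to each neighbour. The weighted variant is only an illustration of why unequal marker spacing matters; it is not the xi predictor derived in the paper, whose formulas are not given here. Function names, marker positions, and origin calls are all hypothetical.

```python
from typing import List


def alpha(origins: List[str]) -> float:
    """Proportion of polymorphic markers whose allele comes from parent P1."""
    return sum(1 for origin in origins if origin == "P1") / len(origins)


def length_weighted_share(positions_cm: List[float], origins: List[str]) -> float:
    """Illustrative only (NOT the paper's xi): weight each marker by half the
    map distance to its left and right neighbours, then take the P1 share.
    Assumes at least two markers, sorted by position on one chromosome."""
    n = len(positions_cm)
    weights = []
    for i in range(n):
        left = (positions_cm[i] - positions_cm[i - 1]) / 2 if i > 0 else 0.0
        right = (positions_cm[i + 1] - positions_cm[i]) / 2 if i < n - 1 else 0.0
        weights.append(left + right)
    p1_weight = sum(w for w, origin in zip(weights, origins) if origin == "P1")
    return p1_weight / sum(weights)


# Hypothetical inbred line: three tightly clustered P1 markers,
# followed by two widely spaced P2 markers (positions in cM).
positions = [0.0, 1.0, 2.0, 50.0, 100.0]
calls = ["P1", "P1", "P1", "P2", "P2"]

print(f"alpha          = {alpha(calls):.2f}")
print(f"length-weighted = {length_weighted_share(positions, calls):.2f}")
```

With these made-up inputs alpha is 0.60 while the length-weighted share is 0.26, because the three P1 markers cover only a small stretch of the map. This is the kind of gap between marker counts and map coverage that motivates a distance-aware predictor.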

  • Conference Article
  • Citations: 2
  • 10.1109/smartworld-uic-atc-scalcom-iop-sci.2019.00282
Graph Data Modelling for Genomic Variants
  • Aug 1, 2019
  • Sanna Aizad + 1 more

Genome variant analysis is performed on Variant Call Format (VCF) files. It can take days to process these files for genome analytics due to challenges such as loading the files for each user query and processing them to answer questions of interest. As data sizes grow, timely processing of this data is putting enormous pressure on the computational resources, leading to significant processing delays and may jeopardise the ultimate goal of bringing the genomic discoveries to masses. We believe this problem will not be solved until the underlying data structure to organise and process these files undergoes a transformation. To overcome this problem, we have proposed a graph based system to represent the data in VCF files. This allows the data to be loaded once in a graph model which is then subsequently queried and processed numerous times without any additional computational and data access penalties. This helps reduce data access time by giving a constant time access to any node and addresses performance and scalability challenges that have been a limiting factor for the mass scale adoption of genome analytics. It takes only 2ms to access any data node in our graph model and remains constant for any number of nodes.

  • Research Article
  • Citations: 7
  • 10.1093/bioinformatics/btab211
VCFShark: how to squeeze a VCF file.
  • Mar 31, 2021
  • Bioinformatics
  • Sebastian Deorowicz + 2 more

Variant Call Format (VCF) files with results of sequencing projects take a lot of space. We propose VCFShark, which is able to compress VCF files up to an order of magnitude better than the de facto standards (gzipped VCF and BCF). The advantage over competitors is the greatest when compressing VCF files containing large amounts of genotype data. Processing speeds of up to 100 MB/s and main memory requirements below 30 GB make the tool usable on typical workstations even for large datasets. Availability: https://github.com/refresh-bio/vcfshark. Supplementary data are available at Bioinformatics online.

  • Research Article
  • Citations: 5
  • 10.1093/biomethods/bpac012
Prediction of risk-associated genes and high-risk liver cancer patients from their mutation profile: benchmarking of mutation calling techniques
  • Jan 10, 2022
  • Biology Methods & Protocols
  • Sumeet Patiyal + 2 more

Identification of somatic mutations with high precision is one of the major challenges in the prediction of high-risk liver cancer patients. In the past, a number of mutation-calling techniques have been developed, including MuTect2, MuSE, Varscan2, and SomaticSniper. In this study, an attempt has been made to benchmark the potential of these techniques in predicting the prognostic biomarkers for liver cancer. Initially, we extracted somatic mutations in liver cancer patients using Variant Call Format (VCF) and Mutation Annotation Format (MAF) files from the cancer genome atlas. In terms of size, the MAF files are 42 times smaller than VCF files and contain only high-quality somatic mutations. Furthermore, machine learning-based models have been developed for predicting high-risk cancer patients using mutations obtained from different techniques. The performance of different techniques and data files has been compared based on their potential to discriminate high- and low-risk liver cancer patients. Based on correlation analysis, we selected 80 genes having significant negative correlation with the overall survival of liver cancer patients. The univariate survival analysis revealed the prognostic role of highly mutated genes. Single gene-based analysis showed that the MuTect2 technique-based MAF file achieved a maximum hazard ratio (HRLAMC3) of 9.25 with a P-value of 1.78E-06. Further, we developed various prediction models using risk-associated top-10 genes for each technique. Our results indicate that MuTect2 technique-based VCF files outperform all other methods with a maximum Area Under the Receiver-Operating Characteristic curve of 0.765 and HR = 4.50 (P-value = 3.83E-15). Overall, the VCF file generated using the MuTect2 technique performs better than the other mutation-calling techniques for the prediction of high-risk liver cancer patients. We hope that our findings will provide a useful and comprehensive comparison of various mutation-calling techniques for the prognostic analysis of cancer patients. In order to serve the scientific community, we have provided a Python-based pipeline to develop the prediction models using mutation profiles (VCF/MAF) of cancer patients. It is available on GitHub at https://github.com/raghavagps/mutation_bench.

  • Research Article
  • Citations: 1
  • 10.7717/peerj.16086
DisVar: an R library for identifying variants associated with diseases using large-scale personal genetic information.
  • Sep 28, 2023
  • PeerJ
  • Khunanon Chanasongkhram + 2 more

Genetic variants may be a contributing factor in the development of diseases. Several genetic disease databases are used in medical research and diagnosis but the web applications used to search these databases for disease-associated variants have limitations. The application may not be able to search for large-scale genetic variants, the results of searches may be difficult to interpret and variants mapped from the latest reference genome (GRCh38/hg38) may not be supported. In this study, we developed a novel R library called "DisVar" to identify disease-associated genetic variants in large-scale individual genomic data. This R library is compatible with variants from the latest reference genome version. DisVar uses five databases of disease-associated variants. Over 100 million variants can be simultaneously searched for specific associated diseases. The package was evaluated using 24 Variant Call Format (VCF) files (215,054 to 11,346,899 sites) from the 1000 Genomes Project. Disease-associated variants were detected in 298,227 hits across all the VCF files, taking a total of 63.58 min to complete. The package was also tested on ClinVar's VCF file (2,120,558 variants), where 20,657 hits associated with diseases were identified with an estimated elapsed time of 45.98 s. DisVar can overcome the limitations of existing tools and is a fast and effective diagnostic and preventive tool that identifies disease-associated variations from large-scale genetic variants against the latest reference genome.

  • Discussion
  • Citations: 1
  • 10.5256/f1000research.120539.r125389
Recommendations for the formatting of Variant Call Format (VCF) files to make plant genotyping data FAIR
  • Mar 2, 2022
  • F1000Research
  • Boas Pucker + 1 more

In this opinion article, we discuss the formatting of files from (plant) genotyping studies, in particular the formatting of metadata in Variant Call Format (VCF) files. The flexibility of the VCF format specification facilitates its use as a generic interchange format across domains but can lead to inconsistency between files in the presentation of metadata. To enable fully autonomous machine actionable data flow, generic elements need to be further specified. We strongly support the merits of the FAIR principles and see the need to facilitate them also through technical implementation specifications. They form a basis for the proposed VCF extensions here. We have learned from the existing application of VCF that the definition of relevant metadata using controlled standards, vocabulary and the consistent use of cross-references via resolvable identifiers (machine-readable) are particularly necessary and propose their encoding. VCF is an established standard for the exchange and publication of genotyping data. Other data formats are also used to capture variant data (for example, the HapMap and the gVCF formats), but none currently have the reach of VCF. For the sake of simplicity, we will only discuss VCF and our recommendations for its use, but these recommendations could also be applied to gVCF. However, the part of the VCF standard relating to metadata (as opposed to the actual variant calls) defines a syntactic format but no vocabulary, unique identifier or recommended content. In practice, often only sparse descriptive metadata is included. When descriptive metadata is provided, proprietary metadata fields are frequently added that have not been agreed upon within the community which may limit long-term and comprehensive interoperability. To address this, we propose recommendations for supplying and encoding metadata, focusing on use cases from plant sciences. We expect there to be overlap, but also divergence, with the needs of other domains.
