ViReMa: a virus recombination mapper of next-generation sequencing data characterizes diverse recombinant viral nucleic acids.

Genetic recombination is a tremendous source of intrahost diversity in viruses and is critical for their ability to rapidly adapt to new environments or fitness challenges. While viruses are routinely characterized using high-throughput sequencing techniques, characterizing the genetic products of recombination in next-generation sequencing data remains a challenge. Viral recombination events can be highly diverse and variable in nature, including simple duplications and deletions, or more complex events such as copy/snap-back recombination, intervirus or intersegment recombination, and insertions of host nucleic acids. Due to the variable mechanisms driving virus recombination and the different selection pressures acting on the progeny, recombination junctions rarely adhere to simple canonical sites or sequences. Furthermore, numerous different events may be present simultaneously in a viral population, yielding a complex mutational landscape. We have previously developed an algorithm called ViReMa (Virus Recombination Mapper) that bootstraps the bowtie short-read aligner to capture and annotate a wide range of recombinant species found within virus populations. Here, we have updated ViReMa to provide an "error density" function designed to accurately detect recombination events in the longer reads now routinely generated by the Illumina platforms and provide output reports for multiple types of recombinant species using standardized formats. We demonstrate the utility and flexibility of ViReMa in different settings to report deletion events in simulated data from Flock House virus, copy-back RNA species in Sendai viruses, short duplication events in HIV, and virus-to-host recombination in an archaeal DNA virus.

Open Access
Honey bee (Apis mellifera) wing images: a tool for identification and conservation

The honey bee (Apis mellifera) is an ecologically and economically important species that provides pollination services to natural and agricultural systems. The biodiversity of the honey bee in parts of its native range is endangered by migratory beekeeping and commercial breeding. In consequence, some honey bee populations that are well adapted to the local environment are threatened with extinction. A crucial step for the protection of honey bee biodiversity is reliable differentiation between native and nonnative bees. One of the methods that can be used for this is the geometric morphometrics of wings. This method is fast, is low cost, and does not require expensive equipment. Therefore, it can be easily used by both scientists and beekeepers. However, wing geometric morphometrics is challenging due to the lack of reference data that can be reliably used for comparisons between different geographic regions. Here, we provide an unprecedented collection of 26,481 honey bee wing images representing 1,725 samples from 13 European countries. The wing images are accompanied by the coordinates of 19 landmarks and the geographic coordinates of the sampling locations. We present an R script that describes the workflow for analyzing the data and identifying an unknown sample. We compared the data with available reference samples for lineage and found general agreement with them. The extensive collection of wing images available on the Zenodo website can be used to identify the geographic origin of unknown samples and therefore assist in the monitoring and conservation of honey bee biodiversity in Europe.
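
The landmark-based comparison described above can be illustrated with a minimal sketch. This is not the authors' R workflow: it assumes 2D landmark coordinates, uses an ordinary pairwise Procrustes superimposition (centering, scaling to unit centroid size, and a closed-form optimal rotation), and assigns an unknown sample to the nearest reference shape. The function names `normalize`, `procrustes_distance`, and `identify` are hypothetical, and the 4-landmark shapes in the usage note are toy data rather than real 19-landmark wing configurations.

```python
from math import atan2, cos, sin, sqrt


def normalize(shape):
    """Center landmarks on their centroid and scale to unit centroid size."""
    n = len(shape)
    cx = sum(x for x, _ in shape) / n
    cy = sum(y for _, y in shape) / n
    centered = [(x - cx, y - cy) for x, y in shape]
    size = sqrt(sum(x * x + y * y for x, y in centered))
    return [(x / size, y / size) for x, y in centered]


def procrustes_distance(shape_a, shape_b):
    """Distance between two landmark sets after removing position, size, rotation."""
    a, b = normalize(shape_a), normalize(shape_b)
    # Closed-form angle that best rotates b onto a in two dimensions.
    num = sum(bx * ay - by * ax for (ax, ay), (bx, by) in zip(a, b))
    den = sum(bx * ax + by * ay for (ax, ay), (bx, by) in zip(a, b))
    theta = atan2(num, den)
    c, s = cos(theta), sin(theta)
    rotated = [(c * bx - s * by, s * bx + c * by) for bx, by in b]
    return sqrt(sum((ax - rx) ** 2 + (ay - ry) ** 2
                    for (ax, ay), (rx, ry) in zip(a, rotated)))


def identify(unknown, references):
    """Return the label of the reference shape closest to the unknown sample.

    `references` is a hypothetical dict mapping a label to a landmark list.
    """
    return min(references, key=lambda k: procrustes_distance(unknown, references[k]))
```

For example, with toy references `{"square": [(0, 0), (1, 0), (1, 1), (0, 1)], "rectangle": [(0, 0), (2, 0), (2, 1), (0, 1)]}`, a square that has been rotated, enlarged, and shifted is still assigned to `"square"`, because superimposition removes those differences before shapes are compared. A real analysis would typically superimpose all samples jointly and use a statistical classifier rather than a single nearest reference.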

Open Access
The Australasian dingo archetype: de novo chromosome-length genome assembly, DNA methylome, and cranial morphology.

One difficulty in testing the hypothesis that the Australasian dingo is a functional intermediate between wild wolves and domesticated breed dogs is that there is no reference specimen. Here we link a high-quality de novo long-read chromosomal assembly with epigenetic footprints and morphology to describe the Alpine dingo female named Cooinda. It was critical to establish an Alpine dingo reference because this ecotype occurs throughout coastal eastern Australia where the first drawings and descriptions were completed. We generated a high-quality chromosome-level reference genome assembly (Canfam_ADS) using a combination of Pacific Bioscience, Oxford Nanopore, 10X Genomics, Bionano, and Hi-C technologies. Compared to the previously published Desert dingo assembly, there are large structural rearrangements on chromosomes 11, 16, 25, and 26. Phylogenetic analyses of chromosomal data from Cooinda the Alpine dingo and 9 previously published de novo canine assemblies show dingoes are monophyletic and basal to domestic dogs. Network analyses show that the mitochondrial DNA genome clusters within the southeastern lineage, as expected for an Alpine dingo. Comparison of regulatory regions identified 2 differentially methylated regions within glucagon receptor GCGR and histone deacetylase HDAC4 genes that are unmethylated in the Alpine dingo genome but hypermethylated in the Desert dingo. Morphologic data, comprising geometric morphometric assessment of cranial morphology, place dingo Cooinda within population-level variation for Alpine dingoes. Magnetic resonance imaging of brain tissue shows she had a larger cranial capacity than a similar-sized domestic dog. These combined data support the hypothesis that the dingo Cooinda fits the spectrum of genetic and morphologic characteristics typical of the Alpine ecotype. 
We propose that she be considered the archetype specimen for future research investigating the evolutionary history, morphology, physiology, and ecology of dingoes. The female has been taxidermically prepared and is now at the Australian Museum, Sydney.

Open Access
Characterization and simulation of metagenomic nanopore sequencing data with Meta-NanoSim

Nanopore sequencing is crucial to metagenomic studies as its kilobase-long reads can contribute to resolving genomic structural differences among microbes. However, sequencing platform-specific challenges, including high base-call error rate, nonuniform read lengths, and the presence of chimeric artifacts, necessitate specifically designed analytical algorithms. The use of simulated datasets with characteristics that are true to the sequencing platform under evaluation is a cost-effective way to assess the performance of bioinformatics tools with the ground truth in a controlled environment. Here, we present Meta-NanoSim, a fast and versatile utility that characterizes and simulates the unique properties of nanopore metagenomic reads. It improves upon state-of-the-art methods on microbial abundance estimation through a base-level quantification algorithm. Meta-NanoSim can simulate complex microbial communities composed of both linear and circular genomes and can stream reference genomes from online servers directly. Simulated datasets showed high congruence with experimental data in terms of read length, error profiles, and abundance levels. We demonstrate that Meta-NanoSim simulated data can facilitate the development of metagenomic algorithms and guide experimental design through a metagenome assembly benchmarking task. The Meta-NanoSim characterization module investigates read features, including chimeric information and abundance levels, while the simulation module simulates large and complex multisample microbial communities with different abundance profiles. All trained models and the software are freely accessible at GitHub: https://github.com/bcgsc/NanoSim.

Open Access
Resequencing of a Pekin duck breeding population provides insights into the genomic response to short-term artificial selection

Short-term, intense artificial selection drives fast phenotypic changes in domestic animals and leaves imprints on their genomes. However, the genetic basis of this selection response is poorly understood. To better address this, we employed the Pekin duck Z2 pure line, in which the breast muscle weight was increased nearly 3-fold after 10 generations of breeding. We de novo assembled a high-quality reference genome of a female Pekin duck of this line (GCA_003850225.1) and identified 8.60 million genetic variants in 119 individuals among 10 generations of the breeding population. We identified 53 selected regions between the first and tenth generations, and 93.8% of the identified variations were enriched in regulatory and noncoding regions. Integrating the selection signatures and genome-wide association approach, we found that 2 regions covering 0.36 Mb containing UTP25 and FBRSL1 were most likely to contribute to breast muscle weight improvement. The major allele frequencies of these 2 loci increased gradually with each generation following the same trend. Additionally, we found that a copy number variation region containing the entire EXOC4 gene could explain 1.9% of the variance in breast muscle weight, indicating that the nervous system may play a role in economic trait improvement. Our study not only provides insights into genomic dynamics under intense artificial selection but also provides resources for genomics-enabled improvements in duck breeding.

Open Access
A molecular phenotypic map of malignant pleural mesothelioma

Malignant pleural mesothelioma (MPM) is a rare, understudied cancer associated with exposure to asbestos. So far, MPM patients have benefited marginally from the genomics medicine revolution due to the limited size or breadth of existing molecular studies. In the context of the MESOMICS project, we have performed the most comprehensive molecular characterization of MPM to date, with the underlying dataset made of the largest whole-genome sequencing series yet reported, together with transcriptome sequencing and methylation arrays for 120 MPM patients. We first provide comprehensive quality controls for all samples, of both raw and processed data. Due to the difficulty in collecting specimens from such rare tumors, a part of the cohort does not include matched normal material. We provide a detailed analysis of data processing of these tumor-only samples, showing that all somatic alteration calls match very stringent criteria of precision and recall. Finally, integrating our data with previously published multiomic MPM datasets (n = 374 in total), we provide an extensive molecular phenotype map of MPM based on the multitask theory. The generated map can be interactively explored and interrogated on the UCSC TumorMap portal (https://tumormap.ucsc.edu/?p=RCG_MESOMICS/MPM_Archetypes). This new high-quality MPM multiomics dataset, together with the state-of-the-art bioinformatics and interactive visualization tools we provide, will support the development of precision medicine in MPM that is particularly challenging to implement in rare cancers due to limited molecular studies.

Open Access
An accessible infrastructure for artificial intelligence using a Docker-based JupyterLab in Galaxy

Artificial intelligence (AI) programs that train on large datasets require powerful compute infrastructure consisting of several CPU cores and GPUs. JupyterLab provides an excellent framework for developing AI programs, but it needs to be hosted on such an infrastructure to enable faster training of AI programs using parallel computing. An open-source, Docker-based, and GPU-enabled JupyterLab infrastructure is developed that runs on the public compute infrastructure of Galaxy Europe consisting of thousands of CPU cores, many GPUs, and several petabytes of storage to rapidly prototype and develop end-to-end AI projects. Using a JupyterLab notebook, long-running AI model training programs can also be executed remotely to create trained models, represented in open neural network exchange (ONNX) format, and other output datasets in Galaxy. Other features include Git integration for version control, the option of creating and executing pipelines of notebooks, and multiple dashboards and packages for monitoring compute resources and visualization, respectively. These features make JupyterLab in Galaxy Europe highly suitable for creating and managing AI projects. A recent scientific publication that predicts infected regions in COVID-19 computed tomography scan images is reproduced using various features of JupyterLab on Galaxy Europe. In addition, ColabFold, a faster implementation of AlphaFold2, is accessed in JupyterLab to predict the 3-dimensional structure of protein sequences. JupyterLab is accessible in 2 ways: as an interactive Galaxy tool or by running the underlying Docker container. In both ways, long-running training can be executed on Galaxy's compute infrastructure. Scripts to create the Docker container are available under MIT license at https://github.com/usegalaxy-eu/gpu-jupyterlab-docker.

Open Access
A workflow reproducibility scale for automatic validation of biological interpretation results

Reproducibility of data analysis workflows is a key issue in the field of bioinformatics. Recent computing technologies, such as virtualization, have made it possible to reproduce workflow execution with ease. However, the reproducibility of results is not well discussed; that is, there is no standard way to verify whether the biological interpretation of reproduced results is the same. Therefore, it remains a challenge to automatically evaluate the reproducibility of results. We propose a new metric, a reproducibility scale of workflow execution results, to evaluate the reproducibility of results. This metric is based on the idea of evaluating the reproducibility of results using biological feature values (e.g., number of reads, mapping rate, and variant frequency) representing their biological interpretation. We also implemented a prototype system that automatically evaluates the reproducibility of results using the proposed metric. To demonstrate our approach, we conducted an experiment using workflows used by researchers in real research projects and the use cases that are frequently encountered in the field of bioinformatics. Our approach enables automatic evaluation of the reproducibility of results using a fine-grained scale. By introducing our approach, it is possible to evolve from a binary view of whether the results are superficially identical or not to a more graduated view. We believe that our approach will contribute to more informed discussion on reproducibility in bioinformatics.
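
The move from a binary comparison to a graduated one can be illustrated with a minimal sketch. This is not the authors' prototype system or their actual scale: the function name `compare_features`, the level labels, and the 5% relative tolerance are assumptions chosen only for illustration. The feature names follow the examples given in the abstract.

```python
from math import isclose


def compare_features(original, reproduced, rel_tol=0.05):
    """Grade agreement between two runs from their biological feature values.

    Both arguments are dicts mapping a feature name (e.g. "reads",
    "mapping_rate") to a number. Labels and tolerance are hypothetical.
    """
    # The runs cannot be compared feature by feature if the feature sets differ.
    if original.keys() != reproduced.keys():
        return "different_features"
    # Strictest level: every feature value matches exactly.
    if all(original[k] == reproduced[k] for k in original):
        return "identical"
    # Intermediate level: values differ, but only within the relative tolerance.
    if all(isclose(original[k], reproduced[k], rel_tol=rel_tol) for k in original):
        return "within_tolerance"
    # Otherwise at least one feature changed enough to affect interpretation.
    return "divergent"
```

Under these assumptions, a rerun that reports 1,010 reads instead of 1,000 and a mapping rate of 0.94 instead of 0.95 is graded `"within_tolerance"` rather than simply "not identical", which is the kind of distinction a byte-level comparison of output files would miss.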

Open Access