PIRATE: A fast and scalable pangenomics toolbox for clustering diverged orthologues in bacteria.

Abstract

Background: Cataloguing the distribution of genes within natural bacterial populations is essential for understanding evolutionary processes and the genetic basis of adaptation. Advances in whole genome sequencing technologies have led to a vast expansion in the number of bacterial genomes deposited in public databases. There is a pressing need for software solutions that can cluster, catalogue and characterise genes, or other features, in increasingly large genomic datasets.

Results: Here we present a pangenomics toolbox, PIRATE (Pangenome Iterative Refinement and Threshold Evaluation), which identifies and classifies orthologous gene families in bacterial pangenomes over a wide range of sequence similarity thresholds. PIRATE builds upon recent scalable software developments to allow for the rapid interrogation of thousands of isolates. PIRATE clusters genes (or other annotated features) over a wide range of amino acid or nucleotide identity thresholds and uses the clustering information to rapidly identify paralogous gene families and putative fission/fusion events. Furthermore, PIRATE orders the pangenome using a directed graph, provides a measure of allelic variation, and estimates sequence divergence for each gene family.

Conclusions: We demonstrate that PIRATE scales linearly with both number of samples and computation resources, allowing for analysis of large genomic datasets, and compares favorably to other popular tools. PIRATE provides a robust framework for analysing bacterial pangenomes, from largely clonal to panmictic species.
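The idea of clustering genes over a range of identity thresholds can be illustrated with a toy sketch. This is not PIRATE's implementation: the identity measure (fraction of matching positions in equal-length, pre-aligned strings), the single-linkage grouping, the function names, and the toy sequences are all assumptions made for illustration. It only shows how one family found at a permissive threshold splits into finer groups as the threshold rises.

```python
from itertools import combinations


def identity(a, b):
    """Percent identity of two equal-length, pre-aligned sequences (toy measure)."""
    if len(a) != len(b):
        raise ValueError("toy identity expects equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def cluster(seqs, threshold):
    """Single-linkage clusters: two sequences are linked if identity >= threshold."""
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        if identity(seqs[a], seqs[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    # Sort clusters by their member names so the output order is stable.
    return sorted(groups.values(), key=lambda g: sorted(g))


def threshold_sweep(seqs, thresholds):
    """Cluster the same sequences at each threshold, lowest first."""
    return {t: cluster(seqs, t) for t in sorted(thresholds)}


# Hypothetical toy sequences, not real genes.
toy = {
    "g1": "MKTAYIAKQR",
    "g2": "MKTAYIAKQK",  # 90% identical to g1
    "g3": "MKTAYLSKQK",  # 80% identical to g2, 70% to g1
    "g4": "WWWWWWWWWW",  # unrelated to the others
}

if __name__ == "__main__":
    for t, clusters in threshold_sweep(toy, [50, 85, 95]).items():
        print(t, clusters)
```

In this toy data, g1, g2 and g3 form one family at 50% identity, only g1 and g2 stay together at 85%, and every sequence stands alone at 95%. Recording where a family splits across thresholds is the kind of information that can then be used to describe divergence within it.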

Similar Papers
  • Research Article
  • Citations: 2
  • 10.1186/s13321-024-00898-x
RAIChU: automating the visualisation of natural product biosynthesis
  • Sep 3, 2024
  • Journal of Cheminformatics
  • Barbara R Terlouw + 5 more

Natural products are molecules that fulfil a range of important ecological functions. Many natural products have been exploited for pharmaceutical and agricultural applications. In contrast to many other specialised metabolites, the products of modular nonribosomal peptide synthetase (NRPS) and polyketide synthase (PKS) systems can often (partially) be predicted from the DNA sequence of the biosynthetic gene clusters. This is because the biosynthetic pathways of NRPS and PKS systems adhere to consistent rulesets. These universal biosynthetic rules can be leveraged to generate biosynthetic models of biosynthetic pathways. While these principles have been largely deciphered, software that leverages these rules to automatically generate visualisations of biosynthetic models has not yet been developed. To enable high-quality automated visualisations of natural product biosynthetic pathways, we developed RAIChU (Reaction Analysis through Illustrating Chemical Units), which produces depictions of biosynthetic transformations of PKS, NRPS, and hybrid PKS/NRPS systems from predicted or experimentally verified module architectures and domain substrate specificities. RAIChU also boasts a library of functions to perform and visualise reactions and pathways whose specifics (e.g., regioselectivity, stereoselectivity) are still difficult to predict, including terpenes, ribosomally synthesised and posttranslationally modified peptides and alkaloids. Additionally, RAIChU includes 34 prevalent tailoring reactions to enable the visualisation of biosynthetic pathways of fully maturated natural products. RAIChU can be integrated into Python pipelines, allowing users to upload and edit results from antiSMASH, a widely used BGC detection and annotation tool, or to build biosynthetic PKS/NRPS systems from scratch. RAIChU’s cluster drawing correctness (100%) and drawing readability (97.66%) were validated on 5000 randomly generated PKS/NRPS systems, and on the MIBiG database. 
The automated visualisation of these pathways accelerates the generation of biosynthetic models, facilitates the analysis of large (meta-) genomic datasets and reduces human error. RAIChU is available at https://github.com/BTheDragonMaster/RAIChU and https://pypi.org/project/raichu.

Scientific contribution: RAIChU is the first software package capable of automating high-quality visualisations of natural product biosynthetic pathways. By leveraging universal biosynthetic rules, RAIChU enables the depiction of complex biosynthetic transformations for PKS, NRPS, ribosomally synthesised and posttranslationally modified peptide (RiPP), terpene and alkaloid systems, enhancing predictive and analytical capabilities. This innovation not only streamlines the creation of biosynthetic models, making the analysis of large genomic datasets more efficient and accurate, but also bridges a crucial gap in predicting and visualising the complexities of natural product biosynthesis.

  • Research Article
  • Citations: 14
  • 10.1098/rstb.2020.0503
Genetic basis of speciation and adaptation: from loci to causative mutations.
  • May 30, 2022
  • Philosophical Transactions of the Royal Society B
  • Jun Kitano + 3 more

Does evolution proceed in small steps or large leaps? How repeatable is evolution? How constrained is the evolutionary process? Answering these long-standing questions in evolutionary biology is indispensable for both understanding how extant biodiversity has evolved and predicting how organisms and ecosystems will respond to changing environments in the future. Understanding the genetic basis of phenotypic diversification and speciation in natural populations is key to properly answering these questions. The leap forward in genome sequencing technologies has made it increasingly easy not only to investigate the genetic architecture but also to identify the variant sites underlying adaptation and speciation in natural populations. Furthermore, recent advances in genome editing technologies are making it possible to investigate the functions of each candidate gene in organisms from natural populations. In this article, we discuss how these recent technological advances enable the analysis of causative genes and mutations and how such analysis can help answer long-standing evolutionary biology questions. This article is part of the theme issue ‘Genetic basis of adaptation and speciation: from loci to causative mutations’.

  • Research Article
  • Citations: 3
  • 10.1002/cnr2.1267
Steroid receptor-associated and regulated protein is a biomarker in predicting the clinical outcome and treatment response in malignancies.
  • Jul 24, 2020
  • Cancer Reports
  • Ali Naderi

Steroid receptor-associated and regulated protein (SRARP) has recently been identified as a novel tumor suppressor in malignancies of multiple tissue origins. SRARP is located on chromosome 1p36.13 and is widely inactivated by deletions and epigenetic silencing in malignancies. Therefore, additional studies are required to explore SRARP as a potential cancer biomarker. This study explores the application of SRARP as a novel biomarker in malignancies of multiple tissue origins using the analysis of large genomic datasets. A comprehensive genomic analysis of large cancer datasets was carried out to examine the association of SRARP expression and copy-number with molecular and clinical features in malignancies of multiple tissue origins. This study demonstrated that SRARP under-expression and copy-number loss are strongly associated with the loss of other tumor suppressors such as TP53 and NF1 mutations and oncogenic gains, including N-MYC amplification and ERG rearrangement, suggesting that SRARP inactivation is associated with wider genomic instability in malignancies. Importantly, SRARP under-expression and copy-number loss are strong predictors of poor clinical and/or pathological features in breast, colorectal, lung, prostate, gastric, endometrial, cervical, brain, ovarian, bladder, thyroid, and hepatocellular cancers as well as neuroblastoma, uveal melanoma, and acute myeloid leukemia with highly significant odds ratios. Finally, higher SRARP expression and copy-number predict a better response to several cancer drugs. This study suggests that the SRARP inactivation presents a robust biomarker in predicting molecular and clinicopathological features, and treatment response in malignancies.

  • Research Article
  • 10.17816/eid636869
Analysis of the genetic features of the structural organization of integrative conjugative elements of Vibrio cholerae strains of various origins
  • Nov 1, 2024
  • Epidemiology and Infectious Diseases
  • Alexey S Vodopyanov + 3 more

Background: Integrative conjugative elements (ICEs) play a significant role in the dissemination of antibiotic resistance genes among Vibrio cholerae strains. However, there are currently no standardized methods for ICE typing that allow for the analysis of large genomic datasets. Aim: To conduct a comparative analysis of ICE sequences in Vibrio cholerae strains of various origins and to develop an algorithm for their typing. Materials and methods: The study utilized whole-genome sequencing data from 120 toxigenic (ctxAB+tcpA+) V. cholerae O1 El Tor strains obtained using the MiSeq platform (Illumina, USA) and MinION platform (Oxford Nanopore, UK), as well as data from NCBI databases (1,886 genomes) and the European Nucleotide Archive (441 strains). The software for ICE detection and typing was developed in Java (version 11.0.13) and is available at: http://antiplague.ru/ice-genotyper/. Results: A comparative analysis of ICE elements in toxigenic V. cholerae strains was performed. An ICE typing algorithm based on gene composition was proposed. Analysis of the V. cholerae genome collection revealed three previously undescribed ICE elements, designated ICEVchRus1, ICEVchHai3, and ICEVchLaos. Conclusions: The study identified three previously undescribed ICE elements and mapped their distribution across Russia and other regions of the world. It was established that during the cholera outbreak in Dagestan in 1994, strains containing ICEVchBan11 and ICEVchBan9 were circulating simultaneously.

  • Research Article
  • Citations: 24
  • 10.1515/sagmb-2014-0082
CSI: a nonparametric Bayesian approach to network inference from multiple perturbed time series gene expression data.
  • Jan 1, 2015
  • Statistical Applications in Genetics and Molecular Biology
  • Christopher A Penfold + 4 more

Here we introduce the causal structure identification (CSI) package, a Gaussian process based approach to inferring gene regulatory networks (GRNs) from multiple time series data. The standard CSI approach infers a single GRN via joint learning from multiple time series datasets; the hierarchical approach (HCSI) infers a separate GRN for each dataset, albeit with the networks constrained to favor similar structures, allowing for the identification of context specific networks. The software is implemented in MATLAB and includes a graphical user interface (GUI) for user friendly inference. Finally the GUI can be connected to high performance computer clusters to facilitate analysis of large genomic datasets.

  • Dissertation
  • 10.17760/d20289813
Helping scientists see: supporting healthcare and bioinformatics through visual analytics
  • Jan 1, 2018
  • Solano-Román

Scientific research and discovery in the field of bioinformatics have seen a tremendous increase in recent years through the advent of low-cost genetic sequencing and better healthcare programs. At the same time, this situation poses important new challenges for the analysis of genetic data, as many current visualization tools were not originally designed to manage the large datasets that continue to become available on a regular basis. While active in many other domains such as finance and journalism, most data visualization designers have remained bystanders in the fields of healthcare and life sciences, and new visualization tools are steadily being developed without the expertise and knowledge of these design professionals. However, the visualization tools for the exploration of these data can no longer be developed only by bioinformaticians if we wish to realize the full potential of modern technologies and to extract actionable insights from new large datasets. In this thesis, I propose that designers can play an important role as mediators in transdisciplinary groups that come together to create user-centered digital products for non-linear visual analytics. This thesis explores the creation of visualization tools for the analysis of large genomic datasets, especially by proposing a redesign of Multiple Sequence Alignment visualizations, while at the same time presenting a replicable model for collaboration for the design of these tools in healthcare and bioinformatics.

  • Research Article
  • Citations: 2
  • 10.1007/978-1-4939-7463-4_11
Sequence-Based Synteny Analysis of Multiple Large Genomes.
  • Dec 26, 2017
  • Methods in molecular biology (Clifton, N.J.)
  • Daniel Doerr + 1 more

Current methods for synteny analysis provide only limited support to study large genomes at the sequence level. In this chapter, we describe a pipeline based on existing tools that, applied in a suitable fashion, enables synteny analysis of large genomic datasets. We give a hands-on description of each step of the pipeline using four avian genomes for data. We also provide integration scripts that simplify the conversion and setup of data between the different tools in the pipeline.

  • Research Article
  • Citations: 1
  • 10.1038/s41698-025-00918-5
Comparative analysis of RNA expression identifies effective targeted drug in myoepithelial carcinoma
  • May 17, 2025
  • npj Precision Oncology
  • Yvonne A Vasquez + 15 more

Myoepithelial carcinoma is an ultra-rare pediatric solid tumor with no targeted treatments. Clinical implementation of tumor RNA sequencing (RNA-Seq) for identifying therapeutic targets is underexplored in pediatric cancer. We previously published the Comparative Analysis of RNA Expression (CARE), a framework for incorporating RNA-Seq-derived gene expression into the clinic for difficult-to-treat pediatric cancers. Here, we discuss a 4-year-old male diagnosed with myoepithelial carcinoma who was treated at Stanford Medicine Children’s Health. A metastatic lung nodule from the patient underwent standard-of-care tumor DNA profiling and CARE analysis, wherein the patient’s tumor RNA-Seq profile was compared to over 11,000 uniformly analyzed tumor profiles from public data repositories. DNA profiling yielded no actionable mutations. CARE identified overexpression biomarkers and nominated a treatment that produced a durable clinical response. These findings underscore the utility of data sharing and concurrent analysis of large genomic datasets for clinical benefit, particularly for rare cancers with unknown biological drivers.

  • Research Article
  • 10.17116/labs20241304149
High-throughput sequencing: a powerful tool for pathogen detection and identification in clinical samples
  • Apr 22, 2024
  • Laboratory Service
  • D.A Grigoryan + 3 more

Recent advancements in high-throughput sequencing have significantly expanded the possibilities for diagnosing and surveilling infectious diseases. This article offers a comprehensive examination of key sequencing methodologies, emphasizing their unique characteristics and potential applications in clinical practice. Both metagenomic and targeted approaches are assessed in terms of their effectiveness, limitations, and feasibility across various diagnostic fields. We also discuss current challenges in standardization and the analysis of large genomic datasets, highlighting the urgent need for innovative bioinformatics solutions to streamline these processes. The technologies explored in this article are already transforming strategies for diagnosing infectious diseases, and their continued development may play a crucial role in the evolution of clinical diagnostic systems.

  • Research Article
  • Citations: 10
  • 10.1093/bioinformatics/btx080
VALORATE: fast and accurate log-rank test in balanced and unbalanced comparisons of survival curves and cancer genomics.
  • Feb 10, 2017
  • Bioinformatics (Oxford, England)
  • Victor Treviño + 1 more

The association of genomic alterations to outcomes in cancer is affected by a problem of unbalanced groups generated by the low frequency of alterations. For this, an R package (VALORATE) that estimates the null distribution and the P-value of the log-rank based on a recent reformulation is presented. For a given number of alterations that define the size of survival groups, the log-rank density is estimated by a weighted sum of conditional distributions depending on a co-occurrence term of mutations and events. The estimations are accurately accelerated by sampling across co-occurrences, allowing the analysis of large genomic datasets in a few minutes. In conclusion, the proposed VALORATE R package is a valuable tool for survival analysis. The R package is available in CRAN at https://cran.r-project.org and in http://bioinformatica.mty.itesm.mx/valorateR. Contact: vtrevino@itesm.mx. Supplementary data are available at Bioinformatics online.

  • Research Article
  • Citations: 19
  • 10.1371/journal.pone.0108490
Benchmarking Undedicated Cloud Computing Providers for Analysis of Genomic Datasets
  • Sep 23, 2014
  • PLoS ONE
  • Seyhan Yazar + 3 more

A major bottleneck in biological discovery is now emerging at the computational level. Cloud computing offers a dynamic means whereby small and medium-sized laboratories can rapidly adjust their computational capacity. We benchmarked two established cloud computing services, Amazon Web Services Elastic MapReduce (EMR) on Amazon EC2 instances and Google Compute Engine (GCE), using publicly available genomic datasets (E. coli CC102 strain and a Han Chinese male genome) and a standard bioinformatic pipeline on a Hadoop-based platform. Wall-clock time for complete assembly differed by 52.9% (95% CI: 27.5–78.2) for E. coli and 53.5% (95% CI: 34.4–72.6) for the human genome, with GCE being more efficient than EMR. The cost of running this experiment on EMR and GCE differed significantly, with EMR being 257.3% (95% CI: 211.5–303.1) and 173.9% (95% CI: 134.6–213.1) more expensive for the E. coli and human assemblies respectively. Thus, GCE was found to outperform EMR both in terms of cost and wall-clock time. Our findings confirm that cloud computing is an efficient and potentially cost-effective alternative for analysis of large genomic datasets. In addition to releasing our cost-effectiveness comparison, we present available ready-to-use scripts for establishing Hadoop instances with Ganglia monitoring on EC2 or GCE.

  • Research Article
  • Citations: 23
  • 10.1016/j.ygeno.2020.10.032
Defining the clinical genomic landscape for real-world precision oncology
  • Nov 1, 2020
  • Genomics
  • Philip A Beer + 3 more

Through the delivery of large international projects including ICGC and TCGA, knowledge of cancer genomics is reaching saturation point. Enabling this to improve patient outcomes now requires embedding comprehensive genomic profiling into routine oncology practice. Towards this goal, this study defined the biologically and clinically relevant genomic features of adult cancer through detailed curation and analysis of large genomic datasets, accumulated literature and biomarker-driven therapeutics in clinic and development. The characteristics and prevalence of these features were then interrogated in 2348 whole genome sequences, covering 21 solid tumour types, generated by the PCAWG project. This analysis highlights the predominant contribution of copy number alterations and identifies a critical role for disruptive structural variants in the inactivation of clinically important tumour suppressor genes, including PTEN and RB1, which are not currently captured by diagnostic assays. This study defines a set of essential genomic features for the characterisation of common adult cancers.

  • Research Article
  • Citations: 418
  • 10.1038/s41467-020-15816-6
Accurate estimation of cell composition in bulk expression through robust integration of single-cell information
  • Apr 24, 2020
  • Nature Communications
  • Brandon Jew + 9 more

We present Bisque, a tool for estimating cell type proportions in bulk expression. Bisque implements a regression-based approach that utilizes single-cell RNA-seq (scRNA-seq) or single-nucleus RNA-seq (snRNA-seq) data to generate a reference expression profile and learn gene-specific bulk expression transformations to robustly decompose RNA-seq data. These transformations significantly improve decomposition performance compared to existing methods when there is significant technical variation in the generation of the reference profile and observed bulk expression. Importantly, compared to existing methods, our approach is extremely efficient, making it suitable for the analysis of large genomic datasets that are becoming ubiquitous. When applied to subcutaneous adipose and dorsolateral prefrontal cortex expression datasets with both bulk RNA-seq and snRNA-seq data, Bisque replicates previously reported associations between cell type proportions and measured phenotypes across abundant and rare cell types. We further propose an additional mode of operation that merely requires a set of known marker genes.

  • Book Chapter
  • Citations: 18
  • 10.1016/b978-0-12-398323-7.00005-7
Chapter Five - Using Genome-Wide Expression Profiling to Define Gene Networks Relevant to the Study of Complex Traits: From RNA Integrity to Network Topology
  • Jan 1, 2012
  • International Review of Neurobiology
  • M.A O'Brien + 2 more


  • Research Article
  • Citations: 2
  • 10.1371/journal.pone.0286330
HGSuite HyperBrowser: A web-based toolkit for hierarchical metadata-informed analysis of genomic tracks
  • Jul 19, 2023
  • PLOS ONE
  • Sumana Kalyanasundaram + 8 more

Many high-throughput sequencing datasets can be represented as objects with coordinates along a reference genome. Currently, biological investigations often involve a large number of such datasets, for example representing different cell types or epigenetic factors. Drawing overall conclusions from a large collection of results for individual datasets may be challenging and time-consuming. Meaningful interpretation often requires the results to be aggregated according to metadata that represents biological characteristics of interest. In this light, we here propose the hierarchical Genomic Suite HyperBrowser (hGSuite), an open-source extension to the GSuite HyperBrowser platform, which aims to provide a means for extracting key results from an aggregated collection of high-throughput DNA sequencing data. The hGSuite utilizes a metadata-informed data cube to calculate various statistics across the multiple dimensions of the datasets. With this work, we show that the hGSuite and its associated data cube methodology offers a quick and accessible way for exploratory analysis of large genomic datasets. The web-based toolkit named hGsuite Hyperbrowser is available at https://hyperbrowser.uio.no/hgsuite under a GPLv3 license.
