- New
- Research Article
- 10.1093/gigascience/giaf154
- Dec 12, 2025
- GigaScience
- Qian Qin + 1 more
Structural variants (SVs) are genomic differences ≥50bp in length. They remain challenging to detect even with long sequence reads, and the sources of these difficulties are not well quantified. We identified 35.4 Mb of low-complexity regions (LCRs) in GRCh38. Although these regions cover only 1.2% of the genome, they contain 69.1% of confident SVs in sample HG002. Across long-read SV callers, 77.3-91.3% of erroneous SV calls occur within LCRs, with error rates increasing with LCR length. SVs are enriched and difficult to call in LCRs. Special care needs to be taken for calling and analyzing these variants.
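As a rough illustration of the enrichment these figures imply, the sketch below compares SV density inside and outside LCRs using only the two percentages quoted above. The density-ratio summary and the function name `fold_enrichment` are illustrative assumptions, not a metric reported by the authors.

```python
def fold_enrichment(region_fraction: float, event_fraction: float) -> float:
    """Ratio of event density inside a region to event density outside it.

    region_fraction: share of the genome covered by the region (between 0 and 1).
    event_fraction: share of events (here, confident SVs) inside the region (between 0 and 1).
    """
    if not (0 < region_fraction < 1 and 0 < event_fraction < 1):
        raise ValueError("both fractions must lie strictly between 0 and 1")
    density_inside = event_fraction / region_fraction
    density_outside = (1 - event_fraction) / (1 - region_fraction)
    return density_inside / density_outside


# Figures quoted in the abstract: LCRs cover 1.2% of GRCh38
# and contain 69.1% of confident SVs in HG002.
lcr_enrichment = fold_enrichment(region_fraction=0.012, event_fraction=0.691)
print(f"SV density in LCRs is roughly {lcr_enrichment:.0f}x the density elsewhere")
```

Under this simple model, the per-base SV density inside LCRs is on the order of 180 times the density in the rest of the genome.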
- New
- Research Article
- 10.1093/gigascience/giaf152
- Dec 12, 2025
- GigaScience
- Sierra A T Moxon + 34 more
Scientific research relies on well-structured, standardized data; however, much of it is stored in formats such as free-text lab notebooks, non-standardized spreadsheets, or data repositories. This lack of structure challenges interoperability, making data integration, validation, and reuse difficult. LinkML (Linked Data Modeling Language) is an open framework that simplifies the process of authoring, validating, and sharing data. LinkML can describe a range of data structures, from flat, list-based models to complex, interrelated, and normalized models that utilize polymorphism and compound inheritance. It offers an approachable syntax that is not tied to any one technical architecture and can be integrated seamlessly with many existing frameworks. The LinkML syntax provides a standard way to describe schemas, classes, and relationships, allowing modelers to build well-defined, stable, and optionally ontology-aligned data structures. Once defined, LinkML schemas may be imported into other LinkML schemas. These key features make LinkML an accessible platform for interdisciplinary collaboration and a reliable way to define and share data semantics. LinkML helps reduce heterogeneity, complexity, and the proliferation of single-use data models while simultaneously enabling compliance with FAIR data standards. LinkML has seen increasing adoption in various fields, including biology, chemistry, biomedicine, microbiome research, finance, electrical engineering, transportation, and commercial software development. In short, LinkML makes implicit models explicitly computable and allows data to be standardized at its origin. LinkML documentation and code are available at linkml.io.
- New
- Research Article
- 10.1093/gigascience/giaf151
- Dec 11, 2025
- GigaScience
- Megha Sharma + 11 more
Safflower (Carthamus tinctorius L.) is a drought-resilient oilseed crop. Besides producing edible oil rich in oleic and linoleic acid, it is also used in biofuels, cosmetics, colouring dyes, pharmaceuticals and nutraceuticals. Despite its significant economic uses, the availability of genetic and genomic resources in safflower is limited. We report an improved de novo genome assembly of safflower (Safflower_A2). A chromosome-level assembly of 1.15 Gb, with telomeres and centromeric repeats, was constructed using PacBio HiFi reads, optical maps, Illumina short reads, and Hi-C sequencing. Safflower_A2 shows better contiguity, completeness, and annotation quality than previous assemblies. The assembly was further validated with a single nucleotide polymorphism (SNP)-based linkage map. A genome-wide survey identified genes for comprehensive exploration of disease resistance in safflower. Using the de novo genome assembly as a reference, we used resequencing data from a global core collection of 123 accessions to carry out a SNP-based genome-wide association study, which identified significant associations, and haplotypes of agronomic value, for several traits, including seed oil content. Resequencing data were also used for a pan-genome analysis, which provided insights into genome diversity and identified an additional ∼11,000 genes, along with their functional enrichment, that will be useful for developing region-specific breeding lines. Our study provides insights into the genomic architecture of safflower by leveraging an improved genome assembly and annotation. Additionally, the high-density linkage map, marker-trait associations, and pan-genome developed in this study are valuable resources for breeding and crop improvement programs in the global research community.
- New
- Research Article
- 10.1093/gigascience/giaf150
- Dec 9, 2025
- GigaScience
- Lars Gruber + 3 more
Spatial 'omics techniques are indispensable for studying complex biological systems and for the discovery of spatial biomarkers. While several current matrix-assisted laser desorption/ionization (MALDI) mass spectrometry imaging (MSI) instruments are capable of localizing numerous metabolites at high spatial and spectral resolution, the majority of MSI data is acquired at the MS1 level only. Assigning molecular identities based on MS1 data presents significant analytical and computational challenges, as the inherent limitations of MS1 data preclude confident annotations beyond the sum formula level. To enable future advancements of computational lipid annotation tools, well-characterized benchmark (or ground truth) datasets that exceed the scope of synthetic data or data derived from mimetic tissue models are crucial. To this end, we provide two sulfatide-centered, biology-driven magnetic resonance MSI (MR-MSI) datasets at different mass resolving powers that characterize lipids in a mouse model of human metachromatic leukodystrophy. These data include an ultra-high-resolution (R ∼1,230,000) quantum cascade laser mid-infrared imaging-guided MR-MSI dataset that enables isotopic fine structure analysis and therefore substantially enhances the level of confidence. To highlight the usefulness of the data, we compared 118 manual sulfatide annotations with the number of decoy database-controlled sulfatide annotations performed in Metaspace (67 at FDR < 10%). Overall, our datasets can be used to benchmark annotation algorithms, validate spatial biomarker discovery pipelines, and serve as a reference for future studies that explore sulfatide metabolism and its spatial regulation.
- New
- Research Article
- 10.1093/gigascience/giaf148
- Dec 9, 2025
- GigaScience
- Wei Zhang + 4 more
High-throughput technologies now produce a wide array of omics data, from genomic and transcriptomic profiles to epigenomic and proteomic measurements. Integrating multiple omics layers measured on the same samples can reveal cross-layer molecular hubs that single-layer analyses miss. We present an unsupervised, multivariate random forest (MRF) framework with an inverse minimal depth (IMD) importance to prioritize shared biomarkers across omics. In each forest, one layer serves as a multivariate response and the other as predictors; IMD summarizes how early a predictor (or response MSRV) appears across trees, yielding interpretable, cross-layer feature rankings. We provide three IMD-based selection strategies and introduce an optional IMD power transform to enhance sensitivity to interaction signals. In extensive simulations spanning linear, nonlinear, and interaction regimes, our method matches SPLS/CCA under linear settings and outperforms them as nonlinearity increases, while adapted univariate ensemble learners (RF, GBM, XGBoost) underperform in the multivariate, unsupervised context. Applied to TCGA BRCA and COAD, MRF-IMD identifies genes, CpGs, and miRNAs enriched for cancer-relevant pathways and yields more robust survival stratification than linear integrators with matched model sizes. In a TCGA pan-cancer analysis, MRF-IMD features achieve higher ARI than alternatives and recover coherent tumor-type clusters; in ADNI, the integrative signature improves dementia-progression stratification over a published methylation risk score. Our scalable, interpretable MRF-IMD framework advances reliable multi-omics biomarker discovery when nonlinear, cross-layer dependencies matter.
- New
- Research Article
- 10.1093/gigascience/giaf149
- Dec 8, 2025
- GigaScience
- Stephen R Piccolo + 1 more
With the increasing complexity and quantity of experimental and observational data, life scientists rely on programming to automate analyses, enhance reproducibility, and facilitate collaboration. Scripting languages like Python are often favored for their simplicity and flexibility, enabling researchers to focus primarily on high-level tasks. Compiled languages such as C++ and Rust offer greater efficiency, making them preferable for intensive or repeated computations. In educational settings, instructors may wish to teach both types of languages and thus may wish to translate content from one programming language to another. In research contexts, researchers may wish to implement their ideas in one language before translating the code to another. However, translating between programming languages requires significant effort, prompting our interest in using large language models (LLMs) for semi-automated code translation. This study explores the use of an LLM (GPT-4) to translate 559 short-form programming exercises from Python into C++, Rust, Julia, and JavaScript. We used three prompting strategies (instructions only, code only, or both combined) and compared the translated code's output against the Python code's output. Translation success differed considerably by prompting strategy, and at least one of the strategies tested was effective for nearly every exercise. The highest overall success rate occurred for Rust (99.5%), followed by JavaScript (98.9%), C++ (97.9%), and Julia (95.0%). Our findings demonstrate that LLMs can effectively translate small-scale programming exercises between languages, reducing the need for manual rewriting. To support education and research, we have manually translated all exercises that were not translated successfully through automation, and we have made our translations freely available.
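The output-comparison step described above can be sketched as follows. This is a minimal illustration, not the authors' harness: the helper names `run_program` and `outputs_match`, the whitespace normalization, and the timeout value are all assumptions.

```python
import subprocess
import sys
from typing import List, Sequence


def run_program(command: Sequence[str], stdin_text: str = "", timeout: float = 10.0) -> str:
    """Run a command, feed it stdin_text, and return its standard output."""
    completed = subprocess.run(
        list(command),
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout


def _normalize(text: str) -> List[str]:
    """Split output into lines and drop trailing whitespace on each line."""
    return [line.rstrip() for line in text.strip().splitlines()]


def outputs_match(reference_cmd: Sequence[str], candidate_cmd: Sequence[str],
                  stdin_text: str = "") -> bool:
    """Return True when both programs succeed and print the same normalized text."""
    try:
        reference = run_program(reference_cmd, stdin_text)
        candidate = run_program(candidate_cmd, stdin_text)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return False
    return _normalize(reference) == _normalize(candidate)


if __name__ == "__main__":
    # Stand-ins for an original Python exercise and a translated version of it;
    # in practice the candidate would be a compiled C++ or Rust binary, for example.
    reference = [sys.executable, "-c", "print(sum(range(5)))"]
    candidate = [sys.executable, "-c", "print(0 + 1 + 2 + 3 + 4)"]
    print(outputs_match(reference, candidate))
```

Comparing standard output only works for deterministic exercises; anything involving randomness, timing, or floating-point formatting differences between languages would need a looser comparison.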
- New
- Research Article
- 10.1093/gigascience/giaf137
- Dec 5, 2025
- GigaScience
- Xiuyun Liu + 12 more
The world has witnessed a steady rise in neurological diseases, which represent a heterogeneous group of disorders characterized by complex pathogenesis involving disruptions at multiple molecular levels, including genomic, transcriptomic, proteomic, and metabolomic levels. These disorders, often caused by genetic mutations, metabolic imbalances, immune dysregulation, and environmental factors, pose significant challenges to global public health due to their high prevalence, mortality, and disability burden. The advent of high-throughput technologies, such as next-generation sequencing and mass spectrometry, has provided valuable insights into the underlying mechanisms of disease. In particular, the development of multi-omics and high-spatial-resolution omics technologies enables the integration of multiple levels of biology and the analysis of complex molecular networks and pathophysiological processes. This review provides a comprehensive analysis of the latest advancements in multi- and high-spatial-resolution omics, with a focus on their applications in precision diagnostics, biomarker discovery, and therapeutic target identification in brain diseases. The review also highlights the current challenges in clinical implementation and discusses future directions, with artificial intelligence anticipated to significantly enhance clinical translation and diagnostic accuracy.
- New
- Research Article
- 10.1093/gigascience/giaf147
- Dec 5, 2025
- GigaScience
- Mahnaz Mohammadi + 12 more
Whole slide imaging (WSI) enables the digitisation of entire histological slides at high resolution, allowing pathologists and researchers to analyse tissue samples digitally rather than through traditional microscopy. This technology has become increasingly valuable in pathology for research, education, and clinical diagnostics. Endometrial biopsy is very common, often being undertaken to exclude non-cancerous disease. This means that most cases do not contain cancer, and the challenge is to accurately and efficiently exclude serious pathology rather than simply make a diagnosis of malignancy. A well-curated, expert-annotated, endometrial whole slide dataset covering a spread of cancer and non-cancer diagnoses will support machine learning applications in automated diagnosis, facilitate research into the pathology of endometrial cancer, and serve as an educational resource for medical professionals. We introduce a newly constructed, large-scale dataset of endometrial biopsies, comprising 2,909 whole slide images in iSyntax format, each accompanied by a corresponding annotation file in JSON format. Each whole slide image is labelled with a primary class label representing its final diagnosis and a sub-category label providing further details within that diagnostic class. These class labels are critical for machine learning applications, as they enable the development of AI models capable of distinguishing between different types of endometrial abnormalities, improving automated classification, and guiding clinical decision-making. Constructing and curating a high-quality endometrial whole slide dataset requires significant effort to ensure accurate annotations, data integrity, and patient privacy protection. However, the availability of a well-annotated dataset with detailed class labels is crucial for advancing digital pathology. 
Such a resource can enhance diagnostic accuracy, support personalized treatment strategies, and ultimately improve outcomes for patients with endometrial cancer and other endometrial conditions.
- New
- Research Article
- 10.1093/gigascience/giaf121
- Dec 4, 2025
- GigaScience
- Venkataramana Kopalli + 4 more
Background: Pangenomes are crucial for understanding species-wide genetic diversity, delineating core and variable genes. This study compares 3 key pangenome graph assembly pipelines: Minigraph, PGGB, and Minigraph-Cactus, using publicly available Sorghum data. We introduce tailored metrics for comprehensive pangenome graph evaluation, including completeness, duplication levels, and fidelity of structural variants. Results: By assessing the tools on Sorghum datasets, we gauge their efficacy in handling diverse genomic features. The analysis provides detailed insights into the strengths and limitations of Minigraph, PGGB, and Minigraph-Cactus, aiding researchers in informed tool selection. The metrics developed contribute to standardizing pangenome graph assessments, enabling robust and objective tool comparisons. We further demonstrate the utility of the metrics by applying them to pangenome graphs of 3 crops: soybean, barley, and oilseed rape. Conclusions: This benchmarking study advances our understanding of pangenome assembly tools and establishes a foundation for standardized evaluation metrics. We plan to further use these insights to optimize tool selection for specific applications, such as genome-wide association studies, improving the accuracy of downstream analyses.
- New
- Research Article
- 10.1093/gigascience/giaf145
- Nov 29, 2025
- GigaScience
- Joanna Szablińska-Piernik + 2 more
The liverwort A. endiviifolia, a dioicous, simple thalloid species, is notable for its cryptic diversity, habitat adaptability, and genomic innovation, and it represents a clade that is sister to all other Jungermanniopsida. These features make A. endiviifolia an essential model for exploring speciation mechanisms and the evolution of genome structures within liverworts. We present the genome assembly of a haploid A. endiviifolia isolate with a total size of 2,914,960,273 bp and an N50 of 468,157,909 bp, demonstrating high completeness (99.2% BUSCO) and a high consensus quality (QV 47.6). The assembly consisted of nine chromosomes, which included 18 telomeres and nine centromeres (ranging from 1.9 to 5 Mbp in length). RNA-seq-based annotation identified 34,615 genes, predominantly protein-coding. The TEs comprised 12.16% LTR elements and 57 Helitrons. Among the retroelements, the Copia and Gypsy superfamilies comprised 8.94% and 2.95% of the genome, respectively. The Ty3/Gypsy superfamily was found to be significantly enriched in centromeric regions. The average GC content ranged from 38.8% to 39.6%, and gene density varied between 5.52 and 9.78 genes per 500 kbp. Synteny analysis with related liverwort species revealed complex chromosomal relationships, indicating extensive genome rearrangements among species. This study provides the first high-quality reference genome assembly of the haploid liverwort A. endiviifolia. The assembly and annotation offer valuable resources for investigating liverwort evolution, centromere biology, and genome expansion in simple thalloid liverworts.
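For readers unfamiliar with the N50 statistic quoted above, a minimal sketch of its usual definition follows. The function name `n50` and the toy lengths are illustrative assumptions; they are not the actual chromosome lengths of this assembly.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50: the largest length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    if not ordered or ordered[-1] <= 0:
        raise ValueError("sequence lengths must be a non-empty set of positive integers")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    # Not reachable for positive lengths: the running sum always reaches the total.
    raise AssertionError("running sum never reached half of the total")


# Toy sequence lengths (hypothetical, not taken from the assembly).
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))  # → 70 (80 + 70 = 150 covers at least half of 290)
```

An N50 of roughly 468 Mbp against a total of about 2.9 Gbp is consistent with a small number of chromosome-scale sequences carrying most of the assembly.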