Articles published on Source Datasets
- New
- Research Article
- 10.1016/j.jmb.2025.169532
- Nov 1, 2025
- Journal of molecular biology
- Gongwei Chen + 4 more
HGCJAMH: A Method for circRNA-Drug Sensitivity Prediction Based on Higher-Order Moment-Guided Model and Hypergraph Jumping Learning Mechanism.
- New
- Research Article
- 10.1016/j.ultrasmedbio.2025.09.011
- Nov 1, 2025
- Ultrasound in medicine & biology
- Nor Haqkiem + 7 more
BreAST-U²Net: A Twin-Stream U2Net with Attention-based Tumor Fusion for 2-D Tumor Segmentation in Automated Breast Ultrasound.
- New
- Research Article
- 10.1080/10618600.2025.2581761
- Oct 30, 2025
- Journal of Computational and Graphical Statistics
- Yichen Lou + 2 more
High-dimensional interval-censored failure time data occur in many areas, and many methods have been proposed for their regression analysis. However, these methods may fail or perform poorly when the available information is limited. To address this, we propose two transfer learning estimation procedures that can take into account multiple source datasets under the framework of semiparametric linear transformation models, which are commonly used and well known for their flexibility. The first is a data-driven source detection procedure that classifies the source datasets into positive and negative transfers and performs transfer learning estimation based on the combination of all positive transfers. The second is a model-averaging approach in which adaptive weights for the source datasets are determined by their relevance to the target task. The asymptotic properties of the resulting estimators, including consistency, are established. An extensive simulation study demonstrates the superior performance of the proposed methods in terms of estimation accuracy and predictive capability. Finally, the methods are applied to the breast cancer data that motivated this study.
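As a rough illustration of the model-averaging idea, the sketch below fits one regression model per source dataset and weights each by its relevance, here measured by its error on held-out target data; the synthetic data, the Ridge models, and the softmax weighting are illustrative assumptions, not the paper's semiparametric transformation-model estimators.

```python
# Illustrative sketch only: relevance-weighted averaging of source models,
# NOT the authors' semiparametric transformation-model procedure.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def make_dataset(n, coef, noise=0.5):
    X = rng.normal(size=(n, 10))
    return X, X @ coef + rng.normal(scale=noise, size=n)

true_coef = rng.normal(size=10)
# Three hypothetical source datasets; the last is only weakly related to the target.
sources = [make_dataset(200, true_coef + rng.normal(scale=s, size=10))
           for s in (0.0, 0.1, 2.0)]
X_tgt, y_tgt = make_dataset(40, true_coef)

# Fit one model per source and weight it by how well it predicts the target
# (softmax of negative target error), so more relevant sources contribute more.
models = [Ridge(alpha=1.0).fit(Xs, ys) for Xs, ys in sources]
losses = np.array([np.mean((m.predict(X_tgt) - y_tgt) ** 2) for m in models])
weights = np.exp(-losses)
weights /= weights.sum()

# Model-averaged prediction for new target points.
X_new = rng.normal(size=(5, 10))
y_hat = sum(w * m.predict(X_new) for w, m in zip(weights, models))
print("weights:", np.round(weights, 3))
```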
- New
- Research Article
- 10.1145/3773765
- Oct 29, 2025
- ACM Transactions on Multimedia Computing, Communications, and Applications
- Mingqiang Wei + 5 more
Searching by image is popular yet still challenging in e-commerce due to the extensive interference arising from i) data variations (e.g., background, pose, visual angle, brightness) of real-world captured images and ii) similar images in the query dataset. This paper studies the practically meaningful problem of beauty product retrieval (BPR) with neural networks. We broadly extract different types of image features and raise an intriguing question: whether these features are beneficial to i) suppress data variations of real-world captured images and ii) distinguish one image from others that look very similar but are intrinsically different beauty products, thereby enhancing BPR. To answer it, we present a novel variable-attention neural network that learns to combine multiple features of beauty product images (termed VM-Net). Considering that few training datasets for BPR are publicly available, we establish a new dataset with more than one million images classified into more than 20K categories to improve both the generalization and anti-interference abilities of VM-Net and other methods. We verify the performance of VM-Net and its competitors on the benchmark dataset Perfect-500K, where VM-Net shows clear improvements over the competitors in terms of MAP@7. The source code and dataset will be released upon publication.
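For reference, MAP@7 can be computed as follows; the ranked product IDs and relevance sets below are hypothetical toy values, not VM-Net outputs on Perfect-500K.

```python
# A minimal sketch of the MAP@7 retrieval metric used for evaluation;
# the ranking lists below are hypothetical, not VM-Net outputs.
import numpy as np

def average_precision_at_k(ranked_ids, relevant_ids, k=7):
    """Average precision over the top-k retrieved items for one query."""
    hits, score = 0, 0.0
    for i, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant_ids:
            hits += 1
            score += hits / i          # precision at this cut-off
    return score / min(len(relevant_ids), k) if relevant_ids else 0.0

def map_at_k(all_rankings, all_relevant, k=7):
    return float(np.mean([average_precision_at_k(r, rel, k)
                          for r, rel in zip(all_rankings, all_relevant)]))

# Two toy queries: retrieved product IDs and the ground-truth relevant IDs.
print(map_at_k([[3, 1, 7, 9, 2, 5, 8], [4, 8, 6, 1, 0, 2, 3]],
               [{1, 9}, {6}]))
```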
- New
- Research Article
- 10.1093/bioadv/vbaf263
- Oct 29, 2025
- Bioinformatics Advances
- Raziyeh Masumshah + 1 more
Integrating heterogeneous biological data is a central challenge in bioinformatics, especially when modeling complex relationships among entities such as drugs, diseases, and molecular features. Existing methods often rely on static or separate feature extraction processes, which may fail to capture interactions across diverse feature types and reduce predictive accuracy. To address these limitations, we propose PSO-FeatureFusion, a unified framework that combines particle swarm optimization with neural networks to jointly integrate and optimize features from multiple biological entities. By modeling pairwise feature interactions and learning their optimal contributions, the framework captures individual feature signals and their interdependencies in a task-agnostic and modular manner. We applied PSO-FeatureFusion to two bioinformatics tasks, drug-drug interaction and drug-disease association prediction, using multiple benchmark datasets. Across both tasks, the framework achieved strong performance across evaluation metrics, often outperforming or matching state-of-the-art baselines, including deep learning and graph-based models. The method also demonstrated robustness with limited hyperparameter tuning and flexibility across datasets with varying feature structures. PSO-FeatureFusion provides a scalable and practical solution for researchers working with high-dimensional biological data. Its adaptability and interpretability make it well-suited for applications in drug discovery, disease prediction, and other bioinformatics domains. The source code and datasets are available at https://github.com/raziyehmasumshah/PSO-FeatureFusion.
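The sketch below illustrates the general idea of using PSO to learn feature contributions for a downstream classifier; the synthetic data, logistic-regression fitness function, and PSO hyperparameters are assumptions for illustration and do not reproduce PSO-FeatureFusion.

```python
# Conceptual sketch of using particle swarm optimisation (PSO) to weight
# features before a downstream classifier; hypothetical data, not the
# authors' PSO-FeatureFusion pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=300) > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=1)

def fitness(w):
    """Validation error of a classifier trained on feature-weighted inputs."""
    clf = LogisticRegression(max_iter=200).fit(X_tr * w, y_tr)
    return 1.0 - clf.score(X_val * w, y_val)

n_particles, dim = 12, X.shape[1]
pos = rng.uniform(0, 1, size=(n_particles, dim))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), np.array([fitness(p) for p in pos])
gbest = pbest[pbest_val.argmin()].copy()

for _ in range(20):                       # a few PSO iterations
    r1, r2 = rng.uniform(size=pos.shape), rng.uniform(size=pos.shape)
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0, 1)
    vals = np.array([fitness(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()

print("best feature weights:", np.round(gbest, 2))
```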
- New
- Research Article
- 10.1007/s10664-025-10749-4
- Oct 29, 2025
- Empirical Software Engineering
- Radowanul Haque + 3 more
The growing prevalence of software vulnerabilities has increased the need for effective detection methods, particularly in cross-project settings where domain differences create significant challenges. Existing vulnerability detection models often struggle to generalise across projects due to variations in coding styles, feature distributions, and the absence of labelled target data. This paper presents ZSVulD, a zero-shot, cross-project vulnerability detection framework designed to operate without target-domain labels. ZSVulD uses domain-agnostic CodeBERT embeddings to capture both syntactic and semantic features of source code, enabling knowledge transfer between projects. The framework applies an iterative pseudo-labelling process in which a neural network and XGBoost classifier collaboratively refine predictions for the target domain. Feature alignment is incorporated as a diagnostic technique to assess and visualise distributional differences between source and target datasets. Experiments on the Devign and REVEAL datasets show that ZSVulD achieves higher recall, F1, and F2 scores compared to existing methods, with an emphasis on reducing false negatives. These findings indicate that ZSVulD can support automated vulnerability detection pipelines, contributing to more reliable security assessments across different software projects.
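A minimal sketch of collaborative pseudo-labelling between two classifiers on unlabelled target data is shown below; it uses synthetic features instead of CodeBERT embeddings and a scikit-learn gradient-boosting model as a stand-in for XGBoost, so it only gestures at the ZSVulD loop.

```python
# Simplified sketch of iterative pseudo-labelling between two classifiers on
# unlabelled target data; synthetic features rather than CodeBERT embeddings,
# and a scikit-learn gradient-boosted model standing in for XGBoost.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X_src = rng.normal(size=(400, 16))
y_src = (X_src[:, 0] > 0).astype(int)              # labelled source project
X_tgt = rng.normal(loc=0.3, size=(200, 16))        # unlabelled target project

X_train, y_train = X_src.copy(), y_src.copy()
for _ in range(3):                                 # pseudo-labelling rounds
    nn = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(X_train, y_train)
    gb = GradientBoostingClassifier().fit(X_train, y_train)
    p_nn = nn.predict_proba(X_tgt)[:, 1]
    p_gb = gb.predict_proba(X_tgt)[:, 1]
    # keep target samples where both models are confident and agree
    conf = (np.abs(p_nn - 0.5) > 0.4) & (np.abs(p_gb - 0.5) > 0.4) \
           & ((p_nn > 0.5) == (p_gb > 0.5))
    if not conf.any():
        break
    X_train = np.vstack([X_src, X_tgt[conf]])
    y_train = np.concatenate([y_src, (p_nn[conf] > 0.5).astype(int)])

print("pseudo-labelled target samples:", int(conf.sum()))
```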
- New
- Research Article
- 10.1186/s12911-025-03055-y
- Oct 29, 2025
- BMC medical informatics and decision making
- Jakub J Dylag + 2 more
In clinical research, there is a strong drive to leverage big data from population cohort studies and routine electronic healthcare records to design new interventions, improve health outcomes, and increase the efficiency of healthcare delivery. However, realising this potential requires substantial effort in harmonising source datasets and curating study data, which currently relies on costly, time-consuming, and labour-intensive methods. We explore and assess the use of natural language processing (NLP) and unsupervised machine learning (ML) to address the challenges of big data semantic harmonisation and curation. Our aim is to establish an efficient and robust technological foundation for the development of automated tools supporting data curation of large clinical datasets. We propose two AI-based pipelines for automated semantic harmonisation: a pipeline for semantics-aware search for domain-relevant variables and a pipeline for clustering of semantically similar variables. We evaluate pipeline performance using 94,037 textual variable descriptions from the English Longitudinal Study of Ageing (ELSA) database. We observe high accuracy of our semantic search pipeline, with an AUC of 0.899 (SD = 0.056). Our semantic clustering pipeline achieves a V-measure of 0.237 (SD = 0.157), which is on par with leading implementations in other relevant domains. Automation can significantly accelerate the process of dataset harmonisation: manual labelling was performed at a speed of 2.1 descriptions per minute, whereas our automated labelling increases the speed to 245 descriptions per minute. Our findings underscore the potential of AI technologies, such as NLP and unsupervised ML, in automating the harmonisation and curation of big data for clinical research. By establishing a robust technological foundation, we pave the way for the development of automated tools that streamline the process, enabling health data scientists to leverage big data more efficiently and effectively in their studies and accelerating insights from data for clinical benefit.
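As a toy illustration of how the clustering pipeline is evaluated, the snippet below clusters a handful of made-up variable descriptions and scores the result with the V-measure; TF-IDF plus k-means stand in for the paper's NLP embedding pipeline.

```python
# Toy illustration of clustering variable descriptions and scoring with the
# V-measure; TF-IDF + k-means stand in for the paper's NLP pipeline, and the
# descriptions and labels below are hypothetical.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import v_measure_score

descriptions = [
    "systolic blood pressure at baseline", "diastolic blood pressure reading",
    "self-reported weekly alcohol intake", "units of alcohol consumed per week",
    "age at last birthday", "participant age in years",
]
true_groups = [0, 0, 1, 1, 2, 2]          # hand-assigned semantic groups

X = TfidfVectorizer().fit_transform(descriptions)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("V-measure:", round(v_measure_score(true_groups, labels), 3))
```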
- New
- Research Article
- 10.1038/s41598-025-21384-w
- Oct 27, 2025
- Scientific Reports
- Dan Komosny
Machine learning-based phishing detection is crucial for preventing zero-day attacks. State-of-the-art phishing detection performs well on English webpages but is not adequately accurate for webpages in minor languages. This work significantly improves phishing detection in European countries where minor languages are predominantly spoken. The proposed language-based phishing detection outperforms state-of-the-art methods for webpages in minor languages, achieving 99% accuracy on local webpages in the following countries: the Czech Republic, Denmark, Estonia, Croatia, Hungary, Lithuania, Latvia, Poland, Romania, Serbia, Slovakia, and Slovenia. The main improvement lies in reducing the false positive rate, where local benign webpages are incorrectly identified as phishing. The proposed method reduces the false positive rate by up to a factor of 10 for webpages in minor languages. The results are statistically robust across different webpage sets scaled for real-world use. The Shapiro-Wilk test confirms a normal distribution of the results in each tested country, with p-values consistently above 0.05, and paired t-tests yield p-values consistently below 0.05, indicating statistically significant improvements in each country. The low variability in the results further demonstrates the robustness of the proposed method. To foster reproducibility, the source code, datasets, and raw results are made publicly available, including approximately two million local webpages from 16 countries. Reference data from English-speaking and major-language-speaking countries are also included. This work advances phishing detection for webpages in minor languages and contributes to more inclusive global cybersecurity.
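The statistical checks described above can be reproduced in outline with SciPy; the per-run accuracy values below are invented placeholders, not the paper's measurements.

```python
# Sketch of the statistical checks described above (Shapiro-Wilk normality test
# and a paired t-test between baseline and proposed accuracies); the accuracy
# values are made up for illustration.
import numpy as np
from scipy.stats import shapiro, ttest_rel

baseline = np.array([0.962, 0.958, 0.965, 0.960, 0.957, 0.963, 0.959, 0.961])
proposed = np.array([0.991, 0.989, 0.992, 0.990, 0.988, 0.993, 0.990, 0.991])

print("Shapiro-Wilk p (proposed):", shapiro(proposed).pvalue)   # > 0.05 suggests normality
print("paired t-test p:", ttest_rel(proposed, baseline).pvalue) # < 0.05 suggests a real improvement
```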
- New
- Research Article
- 10.1007/s11548-025-03536-5
- Oct 27, 2025
- International journal of computer assisted radiology and surgery
- Renzhe Tu + 6 more
Coronary artery disease (CAD) is a major global cause of morbidity and mortality, especially in patients with obstructive CAD. Precise segmentation of coronary arteries and atherosclerotic plaques is essential for effective treatment. However, no previous study has addressed the joint segmentation of these two structures within a unified framework, which motivates our work. We built a dataset, PCCTA120, consisting of 120 CCTA volumes, each annotated with manually delineated masks for coronary arteries and atherosclerotic plaques. We then present Mask SAM 3D, an innovative framework designed for joint segmentation of these two components. Accurately localizing plaques within coronary arteries is a complex task due to the intricate anatomy of the coronary arteries and the subtle differences in plaque appearance. To simplify this challenge, we recognized the need for a reliable prior in the form of a well-defined artery skeleton and propose to first generate a precise coronary artery mask with nnUNet. Subsequently, a novel plaque-aware adapter is developed to intensify semantic interactions and refine the accuracy of plaque localization by capitalizing on the prior information embedded within the generated coronary artery mask. Meanwhile, to enhance the model's discriminative ability for accurate joint segmentation, a prototype-guided prediction module that dynamically clusters embedded features into class-specific prototypes is introduced. Experiments conducted on our self-built dataset show that our method achieves Dice similarity coefficients of 84.5% for artery segmentation and 55.2% for plaque segmentation, outperforming current state-of-the-art methods. In summary, we release a new coronary artery and atherosclerotic plaque segmentation dataset, PCCTA120, to advance the cardiovascular research community, and our framework, Mask SAM 3D, improves the accuracy of both artery and plaque segmentation. The source code and dataset will be made publicly available.
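For context, the Dice similarity coefficient reported above can be computed as in the following sketch; the binary masks are random placeholders rather than PCCTA120 segmentations.

```python
# Minimal sketch of the Dice similarity coefficient used to report artery and
# plaque segmentation quality; the masks here are random placeholders.
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice similarity coefficient between two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

rng = np.random.default_rng(3)
pred_mask = rng.integers(0, 2, size=(64, 64, 64))
true_mask = rng.integers(0, 2, size=(64, 64, 64))
print("Dice:", round(float(dice(pred_mask, true_mask)), 3))
```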
- New
- Research Article
- 10.1371/journal.pcbi.1013606
- Oct 24, 2025
- PLOS Computational Biology
- Jeremy Charlier + 2 more
Transfer learning has emerged as a powerful tool for enhancing predictive accuracy in complex tasks, particularly in scenarios where data are limited or imbalanced. This study explores similarity-based pre-evaluation as a methodology for identifying optimal source datasets for transfer learning, addressing the dual challenge of efficient source-target dataset pairing and off-target prediction in CRISPR-Cas9; existing transfer learning applications in gene editing often lack a principled method for source dataset selection. We use cosine, Euclidean, and Manhattan distances to evaluate the similarity between the source and target datasets used in our transfer learning experiments. Four deep learning architectures, namely Multilayer Perceptron (MLP), Convolutional Neural Networks (CNNs), Feedforward Neural Networks (FNNs), and Recurrent Neural Networks (RNNs), and two traditional machine learning models, Logistic Regression (LR) and Random Forest (RF), were tested and compared in our simulations. The results suggest that similarity scores are reliable indicators for pre-selecting source datasets in CRISPR-Cas9 transfer learning experiments, with cosine distance proving to be a more effective dataset comparison metric than either Euclidean or Manhattan distance. An RNN-GRU, a 5-layer FNN, and two MLP variants provided the best overall prediction results in our simulations. By integrating similarity-based source pre-selection with machine learning outcomes, we propose a dual-layered framework that not only streamlines the transfer learning process but also significantly improves off-target prediction accuracy. The code and data used in this study are freely available at: https://github.com/dagrate/transferlearning_offtargets.
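A minimal sketch of the similarity-based pre-selection step might look like the following, where each dataset is summarised by a mean feature vector (an assumption made here for brevity) and candidate sources are ranked by cosine, Euclidean, and Manhattan distance to the target.

```python
# Sketch of ranking candidate source datasets by distance to a target dataset,
# as in the similarity-based pre-selection described above; the "datasets" are
# summarised by mean feature vectors and the numbers are synthetic.
import numpy as np
from scipy.spatial.distance import cityblock, cosine, euclidean

rng = np.random.default_rng(4)
target = rng.normal(size=(500, 20)).mean(axis=0)
sources = {f"source_{i}": rng.normal(loc=0.1 * i, size=(500, 20)).mean(axis=0)
           for i in range(4)}

for name, vec in sources.items():
    print(name,
          "cosine=%.3f" % cosine(target, vec),
          "euclidean=%.3f" % euclidean(target, vec),
          "manhattan=%.3f" % cityblock(target, vec))
# The source with the smallest cosine distance would be preferred for transfer.
```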
- New
- Research Article
- 10.1093/gigascience/giaf136
- Oct 24, 2025
- GigaScience
- Haimei Wen + 5 more
Predicting essential genes is important for understanding the minimal genetic requirements of organisms, identifying disease-associated genes, and discovering potential drug targets. Wet-lab experiments for identifying essential genes are time-consuming and labor-intensive. Although various machine learning methods have been developed for essential gene prediction, systematic testing with large collections of gene knockout data and rigorous benchmarking of efficient methods remain very limited to date. Furthermore, current graph-based approaches require learning the entire gene interaction network, leading to high computational costs, especially for large-scale networks. To address these issues, we propose EssSubgraph, an inductive representation learning method that integrates graph-structured network data with omics features for training graph neural networks. We used comprehensive lists of human essential genes distilled from the latest collection of knockout datasets for benchmarking. When applied to essential gene prediction with multiple types of biological networks, EssSubgraph achieved superior performance compared to existing graph-based and other models, and its performance is more stable than that of other methods with respect to network structure and gene feature perturbations. Because of its inductive nature, EssSubgraph also enables predicting gene functions on dynamic networks with unseen nodes, and it scales with network size. Finally, EssSubgraph shows better performance in cross-species essential gene prediction than other methods. Our results show that EssSubgraph effectively combines networks and omics data for accurate essential gene identification while maintaining computational efficiency. The source code and datasets used in this study are freely available at https://github.com/wenmm/EssSubgraph.
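The inductive aspect can be pictured with a GraphSAGE-style aggregation step, sketched below in plain NumPy; the toy graph, feature sizes, and random weights are illustrative and not EssSubgraph's actual architecture.

```python
# Conceptual sketch of inductive (GraphSAGE-style) neighbour aggregation:
# a node's embedding is computed from its own omics features and the mean of
# its neighbours' features, so unseen nodes can be embedded without retraining.
import numpy as np

rng = np.random.default_rng(9)
n_genes, n_feats = 6, 8
features = rng.normal(size=(n_genes, n_feats))          # per-gene omics features
adjacency = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1, 4], 4: [3, 5], 5: [4]}
W_self = rng.normal(size=(n_feats, 16))
W_neigh = rng.normal(size=(n_feats, 16))

def embed(node, feats, adj):
    neigh = np.mean(feats[adj[node]], axis=0) if adj[node] else np.zeros(n_feats)
    return np.maximum(feats[node] @ W_self + neigh @ W_neigh, 0)   # ReLU

print(embed(0, features, adjacency).shape)   # (16,)
```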
- New
- Research Article
- 10.1007/s00799-025-00433-9
- Oct 18, 2025
- International Journal on Digital Libraries
- Ali Abdari + 2 more
Every day, social media and content sharing platforms receive numerous training videos across a multitude of sectors, including home gardening, agriculture, and indoor farming. Recently, the Metaverse has offered the opportunity to create enclosed virtual spaces where a user can focus on and learn specialized skills (e.g., medical skills) in an easy, interactive, and engaging way. However, finding suitable educational spaces for specific themes remains a challenge. In light of these advancements, and to support agricultural education, in this work we introduce the AgriMus project, which aims to design agricultural-themed museums covering a broad range of topics. Users can explore and retrieve specific educational experiences through visual or textual queries. Beyond creating a dataset for the agricultural setting, we also propose a hierarchical method to model these virtual museums. Several experiments have been conducted to evaluate the effectiveness of the proposed approach. The full source code and dataset used in this study are available at: https://github.com/aliabdari/AgriMus/.
- Research Article
- 10.1080/01621459.2025.2555057
- Oct 16, 2025
- Journal of the American Statistical Association
- Seyoung Park + 3 more
In high-dimensional multiple response regression problems, the large dimensionality of the coefficient matrix poses a challenge to parameter estimation. To address this challenge, low-rank matrix estimation methods have been developed to facilitate parameter estimation in the high-dimensional regime, where the number of parameters increases with sample size. Despite these methodological advances, accurately predicting multiple responses with limited target data remains a difficult task. To gain statistical power, the use of diverse datasets from source domains has emerged as a promising approach. In this article, we focus on the problem of transfer learning in a high-dimensional multiple response regression framework, which aims to improve estimation accuracy by transferring knowledge from informative source datasets. To reduce potential performance degradation due to the transfer of knowledge from irrelevant sources, we propose a novel transfer learning procedure that includes forward selection of informative source sets. In particular, our forward source selection method is new relative to existing transfer learning frameworks, offering deeper theoretical insights and substantial methodological innovations. Theoretical results show that the proposed estimator achieves a faster convergence rate than the single-task penalized estimator that uses only target data. In addition, we develop an alternative transfer learning procedure based on non-convex penalization to ensure rank consistency. Through simulations and real data experiments, we provide empirical evidence for the effectiveness of the proposed method and its superiority over other methods. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
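The forward source selection idea can be sketched as a greedy loop that adds whichever source most reduces target validation error; the Ridge model below stands in for the paper's low-rank multiple-response estimator, and all data are synthetic.

```python
# Conceptual sketch of forward selection over candidate source datasets; NOT the
# paper's low-rank multiple-response procedure.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
beta = rng.normal(size=15)

def make(n, shift):
    X = rng.normal(size=(n, 15))
    return X, X @ (beta + shift) + rng.normal(scale=0.3, size=n)

X_tgt, y_tgt = make(60, 0.0)
X_val, y_val = make(60, 0.0)
sources = [make(300, s) for s in (0.0, 0.05, 1.0)]   # last source is uninformative

def val_mse(X, y):
    return np.mean((Ridge(alpha=1.0).fit(X, y).predict(X_val) - y_val) ** 2)

selected = []
best_err = val_mse(X_tgt, y_tgt)                     # target-only baseline
while len(selected) < len(sources):
    errs = {j: val_mse(np.vstack([X_tgt] + [sources[k][0] for k in selected + [j]]),
                       np.concatenate([y_tgt] + [sources[k][1] for k in selected + [j]]))
            for j in range(len(sources)) if j not in selected}
    j_best = min(errs, key=errs.get)
    if errs[j_best] >= best_err:
        break                                        # no remaining source helps
    selected.append(j_best)
    best_err = errs[j_best]

print("selected sources:", selected, "validation MSE:", round(best_err, 3))
```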
- Research Article
- 10.59934/jaiea.v5i1.1675
- Oct 15, 2025
- Journal of Artificial Intelligence and Engineering Applications (JAIEA)
- Robet + 2 more
Indonesian, as the country's official language, is crucial in both academic and professional settings; writing well and adhering to grammatical standards is therefore essential. However, many grammatical errors persist in various types of writing. The objective of this research is to design and develop a web-based application that can automatically identify grammatical issues in Indonesian using machine learning techniques, specifically the Support Vector Machine (SVM). The SVM algorithm was chosen for its high accuracy in text classification. An Indonesian dictionary was used as the source dataset. The application can serve as a learning tool in addition to helping users identify and correct grammatical errors in real time. With 100% accuracy, precision, and recall, and 0% classification error, the test results demonstrate excellent detection performance, showing how effectively the SVM-based system detects grammatical issues in Indonesian text.
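A skeleton of a TF-IDF plus SVM text classifier of the kind described is shown below; the toy sentences and labels are placeholders, not the application's dictionary-derived training data.

```python
# Skeleton of a TF-IDF + SVM text classifier of the kind described above; the
# toy sentences and labels are placeholders, not the paper's Indonesian corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

sentences = [
    "kalimat dengan tata bahasa yang benar",         # placeholder "correct" examples
    "struktur kalimat ini sudah sesuai kaidah",
    "kalimat ini yang tidak benar strukturnya di",   # placeholder "incorrect" examples
    "di kalimat salah ini urutan kata yang",
]
labels = ["correct", "correct", "incorrect", "incorrect"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(sentences, labels)
print(model.predict(["struktur kalimat yang benar dan jelas"]))
```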
- Research Article
- 10.1186/s13059-025-03819-9
- Oct 13, 2025
- Genome Biology
- Wei Su + 9 more
Background: Promoters, as essential cis-regulatory elements in prokaryotes, govern gene expression by mediating RNA polymerase binding through core motifs and long-range regulatory interactions, playing a pivotal role in cell metabolism and environmental adaptation. Hence, accurate identification of prokaryotic promoters is vital for understanding their biological functions. However, existing tools for predicting prokaryotic promoters focus mainly on individual model organisms, and their prediction accuracy needs further improvement. To address these gaps, we develop iPro-MP, a transformer-based prokaryotic promoter prediction framework that we systematically evaluate across 23 phylogenetically diverse species, including both model and non-model organisms. Results: iPro-MP utilizes a multi-head attention mechanism to capture textual information in DNA sequences and effectively learns the hidden patterns. Cross-species prediction demonstrates the necessity of constructing species-specific models. Through a series of experiments, iPro-MP shows outstanding performance, with an AUC exceeding 0.9 in 18 out of 23 species. Conclusions: Our approach, iPro-MP, outperforms other existing tools for predicting prokaryotic promoters, especially for non-model organisms. For the convenience of other researchers, the source code and datasets of iPro-MP are freely available at https://github.com/Jackie-Suv/iPro-MP.
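The multi-head attention at the core of such a model builds on scaled dot-product attention, sketched below in NumPy for a single head; the sequence length, embedding size, and random inputs are illustrative assumptions rather than iPro-MP details.

```python
# Conceptual numpy sketch of the scaled dot-product attention at the core of a
# transformer encoder like the one described above; inputs are random token
# embeddings, not iPro-MP's learned representations.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # similarity between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V

rng = np.random.default_rng(6)
seq_len, d_model = 81, 32                            # assumed promoter window / embedding size
X = rng.normal(size=(seq_len, d_model))              # stand-in token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)                                     # (81, 32)
```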
- Research Article
- 10.1016/j.dib.2025.112137
- Oct 9, 2025
- Data in Brief
- Reynold Osuna-González + 3 more
Structured RDF dataset for genomic feature extraction and detection of biotechnological microorganisms of the Burkholderia genus
- Research Article
- 10.3389/fgene.2025.1650244
- Oct 3, 2025
- Frontiers in Genetics
- Jiawei Wang + 4 more
Introduction: Fungal identification through ITS sequencing is pivotal for biodiversity and ecological studies, yet existing methods often face challenges with high-dimensional features and inconsistent taxonomy predictions. Method: We propose HFTC, a hierarchical fungal taxonomic classifier built upon a multi-level random forest (RF) architecture. Notably, HFTC incorporates a bidirectional k-mer strategy to capture contextual information from both sequence orientations. By leveraging Word2Vec embeddings, it reduces feature dimensionality from 4^k to only 200, significantly improving computational efficiency while preserving rich sequence context. Results: Experimental results demonstrate that HFTC outperforms Mothur, RDP, Sintax, QIIME2, and CNN-Duong, achieving a Matthews correlation coefficient (MCC) of 95.31% despite uneven class distributions. Its overall accuracy (ACC) reaches 95.25%. At the species level, it attains a hierarchical accuracy (HA) of 95.10%, surpassing the best-performing deep learning baseline, CNN-Duong, by 3.2%. Moreover, HFTC exhibits the smallest discrepancy between ACC and HA (1.60%), in contrast to CNN-Duong, which shows the largest gap (35.00%), highlighting HFTC's superior hierarchical consistency. Discussion: HFTC offers a scalable and accurate approach to fungal taxonomic classification. Its compact feature representation and hierarchical architecture make it particularly suitable for microbial diversity research. The source code and datasets are publicly accessible at https://github.com/wjjw0731/HFTC/tree/master.
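The bidirectional k-mer plus Word2Vec step might be approximated as below, where k-mers from a sequence and its reverse complement are embedded into 200 dimensions with gensim; the random sequences and hyperparameters are assumptions, not the HFTC training setup.

```python
# Illustrative sketch of embedding ITS-like sequences via k-mers from both
# strands and Word2Vec (gensim), shrinking the representation to 200 dimensions;
# the sequences are random and this is not the HFTC training pipeline.
import numpy as np
from gensim.models import Word2Vec

def kmers(seq, k=6):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

rng = np.random.default_rng(7)
seqs = ["".join(rng.choice(list("ACGT"), size=300)) for _ in range(50)]
corpus = [kmers(s) + kmers(revcomp(s)) for s in seqs]   # bidirectional k-mer "sentences"

w2v = Word2Vec(corpus, vector_size=200, window=5, min_count=1, epochs=5)
seq_vec = np.mean([w2v.wv[k] for k in corpus[0]], axis=0)  # one 200-d sequence embedding
print(seq_vec.shape)
```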
- Research Article
- 10.1108/ijsi-02-2025-0043
- Oct 2, 2025
- International Journal of Structural Integrity
- Feng Jia + 6 more
Purpose: To address the challenge of model training difficulties caused by the scarcity of labeled training samples in practical applications, this study fully leverages the combination of simulation and real data for fault diagnosis. Design/methodology/approach: A simulation-reality domain mixup adaptation method (SR-DMA) is proposed for cross-domain bearing fault diagnosis. First, a bearing fault simulation model in a non-stationary state is established to generate simulation data, which is used as the source dataset. Second, the domain mixup adaptation method is developed to enhance the performance of intelligent fault diagnosis by utilizing class-aware information. Findings: The effectiveness and practicality of SR-DMA are validated on two bearing cases. The results show that SR-DMA can fully adapt to the deep feature distribution of simulation and real data, improving the accuracy of bearing fault diagnosis compared to other methods. Originality/value: (1) A simulation-reality domain mixup adaptation method (SR-DMA) is proposed for cross-domain bearing fault diagnosis. (2) A bearing fault simulation model in a non-stationary state is established to generate simulation data. (3) The domain mixup adaptation method is developed to enhance the performance of intelligent fault diagnosis by utilizing class-aware information.
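The mixup idea behind domain mixing can be sketched in a few lines: interpolate simulated and real samples with Beta-distributed coefficients. The feature shapes and parameters below are illustrative assumptions, not SR-DMA's class-aware procedure.

```python
# Minimal sketch of the mixup idea behind domain mixing: interpolate simulated
# and real samples to bridge the two distributions. Synthetic vibration-like
# features; not the SR-DMA model itself.
import numpy as np

rng = np.random.default_rng(8)
X_sim = rng.normal(size=(128, 64))                 # simulated source samples
y_sim = rng.integers(0, 4, size=128)               # simulated fault labels
X_real = rng.normal(loc=0.5, size=(128, 64))       # unlabelled real-world samples

lam = rng.beta(0.2, 0.2, size=(128, 1))            # mixup coefficients ~ Beta(alpha, alpha)
X_mix = lam * X_sim + (1.0 - lam) * X_real         # domain-mixed training inputs
# In a class-aware variant, source labels and target pseudo-labels would be
# mixed with the same coefficients.
print(X_mix.shape, float(lam.mean()))
```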
- Research Article
- 10.1002/gdj3.70038
- Oct 1, 2025
- Geoscience Data Journal
- S R Smith + 3 more
Bulk turbulent heat and momentum fluxes are derived from individual marine reports from ships and moored buoys. The source dataset is the International Comprehensive Ocean–Atmosphere Data Set (ICOADS), specifically release 3.1.0 (1990–2014) and release 3.0.2 (2015–2020). Prior to flux calculation, the ICOADS data undergo extensive quality control to remove suspect observations. Fluxes are calculated using three bulk algorithms well known to the air-sea interaction community. The ships and moorings used to create the fluxes are globally distributed, with a higher concentration along primary shipping lanes and within the tropical oceans. A brief overview of each flux product is provided along with information on how to access the data from the National Science Foundation National Center for Atmospheric Research and via the MarineFlux ERDDAP service. Applications of the ICOADS MarineFlux potentially include validating fluxes from numerical models and satellite-based wind and flux products. The flux dataset could be used in developing new gridded analyses and has the potential to be used to assess variations in air-sea energy exchange between 1990 and 2020. All MarineFlux products are freely available for use and reuse, with no restrictions other than a request to cite the source.
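As a rough illustration of what a bulk flux algorithm computes, the sketch below applies constant-coefficient bulk formulas for wind stress and sensible heat flux; operational products use stability-dependent algorithms such as COARE, so the fixed coefficients here are simplifying assumptions.

```python
# Simplified constant-coefficient bulk formulas for momentum and sensible heat
# flux, to illustrate what a bulk flux algorithm computes; real products use
# stability-dependent algorithms (e.g. COARE), not these fixed coefficients.
RHO_AIR = 1.22            # air density, kg m^-3
CP_AIR = 1004.0           # specific heat of air, J kg^-1 K^-1
CD, CH = 1.2e-3, 1.1e-3   # assumed neutral drag / heat transfer coefficients

def momentum_flux(wind_speed):
    """Wind stress tau = rho * Cd * U^2  (N m^-2)."""
    return RHO_AIR * CD * wind_speed ** 2

def sensible_heat_flux(wind_speed, sst, air_temp):
    """H = rho * cp * Ch * U * (Ts - Ta)  (W m^-2)."""
    return RHO_AIR * CP_AIR * CH * wind_speed * (sst - air_temp)

# Example ship report: 8 m/s wind, 28 C sea surface, 26.5 C air temperature.
print(momentum_flux(8.0), sensible_heat_flux(8.0, 28.0, 26.5))
```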
- Research Article
- 10.1186/s12859-025-06254-6
- Oct 1, 2025
- BMC Bioinformatics
- Sih-Han Chen + 5 more
Background: Peptides have emerged as promising therapeutic agents for drug development against cancer, immune disorders, hypertension, and microbial infections. Peptide drugs have the advantages of high selectivity, low production cost, and fewer side effects compared to traditional small-molecule drugs. However, one main challenge that hinders the adoption of peptide therapeutics is that some peptides are prone to be hemolytic, disrupting erythrocyte membranes and decreasing the life span of red blood cells. A computational model for hemolytic peptide identification would be a valuable tool for peptide drug discovery. Results: In this study, we present HEPAD, a machine learning predictor that identifies hemolytic peptides based on adaptive feature engineering and diverse sequence descriptors. Sequence descriptors were applied for feature encoding, generating a feature vector of nearly 4000 numeric values for each peptide. Next, an adaptive feature engineering method was proposed to produce a customized feature subset for a given dataset. The four datasets considered in this study were associated with 250, 350, 90, and 130 selected features. Five machine learning methods of different rationales were employed to perform cross-validation and independent tests. HEPAD yields Matthews correlation coefficients (MCCs) of 0.973, 0.643, and 0.609, respectively, on three independent datasets. The improvements in MCC over existing approaches range from 1.9% to 13.3% across the three independent tests. Moreover, data visualization reveals that the customized feature subsets can effectively separate hemolytic peptides from random peptides. Conclusions: HEPAD offers efficient identification of potential hemolytic peptides, thereby expediting experimental procedures in drug discovery. The source code, datasets, and machine learning models are available at https://github.com/csh07/HEPAD.
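A simplified version of dataset-specific feature selection followed by MCC scoring could look like the following; univariate selection and a random forest are stand-ins for HEPAD's adaptive feature engineering and model ensemble, applied to synthetic data.

```python
# Sketch of picking a customised feature subset for a given dataset and scoring
# with Matthews correlation coefficient (MCC); univariate selection and a random
# forest stand in for HEPAD's adaptive feature engineering, on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

# ~4000 descriptor values per peptide, of which only a subset is informative.
X, y = make_classification(n_samples=600, n_features=4000, n_informative=40,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

selector = SelectKBest(f_classif, k=250).fit(X_tr, y_tr)   # dataset-specific subset
clf = RandomForestClassifier(random_state=0).fit(selector.transform(X_tr), y_tr)
pred = clf.predict(selector.transform(X_te))
print("MCC:", round(matthews_corrcoef(y_te, pred), 3))
```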