Articles published on Data curation
3394 Search results
- New
- Research Article
- 10.1016/j.ympev.2026.108574
- Jun 1, 2026
- Molecular phylogenetics and evolution
- Carlos G Schrago
Efficient identification of phylogenetically informative alignment sites via sparse learning.
- New
- Research Article
- 10.1016/j.sbi.2026.103257
- Jun 1, 2026
- Current opinion in structural biology
- Sukrit Singh + 3 more
More protein-ligand data are needed for AlphaFold-like models to enable drug discovery.
- New
- Research Article
- 10.1016/j.dib.2026.112721
- Jun 1, 2026
- Data in brief
- Indira R Guzman + 9 more
Introducing "ELLAS Survey Dataset", an open resource about factors that influence career interest and leadership in STEM in Bolivia, Brazil, and Peru.
- New
- Research Article
- 10.47852/bonviewjcce62027652
- May 20, 2026
- Journal of Computational and Cognitive Engineering
- Monir Hossain + 4 more
Deep learning is increasingly used to classify medicinal plants, driven by the need to preserve traditional knowledge and automate identification for practical uses. This review summarizes 30 recent studies (2021–June 2025) on applying deep learning, primarily using image data, to classify medicinal plants. It analyzes research distribution, dataset preparation, image preprocessing, augmentation, and deep learning architectures like convolutional neural networks, Vision Transformers, and hybrid models. Our analysis reveals a strong geographic focus, with 50% of the selected studies originating from India and Bangladesh. The focus is overwhelmingly on leaf imagery, with 29 out of the 30 studies relying on this approach. The field is also characterized by its dependence on existing data, as 56.6% of studies utilized public datasets and another 26.6% employed a hybrid of public and private data, with dataset sizes ranging from a minimum of 637 to a maximum of 13,500 images. Methodologically, transfer learning is the most common approach (36.7% of studies), with reported accuracy rates between 74% and 99.9%. Furthermore, we recognize significant limitations, such as the absence of standardized and diverse datasets, insufficient inclusion of uncommon or endangered species, and inadequate representation of whole-plant imaging. The research underscores the necessity for collaborative, multidisciplinary initiatives to develop centralized, high-quality, and geographically comprehensive datasets. We delineate prospective avenues, including multimodal feature integration, the development of real-world applications, and optimization for privacy-preserving frameworks such as federated learning. This study guides academics advancing deep learning for medicinal plant classification and biodiversity conservation.
Received: 13 September 2025 | Revised: 8 December 2025 | Accepted: 5 March 2026 Conflicts of Interest The authors declare that they have no conflicts of interest to this work. Data Availability Statement Data sharing is not applicable to this article as no new data were created or analyzed in this study. Author Contribution Statement Monir Hossain: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing – original draft, Writing – review & editing, Visualization, Supervision. Fahmid Al Farid: Validation, Formal analysis, Investigation, Resources, Writing – review & editing, Visualization. Momotaz Begum: Conceptualization, Methodology, Validation, Formal analysis, Investigation, Resources, Data curation, Writing – review & editing, Visualization. Jia Uddin: Conceptualization, Validation, Formal analysis, Investigation, Resources, Writing – review & editing, Visualization, Supervision, Project administration. Hezerul Bin Abdul Karim: Validation, Formal analysis, Investigation, Resources, Writing – review & editing, Visualization, Funding acquisition.
- New
- Research Article
- 10.1080/0361526x.2026.2669507
- May 16, 2026
- The Serials Librarian
- Surbhi Arora + 1 more
ABSTRACT This study investigates the Research Data Management (RDM) skillsets and competencies of library professionals within North Indian university libraries. Data were collected from professionals through a survey, with an 84% response rate, and analyzed using SPSS. The study reveals that a significant number of library professionals require further development in several key areas. These include awareness of new government initiatives (e.g. Shodh Chakra), proficiency in metadata creation (e.g. using Dublin Core standards), and advanced technical ICT skills (e.g. data storage infrastructure and architecture). Additionally, there is a need for enhanced subject-specific knowledge, familiarity with various research methods (e.g. data analysis and visualization), and competencies in data description, documentation, and curation. The study also highlights the importance of skills related to developing Data Management Plans (DMPs), staying current with RDM trends through refresher courses, utilizing originality checking tools, and understanding legal, policy, and advisory aspects (e.g. intellectual property, ethics, and licensing). This research underscores the critical need for ongoing professional development to equip library professionals with the necessary competencies to effectively support RDM practices in academic environments.
- New
- Research Article
- 10.1001/jamanetworkopen.2026.12759
- May 15, 2026
- JAMA Network Open
- Kirstin Faust + 31 more
The effects of contact precautions (ie, gowns and gloves) for individual patients colonized with gram-negative (GN) drug-resistant bacteria on sepsis risk in neonates requiring intensive care remain to be clarified. The objective was to evaluate the noninferiority of standard hand hygiene disinfection vs standard hygiene disinfection plus extended barrier precautions for infants colonized with third-generation cephalosporin-resistant GN bacteria (3GCR-GNB). This cluster-randomized clinical trial was conducted from 2020 to 2023 in 12 German tertiary care neonatal intensive units caring for neonates with high risk for infections with GNB for 24 months, with crossover after 12 months. Follow-up and data curation were completed December 31, 2024, and statistical analysis was finalized on July 31, 2025. The intervention was standard hand hygiene disinfection compared with current recommendations, ie, hygiene disinfection plus extended barrier precautions with gowns and gloves for routine care of infants colonized with 3GCR-GNB. The primary outcome was the rate of health care-associated GNB bloodstream infections (BSI) at infant level in all neonates requiring intensive care in the cluster, assuming 5% as noninferiority margin delta; secondary outcomes included transmission rates of 3GCR-GNB and rates of any infection. The primary analysis was based on an overall sample size of 12 sites with crossover at 12 months, making 24 clusters with 9731 neonates. During the standard hand hygiene disinfection periods, 22 of 4699 infants (0.5%) developed GNB BSIs at infant level, compared with 25 of 5032 infants (0.5%) cared for during the extended barrier precaution periods (risk difference [RD], -0.03%; 95% CI, -0.43% to 0.38%; noninferiority P < .001).
At least 1 nosocomial transmission with 3GCR-GNB was noted during 41 of 144 months in the intervention period and 54 of 144 months in the control period (RD, -9.03%; 95% CI, -27.79% to 9.74%), with involvement of 116 patients (2.5%) vs 149 patients (3.0%) (RD, -0.44%; 95% CI, -2.47% to 1.58%). The total rate of BSI was 2.1% in neonates during the intervention period vs 2.0% during the control period (RD, 0.12%; 95% CI, -1.39% to 1.64%). In this cluster-randomized clinical trial, standard hand hygiene disinfection for the care of infants colonized with 3GCR-GNB was noninferior to standard hygiene disinfection plus extended barrier precautions. German Clinical Trials Register identifier: DRKS00019103.
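The primary-outcome risk difference reported above can be reproduced from the stated counts. The sketch below is only illustrative: the function name is hypothetical, and the interval is a plain unadjusted Wald interval, so it ignores clustering and is not expected to match the reported 95% CI, which presumably comes from the trial's own cluster-aware analysis.

```python
import math


def risk_difference(events_a, n_a, events_b, n_b, z=1.96):
    """Return (rd, lower, upper) for risk A minus risk B.

    The interval is an unadjusted Wald interval (an assumption of this
    sketch); it does not account for the cluster-randomized design.
    """
    p_a = events_a / n_a
    p_b = events_b / n_b
    rd = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return rd, rd - z * se, rd + z * se


# Counts taken from the abstract: standard hand hygiene vs extended barrier.
rd, lower, upper = risk_difference(22, 4699, 25, 5032)

# Noninferiority margin delta of 5%, as stated in the abstract.
margin = 0.05
noninferior = upper < margin

print(f"RD = {rd * 100:.2f}%")
print(f"Upper bound below margin: {noninferior}")
```

The point estimate rounds to the reported -0.03%; noninferiority is read off by comparing the upper interval bound with the margin.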
- New
- Research Article
- 10.1371/journal.pcbi.1014236
- May 13, 2026
- PLOS Computational Biology
- Ji Lv + 2 more
In recent years, artificial intelligence (AI) has increasingly influenced daily life and scientific research. Traditionally, AI-related courses have targeted computer science majors, while systematic instructional opportunities for early-stage undergraduates from non-computing backgrounds remain limited. To bridge this gap, we developed an AI course that integrates project-based learning with large language models (LLMs). Specifically, we designed four progressive assignments based on our research project (i.e., drug–drug interaction network clustering analysis). The course does not require prior knowledge of pharmacology or programming. Instead, LLMs are used as assistive tools to support programming, data analysis, and result interpretation. Students engage in a complete workflow, including data curation, algorithm implementation, and critical evaluation of results. Preliminary feedback shows that this approach supports the development of problem-solving skills and increases student engagement. This study provides a framework for integrating LLMs into project-based learning. We believe that this teaching proposal will be valuable and inspiring for educators seeking to design or enrich similar courses.
- New
- Research Article
- 10.1080/14616688.2026.2671422
- May 12, 2026
- Tourism Geographies
- Jorge Costa + 3 more
Crowdsourced data, such as that from mobile fitness apps (MFAs), has the potential to transform tourism research. This opportunity is particularly valuable as tourism to protected areas (PAs) increases, making it more difficult to manage their environmental impacts. However, such data sources come with challenges. Therefore, their representativeness should be carefully evaluated, particularly as their use continues to grow. We use the Paiva Walkways (Portugal) to assess MFAs as a tourism proxy using Spearman’s rank correlation, to evaluate kernel density and fishnet methods for spatialising visitor movements, and to analyse the visitors’ spatiotemporal behaviour using GIS, kernel density estimation, and frequency distribution analysis. Our findings show that the digital records of MFAs correlate with the analogue ticket records; that both kernel density and fishnet methods are effective, but parameter selection affects the detail level and computing processing time; and that visitation to the Paiva Walkways exhibits temporal fluctuations, with higher numbers during summer, August, and on weekends. MFAs effectively represent visitors’ spatiotemporal behaviour, for example, by measuring peak visitation periods. However, MFA data do not increase proportionally with ticket data, suggesting that high-pressure visiting periods may be underrepresented in MFAs. Overall, MFAs provide detailed data that offer new opportunities for tourism geographies research. MFAs data can be used to quantify tourism in PAs, to understand how visitors use space and interact with the surrounding environment, and to analyse how these spatial patterns change over time. MFAs also excel at detecting unauthorised activities, but require careful data curation and a transparent explanation of the selection approach to achieve high-quality, consistent, and replicable results. 
MFA-based monitoring is an important proxy that expands the information available to stakeholders and park managers about tourism and leisure activities, thereby enabling effective management and promoting sustainable tourism.
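Spearman's rank correlation, used above to compare MFA records with ticket records, measures monotonic agreement rather than proportionality. The following is a minimal sketch with made-up monthly counts (the numbers and function names are hypothetical and are not data from the study); it shows how two series can be perfectly rank-correlated even when one does not grow proportionally with the other.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical monthly counts, for illustration only.
tickets = [1200, 3400, 8000, 15000, 9000, 2500]
mfa_tracks = [30, 55, 90, 110, 95, 50]

rho = spearman_rho(tickets, mfa_tracks)
ratios = [m / t for m, t in zip(mfa_tracks, tickets)]
```

Here the two series share the same ordering, so `rho` is 1 up to floating-point error, while `ratios` shrinks in the busiest months, the kind of underrepresentation of peak periods the abstract describes.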
- Research Article
- 10.64898/2026.05.06.722876
- May 9, 2026
- bioRxiv : the preprint server for biology
- Melanie Ganz + 19 more
Molecular neuroimaging with positron emission tomography (PET) and single-photon emission computed tomography (SPECT) enables quantification of specific molecular targets in the living brain. Despite its scientific impact, molecular neuroimaging research has historically faced challenges due to high costs, small sample sizes, laboratory-specific analysis pipelines, and limited large-scale data sharing. These factors have hindered reproducibility and the broader reuse of valuable PET datasets. The OpenNeuroPET initiative was established to address these barriers by developing standards, infrastructure, and open-source tools for organizing, sharing, and analyzing molecular neuroimaging data. Through collaborations across Europe and North America, OpenNeuroPET has supported the PET extension of the Brain Imaging Data Structure (PET-BIDS), providing a standardized framework for PET datasets and metadata. Building on PET-BIDS, tools such as PET2BIDS, ezBIDS, and BIDSCoin facilitate data conversion and curation. In parallel, OpenNeuro now hosts PET-BIDS datasets for open sharing, while complementary platforms such as PublicnEUro enable GDPR-compliant controlled access. Emerging open-source workflows and BIDS applications further support automated, reproducible PET preprocessing and quantitative analysis, promoting harmonized processing across centers. Together, these developments mark an important step toward an open molecular neuroimaging ecosystem in which datasets, software, and workflows can be transparently shared, reused, and scaled for collaborative research.
- Research Article
- 10.1016/j.apmr.2026.04.033
- May 8, 2026
- Archives of physical medicine and rehabilitation
- Amanda Rabinowitz + 2 more
Reining In Unbridled AI Enthusiasm: Protecting the Integrity of Rehabilitation Science & Clinical Care.
- Research Article
- 10.1021/acs.jctc.5c02081
- May 7, 2026
- Journal of chemical theory and computation
- Zongru Li + 10 more
Molecular property prediction integrates quantum chemistry, cheminformatics, and deep learning to connect molecular structure with physicochemical and biological behavior. This survey traces four complementary paradigms, including Quantum, Descriptor Machine Learning, Geometric Deep Learning, and Foundation Models, and outlines a unified taxonomy linking molecular representations, model architectures, and interdisciplinary applications. Benchmark analyses integrate evidence from both widely used data sets and data sets reflecting industry perspectives, encompassing quantum, physicochemical, physiological, and biophysical domains. The survey examines current standards in data curation, splitting strategies, and evaluation protocols, highlighting challenges including inconsistent stereochemistry, heterogeneous assay sources, and reproducibility limitations under random or poorly defined splits. These observations motivate the modernization of benchmark design toward more transparent, time- and scaffold-aware methodologies. We further propose three forward-looking directions: (i) physics-aware learning embedding quantum consistency, (ii) uncertainty-calibrated foundation models for trustworthy inference, and (iii) realistic multimodal benchmark ecosystems integrating computational and experimental data. Repository: https://github.com/Zongru-Li/Survey-and-Benchmarks-of-DL-for-Molecular-Property-Prediction-in-the-Foundation-Model-Era.
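One way to picture the scaffold-aware splitting mentioned above is a group split in which all molecules sharing a scaffold land on the same side. The sketch below is not taken from the survey or its repository; it assumes scaffold keys have already been computed elsewhere (for example with a cheminformatics toolkit), and the function name and the greedy largest-group-first rule are illustrative choices.

```python
from collections import defaultdict


def scaffold_split(scaffolds, train_fraction=0.8):
    """Split molecule indices so that no scaffold spans train and test.

    `scaffolds` holds one precomputed scaffold key per molecule (an
    assumption of this sketch). Groups are assigned greedily, largest
    first, to train while they still fit under the target fraction.
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    ordered = sorted(groups.values(), key=len, reverse=True)
    cutoff = train_fraction * len(scaffolds)

    train, test = [], []
    for members in ordered:
        if len(train) + len(members) <= cutoff:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


# Hypothetical scaffold keys for ten molecules.
keys = ["A", "A", "A", "B", "B", "C", "D", "A", "B", "E"]
train_idx, test_idx = scaffold_split(keys)
```

Unlike a random split, this keeps structurally related molecules from leaking between training and evaluation sets, at the cost of a train fraction that only approximates the target.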
- Research Article
- 10.1016/j.actatropica.2026.108132
- May 7, 2026
- Acta tropica
- Harun Kaya Kesik + 3 more
Worldwide hotspots and ecological drivers of canine Echinococcus granulosus sensu lato: Space-time scan statistics and Maxent modelling from a systematic evidence base.
- Research Article
- 10.47852/bonviewmedin62028877
- May 7, 2026
- Medinformatics
- Ferdousi Ahmed Sumona + 6 more
Inflammation is a complex biological response that contributes to the pathogenesis of many chronic diseases. Cyclooxygenase-2 (COX-2) plays a central role in inflammatory processes by catalyzing the synthesis of proinflammatory prostaglandins. Natural COX-2 inhibitors have gained increasing interest as potential alternatives to synthetic drugs due to their improved safety profiles. Amaranthus tricolor L., a leafy vegetable, is traditionally recognized for its anti-inflammatory and antioxidant properties. This study investigated the anti-inflammatory potential of major phytochemicals from A. tricolor using an integrated in silico approach involving molecular docking, ADMET profiling, and quantitative structure–activity relationship (QSAR) analysis to identify promising COX-2 inhibitors. A total of thirty-three phytocompounds reported in A. tricolor were identified and screened against the crystal structure of the COX-2 enzyme (PDB: 1CX2) using molecular docking. Diclofenac was employed as a reference drug. Docking was performed using PyRx, and binding interactions were visualized using BIOVIA Discovery Studio. The physicochemical, pharmacokinetic, and drug-likeness properties of the top-scoring ligands were evaluated using SwissADME, admetSAR, ChemDes, and pkCSM. Docking results revealed that myricetin-3-O-rutinoside exhibited the strongest binding affinity (−10.2 kcal/mol), exceeding that of diclofenac (−7.0 kcal/mol), followed by myricetin (−9.9 kcal/mol), quercetin (−9.5 kcal/mol), and a few others. These compounds formed stable interactions with active-site residues of COX-2. ADMET and QSAR analyses indicated favorable absorption, moderate bioavailability, and acceptable safety profiles for most top ligands, while toxicity prediction suggested low hepatotoxicity and mutagenicity risks. The findings highlight myricetin, quercetin derivatives, and several other phytocompounds from A. tricolor as promising COX-2 inhibitors with potential anti-inflammatory activity.
Received: 23 December 2025 | Revised: 30 March 2026 | Accepted: 20 April 2026 Conflicts of Interest The authors declare that they have no conflicts of interest to this work. Data Availability Statement All data generated or analyzed during this study are included in this published article. Author Contribution Statement Ferdousi Ahmed Sumona: Software, Investigation, Writing – original draft, Visualization. Sawda Binta Kamrul Oishi: Software, Formal analysis, Investigation, Writing – original draft, Visualization. Md. Rakibul Hossain: Formal analysis, Investigation, Writing – original draft, Visualization. Sumaiya Khatun: Formal analysis, Investigation, Writing – original draft, Visualization. Ayesha Islam Sadia: Formal analysis, Investigation, Writing – original draft, Visualization. Md. Sabbir Hossain: Formal analysis, Investigation, Writing – original draft, Visualization. Md. Abu Bakar Siddique Jami: Conceptualization, Methodology, Software, Resources, Data curation, Writing – original draft, Writing – review & editing, Visualization, Supervision, Project administration.
- Research Article
- 10.1177/20539517261438634
- May 4, 2026
- Big Data & Society
- Yu Sun
This study focuses on the grassroots data practices of an environmental non-government organization (ENGO), examining the constitutive role of data in environmental activism in China. Drawing on the concept of contentious publicness, it analyses the processual and relational dynamics of the ENGO-led data activism to tackle environmental pollution at the interrelated material, spatial and temporal levels. Through participatory observations and in-depth interviews, it first examines the material agency of data infrastructure in enabling environmental participation. Then, it explores the spatiality and temporality of activists’ action repertoires. The findings demonstrate that the activist engagement with data follows a non-confrontational approach, evolving in the compromised middle ground between embeddedness and marginalization. Environmental data serves as a relational mediator in activists’ continuous process of making tactical responses to disrupt the status quo while not completely denying or subverting existing power relations. Moving beyond the contestational view of data, the study applies a non-binary and processual account of data activism and contributes to a deeper understanding of the relational configurations of power in data politics. It sheds light on the institutional, technological, and social imaginaries of environmentalism that shape the building of environmental data infrastructure and cultivate new forms of environmental action in the specific sociopolitical context of China. Moreover, the situated analysis of data activism contributes to diversifying the Western-centric understanding of the transformative potentials of data and calls for more scholarly attention to the relational dynamics of data politics in the Global South.
- Research Article
- 10.2196/96894
- May 4, 2026
- JMIR mental health
- Hina Tahseen
Research on artificial intelligence (AI) and mental health has focused largely on harms at deployment, including chatbot safety, sycophancy, and AI-associated delusions. Less attention has been paid to a prior question: whether the human-generated text and preference judgments that shape large language models (LLMs) are themselves clinically reliable, particularly when self-report may be distorted. This Viewpoint aims to develop the clinical psychiatric construct of collusion, the uncritical acceptance of an unreliable account, as an analytic lens for AI training and deployment, and to argue that the clinical reliability of training and preference data should be treated as an explicit trustworthy-AI criterion in mental-health-relevant systems. A conceptual synthesis of psychiatry, clinical psychology, and AI safety literature was undertaken. The analysis distinguishes three pipeline layers: pre-training corpora, preference data and post-training methods, and deployment-time interaction. It maps the clinical construct of collusion against adjacent technical concepts, including sycophancy, reward overoptimization, grounding, refusal training, red-teaming, and live monitoring. The synthesis suggests that collusion-like dynamics are least applicable at the pre-training layer and most applicable at the preference-data and deployment layers, where unassessed user or labeler input can be reinforced without corroboration. Existing mitigations, including data curation, Constitutional AI, reward-model evaluation, grounded generation, refusal training, red-teaming, and postdeployment monitoring, address parts of this problem. However, these approaches are not yet organized around a clinically informed account of when self-report is unreliable. The central novelty is therefore not a generic claim about bias, but the proposal that clinical self-report reliability should be assessed as a distinct data-quality and governance dimension. 
Trustworthy-AI frameworks for mental-health-relevant applications should incorporate clinical expertise in self-report reliability into preference-data design, red-teaming, and postmarket surveillance. Adding clinical reliability of training and preference data as an explicit criterion could complement existing technical safeguards while leaving empirical evaluation of clinician involvement as an open research agenda.
- Research Article
- 10.3390/diagnostics16091345
- Apr 29, 2026
- Diagnostics
- Hasan Anıl Kurt + 2 more
Background/Objectives: Prostate adenocarcinoma exhibits substantial inter-patient heterogeneity, limiting the accuracy of current prognostic tools. Prostate-specific antigen-based assessment remains insufficient for reliable survival prediction. There is a clear need for integrative, data-driven approaches that leverage multi-dimensional clinical and molecular data to improve outcome stratification. This study aimed to develop and evaluate an explainable machine learning framework for predicting overall survival in prostate adenocarcinoma. Methods: A comprehensive machine learning pipeline was constructed using clinical and laboratory data from 494 patients in the TCGA PanCancer Atlas cohort. Following data curation, 16 clinically relevant features were selected through expert-guided filtering and feature selection techniques. Missing values were addressed using imputation strategies, and class imbalance was mitigated using SMOTE. Eight machine learning models were evaluated, including a novel hybrid ensemble model combining Gradient Boosting Machine and random forest (GBM + RF). Model performance was assessed using stratified 10-fold cross-validation and quantified via accuracy, precision, recall, F1-score, and ROC-AUC. Model interpretability was examined using LIME, and prognostic relevance was validated through Cox proportional hazards regression. Results: The hybrid GBM + RF model demonstrated superior performance, achieving 97% accuracy and a ROC-AUC of 0.95 under mode imputation with SMOTE balancing. Ensemble-based models consistently outperformed single classifiers, particularly in handling missing data and class imbalance. Key predictors of survival included progression-free survival, hypoxia-related scores, genomic instability markers, and immune-associated variables. Cox regression analysis confirmed the independent prognostic significance of these features, supporting the biological plausibility of the model.
Conclusions: An explainable ensemble machine learning approach enables accurate and clinically interpretable prediction of overall survival in prostate adenocarcinoma. The proposed framework provides a robust foundation for precision urology decision-support systems and warrants validation in independent cohorts.
- Research Article
- 10.1038/s41598-026-49775-7
- Apr 29, 2026
- Scientific reports
- Edoardo Passarotto + 7 more
Human-annotated data is foundational for supervised machine learning (ML). Low inter-rater reliability often introduces noise that degrades model performance. This study investigates how human rating reliability and panel size impact ML efficacy, and introduces a novel debiasing procedure utilizing Random Effects Models (REMs) to mitigate annotator noise. We conducted two complementary experiments to evaluate these dynamics. Experiment 1 analyzed real-world assessments from nine evaluators classifying 355 infant images. Results demonstrated that panel size and specific psychometric reliability indices (namely Cronbach's α, Generalizability, and Dependability coefficients) are strong predictors of ML performance across four algorithms, whereas intra-class correlation coefficients proved less robust. Experiment 2 generated simulated datasets mimicking Experiment 1, incorporating 40 virtual raters with varying expertise levels and structured pattern noise, to evaluate the robustness of model aggregation across diverse rating scenarios. These simulations confirmed that while annotator noise significantly impairs classification, the proposed REM-based debiasing procedure effectively recovers ground-truth scores. Notably, ML models trained on REM-debiased data from merely two raters achieved predictive performance comparable to models utilizing mean-aggregated scores from eight raters. Ultimately, this study underscores the critical importance of psychometrically sound data curation, demonstrating that advanced debiasing techniques can substantially enhance ML accuracy and efficiency even with small expert panels.
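Of the reliability indices named above, Cronbach's α has a compact closed form: with k raters, α = k/(k-1) · (1 - Σ rater variances / variance of the summed scores). The sketch below applies that standard formula to small made-up rating tables; the function name and the numbers are illustrative and are not the study's data or code.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a table of ratings.

    `ratings` is a list of rows, one row per rated item (e.g. an image),
    with one score per rater in each row. Sample variances are used.
    """
    k = len(ratings[0])
    rater_variances = [variance([row[j] for row in ratings]) for j in range(k)]
    total_variance = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(rater_variances) / total_variance)


# Hypothetical panels of three raters scoring four items.
consistent_panel = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
noisy_panel = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 4, 5]]

alpha_consistent = cronbach_alpha(consistent_panel)
alpha_noisy = cronbach_alpha(noisy_panel)
```

Identical raters give an α of 1 up to floating-point error, and disagreement among raters pulls the value below 1, which is the sense in which the index tracks annotation noise.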
- Research Article
- 10.1016/j.survophthal.2026.04.006
- Apr 29, 2026
- Survey of ophthalmology
- Raoul K Khanna + 11 more
Omics in hereditary optic neuropathies: A systematic review of clinical studies with an integrated point of view.
- Research Article
- 10.47852/bonviewaaes62029044
- Apr 28, 2026
- Archives of Advanced Engineering Science
- Massoud Danishmal + 4 more
This study proposes a hybrid PV/wind/grid energy system for small urban bakeries and evaluates its technical performance, economic feasibility, and environmental impact within an integrated framework. Unlike previous studies that assess only economic or environmental aspects, this research combines both dimensions using HOMER Pro and PVsyst simulations over a 20-year project lifetime. The proposed system requires an initial investment of approximately $10,350 and achieves a payback period of 2.28 years, demonstrating strong financial viability. After cost recovery, annual energy savings reach about $4,380, along with an additional monthly income of nearly $212 from surplus electricity under net-metering conditions, which further enhances its economic attractiveness. Environmentally, the system reduces CO2 and other harmful emissions by up to 90% compared to conventional fossil-fuel-based bakery operations, significantly contributing to sustainability goals and cleaner urban environments. The novelty of this study lies in integrating HOMER Pro and PVsyst for simultaneous technical-economic-environmental optimization, applying the hybrid model to small-scale urban bakeries, and quantifying real emission reductions and financial returns under net-metering conditions, thereby providing a comprehensive and practical framework for future renewable energy applications in similar small-scale industries. Received: 8 January 2026 | Revised: 28 February 2026 | Accepted: 7 April 2026 Conflicts of Interest The authors declare that they have no conflicts of interest to this work. Data Availability Statement Data are available from the corresponding author upon reasonable request. Author Contribution Statement Massoud Danishmal: Conceptualization, Methodology, Validation, Formal analysis, Investigation, Data curation, Writing – original draft, Writing – review & editing, Visualization, Supervision, Project administration. 
Dost Mohammad Sarwari: Writing – review & editing, Resources. Mohammad Yasin Kamaly: Writing – review & editing. Mohammad Adel Adeel: Writing – review & editing. Atiqullah Hamim: Writing – review & editing.
- Research Article
- 10.1080/23257962.2026.2652085
- Apr 26, 2026
- Archives and Records
- Yusuke Takeda + 16 more
ABSTRACT Tomography is a technique used to image the three-dimensional structure of an object by cutting it into parallel two-dimensional slices. This technique has many applications to scientific research, but preserving tomography datasets for scientific repeatability poses considerable logistical and financial challenges. Conventional curation of fundamental tomography data and associated metadata demands long working-person hours because there is no standardized, easy-to-use software. Tomography datasets often exceed 1,000 files; datasets of this size exceed the capacity of open online data repositories, and archiving them locally demands abundant disk storage. Here, we present new cost-effective software and hardware developed to facilitate the large-scale curation of tomography data. Our user-friendly software processes RAW tomography data, generates flipbook-style animations, and outputs metadata tables, while a cluster of desktop computers ensures efficient execution. Data are stored on magnetic tapes for robust long-term archiving. This comprehensive, open-source method is designed for easy adoption, especially in museums. It is compatible with the distribution of lightweight 3D models with open data repositories, mirroring the relationship between publishing research papers by publishers and archiving the described specimens by museums. It supports scientific integrity, particularly in the establishment of new species in natural history sciences.