Metadata Framework Research Articles

In current biomedical and complex trait research, increasing numbers of large molecular profiling (omics) data sets are being generated. At the same time, many studies fail to be reproduced (Baker 2016, Kim 2018). To improve study reproducibility and data reuse, including the integration of data sets of different types and origins, it is imperative to work with omics data that is findable, accessible, interoperable, and reusable (FAIR, Wilkinson 2016) at the source. The data analysis, integration and stewardship pillar of the Netherlands X-omics Initiative aims to facilitate multi-omics research by providing tools to create, analyze and integrate FAIR omics data. Here we report a joint activity of X-omics and the Netherlands Twin Register demonstrating the FAIRification of a multi-omics data set and the development of a FAIR multi-omics data analysis workflow. The implementation of FAIR principles (Wilkinson 2016) can improve scientific transparency and facilitate data reuse. However, Kim (2018) showed in a case study that the availability of data and code is necessary but not sufficient to reproduce data analyses. They highlighted the importance of interoperable and open formats, and of structured metadata. To increase research reproducibility at the data analysis level, additional practices such as version control, code licensing, and documentation have been proposed. These include recommendations for FAIR software by the Netherlands eScience Center and the Dutch Data Archiving and Networked Services (DANS), and FAIR principles for research software proposed by the Research Data Alliance (Chue Hong 2022). Data analysis in biomedical research usually comprises multiple steps, often resulting in complex data analysis workflows that require additional practices, such as containerization, to ensure transparency and reproducibility (Goble 2020, Stoudt 2021).
We apply these practices to a multi-omics data set that comprises genome-wide DNA methylation profiles, targeted metabolomics, and behavioral data of two cohorts that participated in the ACTION Biomarker Study (ACTION, Aggression in Children: Unraveling gene-environment interplay to inform Treatment and InterventiON strategies, see consortium members in Suppl. material 1) (Boomsma 2015, Bartels 2018, Hagenbeek 2020, van Dongen 2021, Hagenbeek 2022). The ACTION-NTR cohort consists of twins that are either longitudinally concordant or discordant for childhood aggression. The ACTION-Curium-LUMC cohort consists of children referred to the Dutch LUMC Curium academic center for child and youth psychiatry. With the joint analysis of multi-omics data and behavioral data, we aim to identify substructures in the ACTION-NTR cohort and link them to aggressive behavior. First, the individuals are clustered using Similarity Network Fusion (SNF, Wang 2014), and latent feature dimensions are uncovered using different unsupervised methods, including Multi-Omics Factor Analysis (MOFA, Argelaguet 2018) and Multiple Correspondence Analysis (MCA, Lê 2008, Husson 2017). In a second step, we determine correlations between omics and phenotype dimensions, and use them to explain the subgroups of individuals from the ACTION-NTR cohort. To validate the results, we project data of the ACTION-Curium-LUMC cohort onto the latent dimensions and determine whether the correlations between omics and phenotype data can be reproduced. Integration of data across cohorts and across data types requires interoperability. We applied different practices to make the data FAIR, including conversion of files to community-standard formats, and capturing experimental metadata using the ISA (Investigation, Study, Assay) metadata framework (Johnson 2021) and ontology-based annotations.
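The second analysis step described above, correlating omics dimensions with phenotype dimensions, could be sketched roughly as follows. This is only an illustration: the abstract does not state which correlation measure is used, so Pearson correlation is an assumption here, and the factor names and toy scores are hypothetical rather than taken from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("constant input has no defined correlation")
    return cov / sqrt(vx * vy)


def cross_correlations(omics_factors, phenotype_dims):
    """Correlate every omics factor with every phenotype dimension.

    Both arguments map a dimension name to per-individual scores,
    with individuals listed in the same order in every sequence.
    """
    return {
        (o_name, p_name): pearson(o_scores, p_scores)
        for o_name, o_scores in omics_factors.items()
        for p_name, p_scores in phenotype_dims.items()
    }


# Hypothetical toy scores for five individuals (not real study data).
omics = {
    "factor1": [0.1, 0.4, 0.35, 0.8, 0.9],
    "factor2": [1.0, 0.2, 0.5, 0.1, 0.7],
}
pheno = {"dim1": [1.0, 2.0, 1.8, 3.9, 4.2]}

corr = cross_correlations(omics, pheno)
for pair, r in sorted(corr.items()):
    print(pair, round(r, 3))
```

For validation on a second cohort, the same function could be applied to the projected scores and the two correlation tables compared.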
All data analysis steps, including pre-processing of the different omics data types, were implemented in either R or Python and combined in a modular Nextflow (Di Tommaso 2017) workflow, where the environment for each step is provided as a Singularity (Kurtzer 2017) container. The analysis workflow is packaged in a Research Object Crate (RO-Crate) (Soiland-Reyes 2022). The RO-Crate is a FAIR digital object that contains the Nextflow workflow including ontology-based annotations of each analysis step. Since omics data is considered to be potentially personally identifiable, the packaged workflow contains a minimal synthetic data set resembling the original data structure. Finally, the code is made available on GitHub and the workflow is registered at WorkflowHub (Goble 2021). Since our Nextflow workflow is set up in a modular manner, the individual analysis steps can be reused in other workflows. We demonstrate this replicability by applying different sub-workflows to data from two different cohorts.
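To make the packaging step more concrete, the following is a minimal sketch of what an `ro-crate-metadata.json` file for a Nextflow workflow might contain, built with the Python standard library only. It follows the general layout of the RO-Crate 1.1 specification (a metadata descriptor, a root data entity, and a workflow entity), but it is not the authors' actual crate: the file name `main.nf`, the crate name, the license identifier, and the local `#nextflow` identifier are all placeholders chosen for illustration.

```python
import json


def build_crate_metadata(workflow_file, name, license_id):
    """Return a minimal RO-Crate 1.1 metadata document as a dict.

    All argument values are supplied by the caller; nothing here is
    taken from the published workflow.
    """
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "license": {"@id": license_id},
        "mainEntity": {"@id": workflow_file},
        "hasPart": [{"@id": workflow_file}],
    }
    workflow = {
        "@id": workflow_file,
        "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
        "name": name,
        "programmingLanguage": {"@id": "#nextflow"},
    }
    language = {
        # Local identifier used for illustration only.
        "@id": "#nextflow",
        "@type": "ComputerLanguage",
        "name": "Nextflow",
        "url": {"@id": "https://www.nextflow.io/"},
    }
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root, workflow, language],
    }


# Placeholder values (hypothetical, not the published crate).
crate = build_crate_metadata(
    workflow_file="main.nf",
    name="Example multi-omics workflow",
    license_id="https://spdx.org/licenses/MIT",
)
crate_json = json.dumps(crate, indent=2)
print(crate_json)
```

Ontology-based annotations of individual analysis steps, as mentioned in the abstract, would be added as further entities in `@graph`; they are omitted here because the abstract does not specify them.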

Background: Provenance supports the understanding of data genesis, and it is a key factor to ensure the trustworthiness of digital objects containing (sensitive) scientific data. Provenance information contributes to a better understanding of scientific results and fosters collaboration on existing data as well as data sharing. This encompasses defining comprehensive concepts and standards for transparency and traceability, reproducibility, validity, and quality assurance during clinical and scientific data workflows and research.

Objective: The aim of this scoping review is to investigate existing evidence regarding approaches and criteria for provenance tracking and to disclose current knowledge gaps in the biomedical domain. This review covers modeling aspects as well as metadata frameworks for meaningful and usable provenance information during creation, collection, and processing of (sensitive) scientific biomedical data. This review also examines quality aspects of provenance criteria.

Methods: This scoping review will follow the methodological framework by Arksey and O'Malley. Relevant publications will be obtained by querying PubMed and Web of Science. All English-language papers published between January 1, 2006, and March 23, 2021, will be included. Data retrieval will be accompanied by a manual search for grey literature. Potential publications will then be exported into reference management software, and duplicates will be removed. Afterwards, the obtained set of papers will be transferred into a systematic review management tool. All publications will be screened, extracted, and analyzed: title and abstract screening will be carried out by 4 independent reviewers. A majority vote is required to agree on the eligibility of papers based on the defined inclusion and exclusion criteria. Full-text reading will be performed independently by 2 reviewers and, in the last step, key information will be extracted using a pretested template. If agreement cannot be reached, the conflict will be resolved by a domain expert. Charted data will be analyzed by categorizing and summarizing the individual data items based on the research questions. Tabular or graphical overviews will be given, if applicable.

Results: The reporting follows the extension of the Preferred Reporting Items for Systematic reviews and Meta-Analyses statements for Scoping Reviews. Electronic database searches in PubMed and Web of Science resulted in 469 matches after deduplication. As of September 2021, the scoping review is in the full-text screening stage. The data extraction using the pretested charting template will follow the full-text screening stage. We expect the scoping review report to be completed by February 2022.

Conclusions: Information about the origin of healthcare data has a major impact on the quality and the reusability of scientific results as well as follow-up activities. This protocol outlines plans for a scoping review that will provide information about current approaches, challenges, or knowledge gaps with provenance tracking in biomedical sciences.

International Registered Report Identifier (IRRID): DERR1-10.2196/31750

Articles published on Metadata Framework

Make data flow: Understanding the (re)usability of research data

Comparison of basal ganglia regions across murine brain atlases using metadata models and the Waxholm Space

Common Metadata Framework: Integrated Framework for Trustworthy Artificial Intelligence Pipelines

Development and Practice of the National Standard "Metadata for Data Paper Publishing" (《数据论文出版元数据》国家标准研制与实践)

Common Metadata Framework for Research Data Repository: Necessity to Support Open Science

Building Flexible, Scalable, and Machine Learning-Ready Multimodal Oncology Datasets.

Museum Education Using XR Technologies: A Survey of Metadata Models

Ontologies for increasing the FAIRness of plant research data

Semantics-Aware Document Retrieval for Government Administrative Data

A metadata framework for computational phenotypes.

FAIR data station for lightweight metadata management and validation of omics studies.

A Multi-omics Data Analysis Workflow Packaged as a FAIR Digital Object

The Database Construction of Intangible Cultural Heritage Based on Artificial Intelligence

Metadata Framework to Support Deployment of Digital Health Technologies in Clinical Trials in Parkinson's Disease.

Semantic Association and Decision-Making for the Internet of Things Based on Partial Differential Fuzzy Unsupervised Models

Satu Data Indonesia in Sectoral Statistics: Concept of Satu Data Metadata Framework (SDMF)

Addressing the Challenges of Describing Alternative Format Materials: A Metadata Framework to Enhance Information Accessibility of People with Disabilities

A Metadata Framework for Asset Management Decision Support: A Water Infrastructure Case Study

Approaches and Criteria for Provenance in Biomedical Data Sets and Workflows: Protocol for a Scoping Review

ISA API: An open platform for interoperable life science experimental metadata.
