Information Needs and Data Harmonization—Two Sides of the Same Coin?
- Research Article
- 10.3389/fninf.2024.1385526
- May 17, 2024
- Frontiers in neuroinformatics
There is an increasing desire to study neurodevelopmental disorders (NDDs) together to understand commonalities, develop generic health promotion strategies, and improve clinical treatment. Common data elements (CDEs) collected across studies involving children with NDDs afford an opportunity to answer clinically meaningful questions. We undertook a retrospective, secondary analysis of data pertaining to sleep in children with different NDDs collected through various research studies. The objective of this paper is to share lessons learned about data management, collation, and harmonization from a sleep study in children within and across NDDs from large, collaborative research networks in the Ontario Brain Institute (OBI). Three collaborative research networks contributed demographic data and data pertaining to sleep, internalizing symptoms, health-related quality of life, and severity of disorder for children with six different NDDs: autism spectrum disorder; attention deficit/hyperactivity disorder; obsessive compulsive disorder; intellectual disability; cerebral palsy; and epilepsy. Procedures for data harmonization, derivation, and merging were shared, and examples pertaining to severity of disorder and sleep disturbances were described in detail. Important lessons emerged from the data harmonization procedures: prioritizing the collection of CDEs to ensure data completeness; ensuring unprocessed data are uploaded for harmonization to facilitate timely analytic procedures; maintaining variable naming that is consistent with data dictionaries at the time of project validation; and holding regular meetings with the research networks to discuss and overcome challenges with data harmonization. Buy-in from all research networks at study inception, together with oversight from a centralized infrastructure (OBI), underscored the importance of collaboration in collecting CDEs and facilitating data harmonization to improve outcomes for children with NDDs.
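One of the lessons above, keeping variable names consistent with the data dictionary, can be sketched as a small dictionary-driven renaming step. The variable names and CDE labels below are hypothetical illustrations, not OBI's actual dictionary.

```python
# Minimal sketch of dictionary-driven variable renaming; all variable names
# and CDE labels here are hypothetical.

def rename_to_cde(record, data_dictionary):
    """Map one network's variable names onto common data element (CDE) names.

    Unmapped variables are returned separately so they can be reviewed
    rather than silently dropped."""
    harmonized, unmapped = {}, {}
    for name, value in record.items():
        cde_name = data_dictionary.get(name)
        if cde_name is None:
            unmapped[name] = value
        else:
            harmonized[cde_name] = value
    return harmonized, unmapped

# Hypothetical mapping from one network's export to shared CDE names.
dictionary = {"sleep_dist_total": "cde_sleep_disturbance",
              "hrqol_score": "cde_quality_of_life"}
record = {"sleep_dist_total": 42, "hrqol_score": 71, "site_id": "A"}
harmonized, unmapped = rename_to_cde(record, dictionary)
```

Returning the unmapped variables explicitly mirrors the lesson about data completeness: anything that falls outside the dictionary surfaces for review instead of disappearing during the merge.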
- Research Article
- 10.2105/ajph.2015.302788
- Oct 15, 2015
- American Journal of Public Health
Large-scale, multisite data sets offer the potential for exploring the public health benefits of biomedical interventions. Data harmonization is an emerging strategy to increase the comparability of research data collected across independent studies, enabling research questions to be addressed beyond the capacity of any individual study. The National Institute on Drug Abuse recently implemented this novel strategy to prospectively collect and harmonize data across 22 independent research studies developing and empirically testing interventions to effectively deliver an HIV continuum of care to diverse drug-abusing populations. We describe this data collection and harmonization effort, collectively known as the Seek, Test, Treat, and Retain Data Collection and Harmonization Initiative, which can serve as a model applicable to other research endeavors.
- Research Article
- 10.1177/14034948211052164
- Oct 27, 2021
- Scandinavian Journal of Public Health
Aims: There are several advantages to pooling survey data from individual studies over time or across different countries. Our aim is to share our experiences on harmonizing data from 13 Finnish health examination surveys covering the years 1972–2017 and to describe the challenges related to harmonizing different variable types using two questionnaire variables – blood pressure measurement and total cholesterol assessment – as examples. Methods: Data from Finnish national population-based health surveys were harmonized as part of the research project ‘Projections of the Burden of Disease and Disability in Finland – Health Policy Prospects’, including variables from questionnaires, objective health measurements and results from the laboratory analysis of biological samples. The process presented in the Maelstrom Research guidelines for data harmonization was followed with minor adjustments. Results: The harmonization of data from objective measurements and biomarkers was reasonably straightforward, but questionnaire items proved more challenging. Some questions and response options had changed during the covered time period. This concerned, for example, questionnaire items on the availability and use of medication and diet. Conclusions: The long time period – 45 years – made harmonization more complicated. The survey questions or response options had changed for some topics due to changes in society. However, common core variables for topics that were especially relevant for the project, such as lifestyle factors and certain diseases or conditions, could be harmonized with sufficient comparability. For future surveys, the use of standardized survey methods and the proper documentation of data collection are recommended to facilitate harmonization.
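The recoding challenge described above, where response options change between survey waves, can be illustrated with a minimal wave-specific recode table. The wave years, scales, and codes below are invented for illustration and are not taken from the Finnish surveys.

```python
# Hypothetical wave-specific response codes mapped onto one harmonized
# 3-point coding (1 = never, 2 = sometimes, 3 = regularly).
RECODE = {
    1978: {1: 1, 2: 2, 3: 2, 4: 3},  # invented older 4-point scale
    2017: {1: 1, 2: 2, 3: 3},        # invented newer 3-point scale
}

def harmonize_response(wave, code):
    """Return the harmonized code, or None when the wave or code is unknown."""
    return RECODE.get(wave, {}).get(code)
```

Keeping the mapping explicit per wave documents exactly how each historical response option was collapsed, which is the kind of documentation the authors recommend for future surveys.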
- Research Article
- 10.3233/shti210264
- May 27, 2021
- Studies in health technology and informatics
Vaccination information is needed at both the individual and population levels, as it is an important part of public health measures. In Finland, a vaccination data structure has been developed for centralized information services that include patient access to information. Harmonization of these data with the national vaccination registry is ongoing. New requirements for vaccination certificates have emerged because of the COVID-19 pandemic. We explore the readiness of Finnish vaccination data structures and what can be learned from Finnish harmonization efforts in order to achieve the required level of interoperability.
- Research Article
- 10.1371/journal.pone.0027899
- Nov 16, 2011
- PLoS ONE
Using data from eight UK cohorts participating in the Healthy Ageing across the Life Course (HALCyon) research programme, with ages at physical capability assessment ranging from 50 to 90+ years, we harmonised data on objective measures of physical capability (i.e. grip strength, chair rising ability, walking speed, timed get up and go, and standing balance performance) and investigated the cross-sectional age and gender differences in these measures. Levels of physical capability were generally lower in study participants of older ages, and men performed better than women (for example, meta-analyses (N = 14,213 (5 studies)) found that men had 12.62 kg (11.34, 13.90) higher grip strength than women after adjustment for age and body size), although for walking speed, this gender difference was attenuated after adjustment for body size. There was also evidence that the gender difference in grip strength diminished with increasing age, whereas the gender difference in walking speed widened (p<0.01 for interactions between age and gender in both cases). This study highlights not only the presence of age and gender differences in objective measures of physical capability but also provides a demonstration that harmonisation of data from several large cohort studies is possible. These harmonised data are now being used within HALCyon to understand the lifetime social and biological determinants of physical capability and its changes with age.
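The pooled grip-strength difference above comes from meta-analysing harmonized study-level estimates. A minimal inverse-variance fixed-effect pooling looks roughly like the sketch below, with made-up per-study values; the HALCyon analyses themselves may have used different models.

```python
import math

# Sketch of inverse-variance fixed-effect pooling of harmonized study-level
# mean differences; the study estimates and standard errors are invented.

def pool_fixed_effect(estimates, std_errors):
    """Pool study estimates weighted by 1/SE^2; return (pooled, pooled_se)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

est = [12.0, 13.0]  # hypothetical per-study mean differences (kg)
se = [0.5, 0.5]     # hypothetical standard errors
pooled, pooled_se = pool_fixed_effect(est, se)
```

Harmonization is what makes the weighting legitimate here: the per-study estimates must measure the same construct on the same scale before they can be pooled.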
- Supplementary Content
- 10.1200/cci.19.00169
- Jul 9, 2020
- JCO Clinical Cancer Informatics
PURPOSE: The cancer research community is constantly evolving to better understand tumor biology, disease etiology, risk stratification, and pathways to novel treatments. Yet the clinical cancer genomics field has been hindered by redundant efforts to meaningfully collect and interpret disparate data types from multiple high-throughput modalities and integrate into clinical care processes. Bespoke data models, knowledgebases, and one-off customized resources for data analysis often lack adequate governance and quality control needed for these resources to be clinical grade. Many informatics efforts focused on genomic interpretation resources for neoplasms are underway to support data collection, deposition, curation, harmonization, integration, and analytics to support case review and treatment planning. METHODS: In this review, we evaluate and summarize the landscape of available tools, resources, and evidence used in the evaluation of somatic and germline tumor variants within the context of molecular tumor boards. RESULTS: Molecular tumor boards (MTBs) are collaborative efforts of multidisciplinary cancer experts equipped with genomic interpretation resources to aid in the delivery of accurate and timely clinical interpretations of complex genomic results for each patient, within an institution or hospital network. Virtual MTBs (VMTBs) provide an online forum for collaborative governance, provenance, and information sharing between experts outside a given hospital network, with the potential to enhance MTB discussions. Knowledge sharing in VMTBs and communication with guideline-developing organizations can lead to progress evidenced by data harmonization across resources, crowd-sourced and expert-curated genomic assertions, and a more informed and explainable usage of artificial intelligence. CONCLUSION: Advances in cancer genomics interpretation aid in better patient and disease classification, more streamlined identification of relevant literature, and a more thorough review of available treatments and predicted patient outcomes.
- Research Article
- 10.1186/s12874-021-01494-5
- Jan 7, 2022
- BMC Medical Research Methodology
Background: The small sample sizes available within many very preterm (VPT) longitudinal birth cohort studies mean that it is often necessary to combine and harmonise data from individual studies to increase statistical power, especially for studying rare outcomes. Curating and mapping data is a vital first step in the process of data harmonisation. To facilitate data mapping and harmonisation across VPT birth cohort studies, we developed a custom classification system as part of the Research on European Children and Adults born Preterm (RECAP Preterm) project in order to increase the scope and generalisability of research and the evaluation of outcomes across the lifespan for individuals born VPT. Methods: The multidisciplinary consortium of expert clinicians and researchers who made up the RECAP Preterm project participated in a four-phase consultation process via email questionnaire to develop a topic-specific classification system. Descriptive analyses were calculated after each questionnaire round to provide pre- and post-ratings to assess levels of agreement with the classification system as it developed. Amendments and refinements were made to the classification system after each round. Results: Expert input from 23 clinicians and researchers from the RECAP Preterm project aided development of the classification system's topic content, refining it from 10 modules, 48 themes and 197 domains to 14 modules, 93 themes and 345 domains. Supplementary classifications for target, source, mode and instrument were also developed to capture additional variable-level information. Over 22,000 individual data variables relating to VPT birth outcomes have been mapped to the classification system to date to facilitate data harmonisation. This will continue to increase as retrospective data items are mapped and harmonised variables are created. Conclusions: This bespoke preterm birth classification system is a fundamental component of the RECAP Preterm project's web-based interactive platform. It is freely available for use worldwide by those interested in research into the long-term impact of VPT birth. It can also be used to inform the development of future cohort studies.
- Research Article
- 10.5617/dhnbpub.11311
- Oct 6, 2022
- Digital Humanities in the Nordic and Baltic Countries Publications
This paper discusses current challenges in archaeological cultural heritage data management and presents the interdisciplinary research project DigiNUMA. The project investigates solutions in data harmonisation and dissemination of pan-European cultural heritage through an interdisciplinary and cross-sectoral project in Digital Humanities, semantic computing, participatory heritage, museum collections management and archaeological/numismatic studies. Using Finnish and English numismatic data as a test case, DigiNUMA creates ontological infrastructure and a proof-of-concept data model for finely-grained Linked Open Data (LOD) harmonisation across national and international databases for cultural heritage data, and tests it through a broad suite of Digital Humanities analyses.
- Preprint Article
- 10.5194/egusphere-egu25-13062
- Mar 18, 2025
There has been a rapid increase in the number of studies on both trash and microplastics in recent years, with little data standardization. However, as data is being produced by a wide range of practitioners with differing study goals, researchers adhering to a single data standard may not be realistic. Post-hoc data harmonization is a pathway that transforms non-standardized data from prior studies into harmonized, comparable databases. Harmonization, however, is hindered by the vast number of categorical descriptors used to describe trash and microplastics (thousands or more), making manual harmonization efforts labor intensive. Additionally, non-semantic data misalignment also exists as different studies measure plastic occurrence via different metrics (particle count, mass, volume, etc.) and evaluate differing size ranges that must be rescaled to make meaningful comparisons between concentrations. We created Microplastics and Trash Cleaning and Harmonization (MaTCH), an AI automated algorithm utilizing manually developed databases that describe relationships between categorical descriptors of trash and microplastic particles. MaTCH also integrates other data harmonization techniques to address non-semantic issues of misalignment. All steps are combined into a single algorithm that can harmonize datasets from studies using various nomenclature, study methods, data formats, and reporting metrics. MaTCH is available as an open-source web tool for the research community to rapidly and accurately leverage existing data from trash and microplastic studies to better perform meta-analyses and make more meaningful assessments of data trends. By providing MaTCH as a live web-tool, we are able to include data from new and emerging studies to improve algorithm performance and keep up with the rapid pace of discovery. In a field as labor intensive as plastics research, we believe this may greatly expedite future discovery.
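One of the non-semantic alignment problems the abstract mentions, studies measuring different particle size ranges, is often handled by assuming a parametric particle size distribution. The sketch below rescales a count between size ranges under an assumed power law with an illustrative exponent; the abstract does not state which rescaling MaTCH actually uses.

```python
# Sketch of rescaling a microplastic particle count between measured size
# ranges under an assumed power-law size distribution n(x) ~ x**(-alpha);
# the exponent and size ranges below are illustrative only.

def rescale_count(count, measured_range, target_range, alpha=2.0):
    """Rescale a count from one size range to another: for alpha > 1, the
    expected count in [a, b] is proportional to a**(1-alpha) - b**(1-alpha)."""
    def relative_abundance(a, b):
        return a ** (1 - alpha) - b ** (1 - alpha)
    a, b = measured_range
    c, d = target_range
    return count * relative_abundance(c, d) / relative_abundance(a, b)

# e.g. a study counted 50 particles between 100 and 5000 um; estimate the
# count over a wider 20-5000 um range used by a harmonized database.
rescaled = rescale_count(50.0, (100, 5000), (20, 5000), alpha=2.0)
```

Because small particles dominate under a power law, widening the lower size bound inflates the estimate substantially, which is why studies with different detection limits cannot be compared without such a correction.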
- Supplementary Content
- 10.1097/cce.0000000000001179
- Nov 15, 2024
- Critical Care Explorations
A growing body of critical care research draws on real-world data from electronic health records (EHRs). The bedside clinician has myriad data sources to aid in clinical decision-making, but the lack of data sharing and harmonization standards leaves much of this data out of reach for multi-institution critical care research. The Society of Critical Care Medicine (SCCM) Discovery Data Science Campaign convened a panel of critical care and data science experts to explore and document unique advantages and opportunities for leveraging EHR data in critical care research. This article reviews and illustrates six organizing topics (data domains and common data elements; data harmonization; data quality; data interoperability and digital infrastructure; data access, sharing, and governance; and ethics and equity) as a data science primer for critical care researchers, laying a foundation for future publications from the SCCM Discovery Data Harmonization and Sharing Guiding Principles Panel.
- Research Article
- 10.1017/s0033291724003301
- Jan 1, 2025
- Psychological medicine
Improving patient outcomes will be enhanced by understanding "what works, for whom?" enabling better matching of patients to available treatments. However, answering this "what works, for whom?" question requires sample sizes that exceed those of most individual trials. Conventional methods for combining data across trials, including aggregate-data meta-analysis, suffer from key limitations including difficulty accounting for differences across trials (e.g., comparing "apples to oranges"). Causally interpretable meta-analysis (CI-MA) addresses these limitations by pairing individual-participant-data (IPD) across trials using advancements in transportability methods to extend causal inferences to clinical "target" populations of interest. Combining IPD across trials also requires careful acquisition and harmonization of data, a challenging process for which practical guidance is not well-described in the literature. We describe methods and work to date for a large harmonization project in pediatric obsessive-compulsive disorder (OCD) that employs CI-MA. We review the data acquisition, harmonization, meta-data coding, and IPD analysis processes for Project Harmony, a study that (1) harmonizes 28 randomized controlled trials, along with target data from a clinical sample of treatment-seeking youth ages 4-20 with OCD, and (2) applies CI-MA to examine "what works, for whom?" We also detail dissemination strategies and partner involvement planned throughout the project to enhance the future clinical utility of CI-MA findings. Data harmonization took approximately 125 hours per trial (3,000 hours total), which was considerably higher than preliminary projections. Applying CI-MA to harmonize data has the potential to answer "what works for whom?" in pediatric OCD.
- Research Article
- 10.1158/1538-7445.am2019-2465
- Jul 1, 2019
- Cancer Research
Gabriella Miller Kids First Pediatric Research Program (GMKF) is a nation-wide, multi-year initiative focused on the integration of large-scale clinically annotated genomic data for childhood cancers and structural birth defects supported by the NIH Common Fund. Awarded by GMKF, the Kids First Data Resource Center (DRC) is tasked to build infrastructure and workflows for data intake, harmonization, integration and access authorization to empower collaborative discoveries across the GMKF and other integrated datasets. A key challenge for uniform analyses and empowered discovery of large-scale genomic data relates to the diverse genomic processing workflows and methods employed across the sequencing and bioinformatics community. The DRC genomic harmonization team aims to provide “analysis ready” datasets that are “functionally equivalent” across the Kids First datasets and other large-scale genomic data initiatives in order to accelerate the discovery process. Paired with the cloud-based workspace environments of the DRC, such harmonized datasets provide unprecedented opportunities for shared, reproducible discovery by a diverse, collaborative network of researchers. As such, DRC initial pipelines are developed with BWA-MEM alignment on genome build GRCh38 followed by the GATK best practices for germline variant calling and joint genotyping. Common Workflow Language (CWL) is used as the main workflow specification, while Docker technology has been applied to containerize all the tools used by the workflow. Our current workflows are tasked with data harmonization across a number of different experimental platforms including whole genome sequencing, exome sequencing, and RNA-seq. The data processing is done via CAVATICA, an Amazon Web Services (AWS) based cloud computing platform associated with the Kids First DRC Portal co-developed by Seven Bridges Genomics, where workflows feature scatter-gather parallelization and AWS resource optimization.
By utilizing such a framework, the DRC team has harmonized over 10,000 WGS and 1,000 RNA-Seq samples across 12 study cohorts within 8 months. This dataset in its current release includes samples from 40 pediatric brain cancers as well as 8 childhood birth defects, with the outcome of delivering 150 TB of harmonized CRAMs and 60 TB of gVCFs. With a highly optimized bioinformatics pipeline powered by an efficient cloud-based execution workflow, the DRC platform processes one genome in about 11 hours with an average compute cost of $15 for whole genome alignment and germline variant calling. Here we present our observed challenges and identified opportunities in the analysis and integration of multi-disease pediatric genomic data on a large scale. Citation Format: Yuankun Zhu, Miguel Brown, Batsal Devkota, Bailey Farrow, Bogdan Gavrilovic, Allison Heath, Kyle Hernandez, Avi Kelman, Parimala Killada, Meen Chul Kim, Daniel Kolbman, Mateusz Koptyra, Milan Kovacevic, Maarten Leerkes, Alex Lubneuski, Michele Mattioni, Pichai Raman, Adam Resnick, Nikola Skundric, Deanne Taylor, Junjun Zhang, Bo Zhang, Phillip B. Storm. Genomic harmonization of the Data Resource Center for Gabriella Miller Kids First Pediatric Research Program [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2019; 2019 Mar 29-Apr 3; Atlanta, GA. Philadelphia (PA): AACR; Cancer Res 2019;79(13 Suppl):Abstract nr 2465.
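The two harmonization stages named in the abstract (BWA-MEM alignment on GRCh38, then GATK germline calling in GVCF mode) can be roughly sketched as command builders. The file names, sample labels, and thread count are placeholders, and the actual DRC pipelines are containerized CWL workflows rather than shell strings.

```python
# Sketch of the two pipeline stages from the abstract; paths and sample
# names are placeholders, not DRC resources.

def bwa_mem_cmd(reference, fastq1, fastq2, out_bam, threads=16):
    """Alignment stage: BWA-MEM against the reference, piped into a
    coordinate-sorted BAM via samtools."""
    return (f"bwa mem -t {threads} {reference} {fastq1} {fastq2} "
            f"| samtools sort -o {out_bam} -")

def haplotype_caller_cmd(reference, bam, out_gvcf):
    """Germline calling stage: GATK HaplotypeCaller in GVCF mode, producing
    per-sample gVCFs suitable for later joint genotyping."""
    return (f"gatk HaplotypeCaller -R {reference} -I {bam} "
            f"-O {out_gvcf} -ERC GVCF")

cmds = [bwa_mem_cmd("GRCh38.fa", "s1_R1.fq.gz", "s1_R2.fq.gz", "s1.bam"),
        haplotype_caller_cmd("GRCh38.fa", "s1.bam", "s1.g.vcf.gz")]
```

Emitting per-sample gVCFs rather than final VCFs is what makes joint genotyping across cohorts possible later, which is the "functionally equivalent" property the harmonization team is after.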
- Research Article
- 10.1002/cpz1.1055
- Jun 1, 2024
- Current protocols
Data harmonization involves combining data from multiple independent sources and processing the data to produce one uniform dataset. Merging separate genotypes or whole-genome sequencing datasets has been proposed as a strategy to increase the statistical power of association tests by increasing the effective sample size. However, data harmonization is not a widely adopted strategy due to the difficulties with merging data (including confounding produced by batch effects and population stratification). Detailed data harmonization protocols are scarce and are often conflicting. Moreover, data harmonization protocols that accommodate samples of admixed ancestry are practically non-existent. Existing data harmonization procedures must be modified to ensure the heterogeneous ancestry of admixed individuals is incorporated into additional downstream analyses without confounding results. Here, we propose a set of guidelines for merging multi-platform genetic data from admixed samples that can be adopted by any investigator with elementary bioinformatics experience. We have applied these guidelines to aggregate 1544 tuberculosis (TB) case-control samples from six separate in-house datasets and conducted a genome-wide association study (GWAS) of TB susceptibility. The GWAS performed on the merged dataset had improved power over analyzing the datasets individually and produced summary statistics free from bias introduced by batch effects and population stratification. © 2024 Wiley Periodicals LLC. Basic Protocol 1: Processing separate datasets comprising array genotype data Alternate Protocol 1: Processing separate datasets comprising array genotype and whole-genome sequencing data Alternate Protocol 2: Performing imputation using a local reference panel Basic Protocol 2: Merging separate datasets Basic Protocol 3: Ancestry inference using ADMIXTURE and RFMix Basic Protocol 4: Batch effect correction using pseudo-case-control comparisons.
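One routine step when merging genotype datasets, reconciling alleles at shared variants before combining samples, can be sketched as follows. The toy variant records and the simple strand-flip check are illustrative only and are not the protocols' actual procedure.

```python
# Toy sketch of allele reconciliation when merging two genotype datasets:
# keep variants present in both, and flag apparent strand flips versus true
# discordances. Variants are simplified to {(chrom, pos): (ref, alt)}.

FLIP = {"A": "T", "T": "A", "C": "G", "G": "C"}  # reverse-complement bases

def classify_shared_variants(ds1, ds2):
    """Return lists of matching, strand-flipped, and discordant variant keys
    among the variants shared by both datasets."""
    match, flipped, discordant = [], [], []
    for key in sorted(set(ds1) & set(ds2)):
        ref1, alt1 = ds1[key]
        ref2, alt2 = ds2[key]
        if (ref1, alt1) == (ref2, alt2):
            match.append(key)
        elif (FLIP[ref1], FLIP[alt1]) == (ref2, alt2):
            flipped.append(key)
        else:
            discordant.append(key)
    return match, flipped, discordant

ds1 = {("1", 100): ("A", "G"), ("1", 200): ("C", "T"), ("2", 50): ("G", "A")}
ds2 = {("1", 100): ("A", "G"), ("1", 200): ("G", "A"), ("3", 10): ("T", "C")}
match, flipped, discordant = classify_shared_variants(ds1, ds2)
```

In practice flipped variants are recoded to a common strand and discordant ones are dropped or investigated; skipping this step is one way the batch effects the abstract warns about enter a merged dataset.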
- Research Article
- 10.1016/j.jii.2019.08.001
- Aug 30, 2019
- Journal of Industrial Information Integration
Engineering complex data integration, harmonization and visualization systems
- Research Article
- 10.3389/fnins.2023.1146175
- May 25, 2023
- Frontiers in Neuroscience
Data harmonization is a key step widely used in multisite neuroimaging studies to remove inter-site heterogeneity of data distribution. However, data harmonization may even introduce additional inter-site differences in neuroimaging data if outliers are present in the data of one or more sites. It remains unclear how the presence of outliers could affect the effectiveness of data harmonization and consequently the results of analyses using harmonized data. To address this question, we generated a normal simulation dataset without outliers and a series of simulation datasets with outliers of varying properties (e.g., outlier location, outlier quantity, and outlier score) based on a real large-sample neuroimaging dataset. We first verified the effectiveness of the most commonly used ComBat harmonization method in the removal of inter-site heterogeneity using the normal simulation data, and then characterized the effects of outliers on the effectiveness of ComBat harmonization and on the results of association analyses between brain imaging-derived phenotypes and a simulated behavioral variable using the simulation datasets with outliers. We found that, although ComBat harmonization effectively removed the inter-site heterogeneity in multisite data and consequently improved the detection of the true brain-behavior relationships, the presence of outliers could severely damage the effectiveness of ComBat harmonization in the removal of data heterogeneity or even introduce extra heterogeneity into the data. Moreover, we found that the effects of outliers on the improvement of the detection of brain-behavior associations by ComBat harmonization were dependent on how such associations were assessed (i.e., by Pearson correlation or Spearman correlation), and on the outlier location, quantity, and score.
These findings help us better understand the influences of outliers on data harmonization and highlight the importance of detecting and removing outliers prior to data harmonization in multisite neuroimaging studies.
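To see how an outlier can distort harmonization, consider a stripped-down location-scale site correction; real ComBat additionally pools site estimates with empirical Bayes, so this is only a sketch. A single extreme value shifts the site's estimated mean and variance, dragging every "harmonized" value at that site along with it.

```python
import statistics

# Simplified location-scale site correction (ComBat proper adds empirical
# Bayes shrinkage of site estimates); shows how one outlier distorts the
# site mean/SD used for harmonization. All numbers are synthetic.

def harmonize_site(values, grand_mean, grand_sd):
    """Standardize one site's values using its own mean/SD, then re-express
    them on the pooled (grand) scale."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [grand_mean + grand_sd * (v - m) / s for v in values]

site = [10.0, 11.0, 9.0, 10.5]
clean = harmonize_site(site, grand_mean=0.0, grand_sd=1.0)
# Adding one extreme value inflates the site mean and SD, so the four
# normal observations are all pushed below the grand mean after "harmonization".
contaminated = harmonize_site(site + [100.0], grand_mean=0.0, grand_sd=1.0)
```

This is the mechanism behind the paper's recommendation: because site parameters are estimated from the data, outliers should be detected and removed before, not after, harmonization.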