Original Source Code Research Articles

Background: The adverse impact of COVID-19 on marginalized and under-resourced communities of color has highlighted the need for accurate, comprehensive race and ethnicity data. However, a significant technical challenge related to integrating race and ethnicity data in large, consolidated databases is the lack of consistency in how data about race and ethnicity are collected and structured by health care organizations.

Objective: This study aims to evaluate and describe variations in how health care systems collect and report information about the race and ethnicity of their patients and to assess how well these data are integrated when aggregated into a large clinical database.

Methods: At the time of our analysis, the National COVID Cohort Collaborative (N3C) Data Enclave contained records from 6.5 million patients contributed by 56 health care institutions. We quantified the variability in the harmonized race and ethnicity data in the N3C Data Enclave by analyzing the conformance to health care standards for such data. We conducted a descriptive analysis by comparing the harmonized data available for research purposes in the database to the original source data contributed by health care institutions. To make the comparison, we tabulated the original source codes, enumerating how many patients had been reported with each encoded value and how many distinct ways each category was reported. The nonconforming data were also cross tabulated by 3 factors: patient ethnicity, the number of data partners using each code, and which data models utilized those particular encodings. For the nonconforming data, we used an inductive approach to sort the source encodings into categories. For example, values such as “Declined” were grouped with “Refused,” and “Multiple Race” was grouped with “Two or more races” and “Multiracial.”

Results: “No matching concept” was the second largest harmonized concept used by the N3C to describe the race of patients in their database. In addition, 20.7% of the race data did not conform to the standard; the largest category was data that were missing. Hispanic or Latino patients were overrepresented in the nonconforming racial data, and data from American Indian or Alaska Native patients were obscured. Although only a small proportion of the source data had not been mapped to the correct concepts (0.6%), Black or African American and Hispanic/Latino patients were overrepresented in this category.

Conclusions: Differences in how race and ethnicity data are conceptualized and encoded by health care institutions can affect the quality of the data in aggregated clinical databases. The impact of data quality issues in the N3C Data Enclave was not equal across all races and ethnicities, which has the potential to introduce bias in analyses and conclusions drawn from these data. Transparency about how data have been transformed can help users make accurate analyses and inferences and eventually better guide clinical care and public policy.
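The tabulation step described in the Methods can be illustrated with a minimal sketch. It is not the authors' code: the function names, the record shape, and the lookup table are hypothetical, and only the two groupings quoted in the abstract ("Declined" with "Refused"; "Multiple Race" with "Two or more races" and "Multiracial") are taken from the source.

```python
from collections import Counter, defaultdict

# Hypothetical lookup table. Only these two groupings are named in the
# abstract; a real analysis would build the table inductively from the data.
CATEGORY_OF = {
    "declined": "Refused",
    "refused": "Refused",
    "multiple race": "Two or more races",
    "two or more races": "Two or more races",
    "multiracial": "Two or more races",
}


def categorize(source_value):
    """Map a raw source encoding to a category, or 'Uncategorized'."""
    key = source_value.strip().lower()
    return CATEGORY_OF.get(key, "Uncategorized")


def tabulate(records):
    """Count patients per category and distinct encodings per category.

    records: iterable of (patient_id, source_value) pairs (assumed shape).
    """
    patients = Counter()
    encodings = defaultdict(set)
    for _patient_id, source_value in records:
        category = categorize(source_value)
        patients[category] += 1
        encodings[category].add(source_value.strip())
    distinct = {category: len(values) for category, values in encodings.items()}
    return patients, distinct


# Made-up example records, for illustration only.
records = [
    (1, "Declined"),
    (2, "Refused"),
    (3, "Multiracial"),
    (4, "Multiple Race"),
    (5, "Two or more races"),
    (6, "refused"),
]
patients, distinct = tabulate(records)
```

Here `patients` holds the number of patients reported under each category, and `distinct` holds how many different source spellings fed into it, which mirrors the two counts the Methods paragraph says were enumerated.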

To the Editor: In recent years, big data analysis has revolutionised our approach to research. Consequent trends have seen a move towards more reproducible research, including the use of open access analysis tools. We surveyed ESPGHAN members to assess current utilisation of data analysis tools and investigate potential interest in developing skills in data analysis and coding. Seventy-two individuals from 14 countries representing allied health professionals (n = 6), senior clinicians (n = 26), trainees (n = 33) and researchers (n = 13) responded. Of the top five frequently used tools: Microsoft Excel (67/70), SPSS (34/70), STATA (19/70) and R (14/70), only R is free and openly available. A striking 95% of respondents would be interested in learning another programme. When asked, 95% of individuals would be interested in face-to-face or online training opportunities. Similarly, 96% of individuals would be interested if ESPGHAN were to provide funding to help skills development in this area. Our ESPGHAN special interest group for basic and translational research aims to pilot a programme in collaboration with the ESPGHAN Research Committee to help support skills development in data science. Funding would include enrolment to a data science learning platform to undertake certified courses. We also envisage providing face-to-face opportunities with bioinformaticians and statisticians in biological sciences at leading centres that can offer on-site training and support. Our survey confirms this is a current unmet need; with this initiative we aim to provide ESPGHAN members with a career-changing opportunity that will pave the way to further involvement in high-quality research in our area (Fig. 1).

FIGURE 1: Summary of key survey findings. (a) Stacked bar plot showing the frequency of usage of each data analysis tool/language. Y-axis shows the number of respondents. Bars are coloured according to the frequency of usage. (b) Key showing whether the languages are considered “Open Source”, defined as software for which the original source code is made freely available and may be distributed and modified, and “Reproducible”, defined as analysis that provides the original data, code and software, allowing others to reach the same results and conclusions as the authors. (c) Bar chart showing response to questions regarding interest in learning new tools and in ESPGHAN funding.

Articles published on Original Source Code

Analysis and Experimentation on the ManTraNet Image Forgery Detector

Experiments on Deep Single-Image Portrait Relighting

A Presentation and Short Discussion of rVAD-fast, a Fast Voice Activity Detector

A Brief Analysis of the Dense Extreme Inception Network for Edge Detection

Phase Unwrapping using a Joint CNN and SQD-LSTM Network

A Brief Analysis of the Holistically-Nested Edge Detector

Original source code as used in Werner et al. "Land-neutral negative emissions through biochar-based fertilization – global potentials driven by management and pyrolysis conditions" -- submitted to Mitigation and Adaptation Strategies for Global Change

Issues With Variability in Electronic Health Record Data About Race and Ethnicity: Descriptive Analysis of the National COVID Cohort Collaborative Data Enclave.

Original source code as used in Werner et al., 2022 "Potential of land-neutral negative emissions through biochar sequestration" -- submitted to Earth's Future

Idse-HE: Hybrid embedding graph neural network for drug side effects prediction

Mitigating Computer Limitations in Replicating Numerical Simulations of a Neural Network Model With Hodgkin-Huxley-Type Neurons.

Reproducible Builds: Increasing the Integrity of Software Supply Chains

Training Opportunities in Data Science Are Welcome in ESPGHAN: A Survey on Behalf of the Special Interest Group for Basic Science and Translational Research.

Deoptfuscator: Defeating Advanced Control-Flow Obfuscation Using Android Runtime (ART)

FSEI-GPU: GPU accelerated simulations of the fluid–structure–electrophysiology interaction in the left heart

The NAS Parallel Benchmarks for evaluating C++ parallel programming frameworks on shared-memory architectures

Modeling extra-deep electromagnetic logs using a deep neural network

Erratum: Fibers of word maps and the multiplicities of non-abelian composition factors

Simulation of Quasi-Static Crack Propagation by Adaptive Finite Element Method

PredCom: A Predictive Approach to Collecting Approximated Communication Traces
