Healthcare Datasets Research Articles

Commercial healthcare claims datasets represent a sample of the US population that is biased along socioeconomic and demographic lines; depending on the target population of interest, results derived from these datasets may not generalize. Rigorous comparisons of claims-derived results to ground-truth data that quantify this bias are lacking. Our objectives were (1) to quantify the extent and variation of the bias associated with commercial healthcare claims data with respect to different target populations and (2) to evaluate how socioeconomic/demographic factors may explain the magnitude of the bias. This is a retrospective observational study. Healthcare claims data come from the Merative™ MarketScan® Commercial Database; reference data for comparison come from the State Inpatient Databases (SID) and the US Census. We considered three target populations, aged 18-64 years: (1) all Americans; (2) Americans with health insurance; (3) Americans with commercial health insurance. We analyzed inpatient discharge records of patients aged 18-64 years, occurring between 01/01/2019 and 12/31/2019 in five states: California, Iowa, Maryland, Massachusetts, and New Jersey. We estimated rates of the 250 most common inpatient procedures, using claims data and using reference data for each target population, and we compared the two estimates. The average rate of inpatient discharges per 100 person-years was 5.39 in the claims data (95% CI: [5.37, 5.40]) and 7.003 (95% CI: [7.002, 7.004]) in the reference data for all Americans, corresponding to a 23.1% underestimate from claims. We found large variation in the extent of relative bias across inpatient procedures, including 22.8% of procedures that were underestimated by more than a factor of 2. There was a significant relationship between socioeconomic/demographic factors and the magnitude of bias: procedures that disproportionately occur in disadvantaged neighborhoods were more underestimated in claims data (R² = 51.6%, p < 0.001).
When the target population was restricted to commercially insured Americans, the bias decreased substantially (3.2% of procedures were biased by more than a factor of 2), but some variation across procedures remained. Naïve use of healthcare claims data to derive estimates for the underlying US population can be severely biased. The extent of bias is at least partially explained by neighborhood-level socioeconomic factors.
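
The comparison reported above can be illustrated with a minimal sketch. It assumes that relative bias is defined as the claims-derived rate minus the reference rate, divided by the reference rate; the abstract does not spell out the formula, so this definition and the function names `relative_bias` and `off_by_factor` are illustrative assumptions rather than the authors' code.

```python
def relative_bias(claims_rate: float, reference_rate: float) -> float:
    """Signed relative bias of a claims-derived rate (assumed definition).

    Negative values mean the claims data underestimate the reference rate.
    """
    if reference_rate <= 0:
        raise ValueError("reference_rate must be positive")
    return (claims_rate - reference_rate) / reference_rate


def off_by_factor(claims_rate: float, reference_rate: float, factor: float = 2.0) -> bool:
    """True if the two rates differ by more than `factor` in either direction."""
    ratio = claims_rate / reference_rate
    return ratio < 1.0 / factor or ratio > factor


# Rounded rates per 100 person-years taken from the abstract (all Americans, 18-64).
bias = relative_bias(claims_rate=5.39, reference_rate=7.003)
print(f"relative bias: {bias:.1%}")        # relative bias: -23.0%
print(off_by_factor(5.39, 7.003))          # False: well within a factor of 2
```

With the rounded rates, the sketch gives an underestimate of about 23.0%, close to the reported 23.1%; the small gap is presumably because the published figure was computed from unrounded rates.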

The integrity and reliability of clinical research outcomes rely heavily on access to vast amounts of data. However, the fragmented distribution of these data across multiple institutions, along with ethical and regulatory barriers, presents significant challenges to accessing relevant data. While federated learning offers a promising solution to leverage insights from fragmented data sets, its adoption faces hurdles due to implementation complexities, scalability issues, and inclusivity challenges. This paper introduces Federated Learning for Everyone (FL4E), an accessible framework facilitating multistakeholder collaboration in clinical research. It focuses on simplifying federated learning through an innovative ecosystem-based approach. The "degree of federation" is a fundamental concept of FL4E, allowing for flexible integration of federated and centralized learning models. This feature provides a customizable solution by enabling users to choose the level of data decentralization based on specific health care settings or project needs, making federated learning more adaptable and efficient. By using an ecosystem-based collaborative learning strategy, FL4E fosters a comprehensive platform for managing real-world data, enhancing collaboration and knowledge sharing among its stakeholders. Evaluating FL4E's effectiveness using real-world health care data sets has highlighted its ecosystem-oriented and inclusive design. By applying hybrid models to 2 distinct analytical tasks (classification and survival analysis) within real-world settings, we have effectively measured the "degree of federation" across various contexts. These evaluations show that FL4E's hybrid models not only match the performance of fully federated models but also avoid the substantial overhead usually linked with these models. Achieving this balance greatly enhances collaborative initiatives and broadens the scope of analytical possibilities within the ecosystem.
FL4E represents a significant step forward in collaborative clinical research by merging the benefits of centralized and federated learning. Its modular ecosystem-based design and the "degree of federation" feature make it an inclusive, customizable framework suitable for a wide array of clinical research scenarios, promising to revolutionize the field through improved collaboration and data use. Detailed implementation and analyses are available on the associated GitHub repository.
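
To make the idea of mixing centralized and federated learning concrete, here is a toy sketch. It is not FL4E's implementation or API: the function `hybrid_fit`, the `pooled_sites` argument, and the per-feature mean used as a stand-in "model" are all hypothetical. Sites named in `pooled_sites` are assumed to share their rows with a central pool, the remaining sites are assumed to keep their data local and share only fitted parameters, and the parameters are combined by a sample-size-weighted average in the style of federated averaging.

```python
from typing import Dict, List, Set

Rows = List[List[float]]


def fit_mean(rows: Rows) -> List[float]:
    """Toy 'model': the per-feature mean of the rows held by one client."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def hybrid_fit(site_data: Dict[str, Rows], pooled_sites: Set[str]) -> List[float]:
    """Fit on a mix of centrally pooled and decentralized sites (hypothetical)."""
    # Sites that agree to share raw rows are merged into one central client.
    pooled = [row for site in pooled_sites for row in site_data[site]]
    # Every other site stays a separate client and exposes only its parameters.
    clients = [rows for site, rows in site_data.items() if site not in pooled_sites]
    if pooled:
        clients.append(pooled)

    total = sum(len(rows) for rows in clients)
    dims = len(clients[0][0])
    aggregate = [0.0] * dims
    for rows in clients:
        weight = len(rows) / total  # weight each client by its sample size
        for i, param in enumerate(fit_mean(rows)):
            aggregate[i] += weight * param
    return aggregate


sites = {
    "hospital_a": [[1.0, 10.0], [3.0, 14.0]],
    "hospital_b": [[5.0, 20.0]],
    "hospital_c": [[7.0, 30.0], [9.0, 26.0], [2.0, 8.0]],
}
fully_federated = hybrid_fit(sites, pooled_sites=set())
hybrid = hybrid_fit(sites, pooled_sites={"hospital_a", "hospital_b"})
fully_central = hybrid_fit(sites, pooled_sites=set(sites))
print(fully_federated, hybrid, fully_central)
```

For this toy estimator all three settings return the same parameters up to floating-point rounding, because a sample-size-weighted average of means equals the pooled mean. Real classification or survival models generally do not have that property, which is why the abstract evaluates hybrid and fully federated models empirically.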

Articles published on Healthcare Datasets

Unit-Power Half-Normal Distribution Including Quantile Regression with Applications to Medical Data

Improving class probability estimates in asymmetric health data classification: An experimental comparison of novel calibration methods

Increase in major osteoporotic fractures after therapy with immune checkpoint inhibitors

Benchmarking commercial healthcare claims data.

The Multicollinearity Effect on the Performance of Machine Learning Algorithms: Case Examples in Healthcare Modelling

Age-stratified predictions of suicide attempts using machine learning in middle and late adolescence

JmBIG: enhancing dynamic risk prediction and personalized medicine through joint modeling of longitudinal and survival data in big routinely collected data

Exploring The Efficiency of Metaheuristics in Optimal Hyperparameter Tuning for Ensemble Models on Varied Data Modalities

On responsible machine learning datasets emphasizing fairness, privacy and regulatory norms with examples in biometrics and healthcare

A novel approach for e-health recommender systems

Implementing and evaluating simple resampling techniques in federated learning for imbalanced data

Report on the Fifth International Workshop on Health Data Management in the Era of AI (HeDAI 2023)

Communicating exploratory unsupervised machine learning analysis in age clustering for paediatric disease

Optimization of machine learning models through quantization and data bit reduction in healthcare datasets

Accessible Ecosystem for Clinical Research (Federated Learning for Everyone): Development and Usability Study.

Postpartum haemorrhage and risk of cardiovascular disease in later life: A population-based record linkage cohort study.

RWD112 Going Beyond Claims: Unleashing the Power of Diverse Healthcare Datasets in Clinical and HEOR Assessments

Risk of drug-related death associated with co-prescribing of gabapentinoids and Z-drugs among people receiving opioid-agonist treatment: A national retrospective cohort study

Real-life effectiveness of sacubitril/valsartan in older Belgians with heart failure, reduced ejection fraction and most severe symptoms

The Evolving Role of Artificial Intelligence in Radiotherapy Treatment Planning—A Literature Review
