Size Of Sample Set Research Articles

Background: Machine learning algorithms hold potential for improved prediction of all-cause mortality in cardiovascular patients, yet have not previously been developed with high-quality population data. This study compared four popular machine learning algorithms trained on unselected, nation-wide population data from Sweden to solve the binary classification problem of predicting survival versus non-survival 2 years after first myocardial infarction (MI).

Methods: This prospective national registry study for prognostic accuracy validation of predictive models used data from 51,943 complete first MI cases as registered during 6 years (2006–2011) in the national quality register SWEDEHEART/RIKS-HIA (90% coverage of all MIs in Sweden) with follow-up in the Cause of Death register (> 99% coverage). Primary outcome was AUROC (C-statistic) performance of each model on the untouched test set (40% of cases) after model development on the training set (60% of cases) with the full (39) predictor set. Model AUROCs were bootstrapped and compared, correcting the P-values for multiple comparisons with the Bonferroni method. Secondary outcomes were derived when varying sample size (1–100% of total) and predictor sets (39, 10, and 5) for each model. Analyses were repeated on 79,869 completed cases after multivariable imputation of predictors.

Results: A Support Vector Machine with a radial basis kernel developed on 39 predictors had the highest complete cases performance on the test set (AUROC = 0.845, PPV = 0.280, NPV = 0.966) outperforming Boosted C5.0 (0.845 vs. 0.841, P = 0.028) but not significantly higher than Logistic Regression or Random Forest. Models converged to the point of algorithm indifference with increased sample size and predictors. Using the top five predictors also produced good classifiers. Imputed analyses had slightly higher performance.

Conclusions: Improved mortality prediction at hospital discharge after first MI is important for identifying high-risk individuals eligible for intensified treatment and care. All models performed accurately and similarly and because of the superior national coverage, the best model can potentially be used to better differentiate new patients, allowing for improved targeting of limited resources. Future research should focus on further model development and investigate possibilities for implementation.
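The evaluation steps named in the abstract above (AUROC on a held-out test set, bootstrapped model comparison, Bonferroni correction) can be illustrated with a small sketch. This is not the authors' code: the function names `auroc`, `bootstrap_auroc_diff` and `bonferroni` are hypothetical, the data are toy values, and the bootstrap shown is a plain case-resampling scheme chosen for illustration.

```python
import random

def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs in which the
    positive case gets the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auroc_diff(labels, scores_a, scores_b, n_boot=1000, seed=0):
    """Resample test cases with replacement and return the list of
    AUROC(model A) - AUROC(model B) differences (illustrative scheme)."""
    if len(set(labels)) < 2:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # skip resamples that lost one of the classes
        a = auroc(y, [scores_a[i] for i in idx])
        b = auroc(y, [scores_b[i] for i in idx])
        diffs.append(a - b)
    return diffs

def bonferroni(p_values):
    """Bonferroni correction: multiply each P-value by the number of
    comparisons and cap the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Toy data, not taken from the study.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auroc = auroc(toy_labels, toy_scores)
toy_adjusted = bonferroni([0.01, 0.02, 0.5])
```

With four algorithms compared pairwise there would be six comparisons, so each raw P-value would be multiplied by six under this correction.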

High-resolution digital soil sensing and mapping is an important and emerging new technology that helps meet the strong and growing global demand for high-resolution soil property data. However, the combination of geophysical sensing and pedometrical techniques to produce soil property maps is complex and requires a well-structured design, from the initial steps of data collection right through to final model validation. In this study, we compare different sampling design strategies – an extension of conditioned Latin hypercube sampling, fuzzy k-means sampling and response surface sampling – as a basis for predicting soil texture, soil organic carbon and soil pH-value at two soil depth intervals using electromagnetic induction (EM38DD and EM31) and gamma spectroscopy (U, K, Th) data. Two different sample set sizes, two different regression approaches (multiple linear least squares and random forests), as well as several resampling and independent validation approaches are compared. In addition to these real-world datasets, we also compared the investigated methods for two comparable simulated datasets.

Our accuracy estimation results reveal that an optimal combination of Latin hypercube sampling and random forest regression should be adopted. This is the case for both the real world examples as well as the two synthetic datasets. The analysis conducted indicates that this stems from optimized spread within the state space of the sensors.

Iterative LHS subsampling with increasing sample set sizes may potentially be a successful approach for incrementally analyzing and validating the model and thus can help reduce laboratory costs when a certain desired accuracy level is achieved.

Comparison between the different validation approaches reveals their complexity and highlights the necessity for adequate independent validation approaches. However, based on the findings of our study, we recommend ‘leave-group-out’ cross-validation and ‘.632 bootstrapping’ as the best estimates to use.

Finally, this study shows that there are complex interactions between sampling design, regression approaches and validation approaches, which can greatly influence the final soil property maps and their accuracy estimates.

Future work should focus on detailed analysis of Latin hypercube sampling and why it outperformed the other approaches. Therefore, comparisons with other sampling approaches should be conducted, as well as specific ‘sampling-for-validation’ approaches. Therefore we provide the simulated datasets as Supplementary reference material for future comparative analysis.
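For readers unfamiliar with the ‘.632 bootstrapping’ estimate recommended in the abstract above, a minimal sketch follows. It is an illustration under stated assumptions, not the authors' procedure: the names `fit_line`, `mse` and `bootstrap_632` are hypothetical, a one-predictor least-squares line stands in for the regression models, the data are toy values, and the out-of-bag error is averaged per bootstrap replicate, which is a simplification of the per-observation average often used.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my  # degenerate resample: fall back to the mean
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def mse(model, xs, ys):
    """Mean squared error of a (slope, intercept) model on the given data."""
    slope, intercept = model
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def bootstrap_632(xs, ys, n_boot=200, seed=0):
    """.632 bootstrap error: 0.368 * resubstitution error
    + 0.632 * mean out-of-bag error over bootstrap replicates."""
    rng = random.Random(seed)
    n = len(xs)
    err_resub = mse(fit_line(xs, ys), xs, ys)
    oob_errors = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        chosen = set(idx)
        oob = [i for i in range(n) if i not in chosen]
        if not oob:
            continue  # every case was drawn, so nothing is left to test on
        model = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        oob_errors.append(mse(model, [xs[i] for i in oob], [ys[i] for i in oob]))
    if not oob_errors:
        raise ValueError("no out-of-bag cases in any replicate")
    err_oob = sum(oob_errors) / len(oob_errors)
    return 0.368 * err_resub + 0.632 * err_oob

# Toy data, not taken from the study: a noise-free line and a noisy one.
toy_x = [float(i) for i in range(10)]
exact_y = [2.0 * x + 1.0 for x in toy_x]
noisy_y = [y + (0.5 if i % 2 else -0.5) for i, y in enumerate(exact_y)]
exact_estimate = bootstrap_632(toy_x, exact_y)
noisy_estimate = bootstrap_632(toy_x, noisy_y)
```

The weighting reflects that a bootstrap resample contains on average about 63.2% of the distinct original cases, so the optimistic resubstitution error and the pessimistic out-of-bag error are blended accordingly.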

Articles published on Size Of Sample Set

Predicting two-year survival versus non-survival after first myocardial infarction using machine learning and Swedish national register data

Examination of Different Item Response Theory Models on Tests Composed of Testlets

URA/SISA Analysis for GPS and Galileo to Support ARAIM

Estimating higher-order structure functions from geophysical turbulence time series: Confronting the curse of the limited sample size.

The mechanical strength of a ceramic porous hollow fiber

Weak Convergence Analysis of Asymptotically Optimal Hypothesis Tests

Incorporating limited field operability and legacy soil samples in a hypercube sampling design for digital soil mapping

Computational performance optimization of support vector machine based on support vectors

Nonparametric estimation of the finite time ruin probability in the classical risk model

Novel statistical methodology reveals that hip shape is associated with incident radiographic hip osteoarthritis among African American women

Survival Prognostic Factors of Male Breast Cancer in Southern Iran: a LASSO-Cox Regression Approach.

Microarray Gene Expression Data Classification Using Feature Selection and Naïve Bayes Classifier

Applying stability selection to consistently estimate sparse principal components in high-dimensional molecular data.

Saddlepoint-based bootstrap inference for the spatial dependence parameter in the lattice process

Adjusted Supremum Score‐Type Statistics for Evaluating Non‐Standard Hypotheses

Minimum-cost rapid-growing random trees for segmented assembly path planning

A reshaped approach for protein nanocrystal structure analysis from XFELs

Functional gene-set analysis does not support a major role for synaptic function in attention deficit/hyperactivity disorder (ADHD).

A comparison of calibration sampling schemes at the field scale

ShrinkBayes: a versatile R-package for analysis of count-based sequencing data in complex study designs.
