We aimed to generate a synthetic sample (SS) of 1 million individuals that reflects the characteristics of the population recorded in the Health Survey for England (HSE). We used data from the HSE to determine the age- and gender-dependent distributions of continuous risk factors (height, weight, BMI, systolic blood pressure, total and HDL cholesterol and their ratio, number of cigarettes/day and units of alcohol/week) and the prevalence of binary risk factors (smoking status, diabetes). Spearman rank correlations, including age and gender, were determined for these risk factors. A table of normally distributed random numbers was generated, and Cholesky decomposition was used to replicate the observed Spearman rank correlations in that table. Rank correlations that included binary variables were recalibrated to adjust for the numerous tied values. The sample was then generated by a reverse look-up of the gamma distribution value at the random percentile for continuous variables, or by setting a binary variable to 1 when the random percentile fell below the prevalence threshold. Differences between the rank correlation coefficients in the SS and the HSE sample were no more than 0.5% for any continuous variable. The prevalence of binary factors in the SS closely matched that in the HSE sample. Smoking prevalence was 18.8% and 16.7% in the SS versus 18.4% and 16.5% in the HSE sample, for males and females respectively. Prevalence of diabetes in the SS was 13.3% and 7.7% versus 13.2% and 7.8%, and prevalence of cardiovascular disease was 17.6% and 14.1% versus 18.2% and 14.6%. Comparing the 25th, 50th and 75th quantiles, the maximum differences between the original and synthetic values for BMI and TC/HDL ratio were 0.6 kg/m² and 0.3 respectively. Our new approach generates large synthetic samples with risk factor distributions closely matching those of the real HSE population. Such a sample can be used to model the likely impact of new therapies or to predict mortality.
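The generation steps described above (correlated normal random numbers via Cholesky decomposition, conversion to percentiles, then a gamma reverse look-up or a prevalence threshold) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `synthetic_sample`, the two-variable setup, and every parameter value (correlation 0.3, gamma shape 25 and scale 1.1, prevalence 0.18) are assumptions chosen for illustration and are not taken from the HSE. The sketch also applies the target correlation directly to the normal variates and omits the recalibration for tied values in binary variables.

```python
import numpy as np
from scipy import stats


def synthetic_sample(n, corr, gamma_shape, gamma_scale, prevalence, seed=0):
    """Draw n individuals with one continuous and one binary risk factor.

    Hypothetical helper for illustration only; all arguments are assumed values.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, 2))       # table of independent standard normals
    chol = np.linalg.cholesky(corr)       # lower-triangular L with L @ L.T == corr
    z_corr = z @ chol.T                   # columns now correlated approximately as corr
    u = stats.norm.cdf(z_corr)            # random percentiles in (0, 1)

    # Continuous factor: reverse look-up (inverse CDF) of the gamma distribution.
    continuous = stats.gamma.ppf(u[:, 0], a=gamma_shape, scale=gamma_scale)
    # Binary factor: set to 1 when the percentile falls below the prevalence threshold.
    binary = (u[:, 1] < prevalence).astype(int)
    return continuous, binary


# Illustrative inputs (assumed, not HSE estimates).
corr = np.array([[1.0, 0.3], [0.3, 1.0]])
bmi, smoker = synthetic_sample(
    200_000, corr, gamma_shape=25.0, gamma_scale=1.1, prevalence=0.18
)
print(round(bmi.mean(), 2), round(smoker.mean(), 3))
```

Because the binary factor is set to 1 at low percentiles, a positive entry in `corr` produces a negative association between the continuous factor and the binary indicator in this sketch; a real implementation would need to account for that sign convention, for example by negating the relevant correlation entries.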