In many settings, an analysis goal is the identification of a factor, or set of factors associated with an event or outcome. Often, these associations are then used for inference and prediction. Unfortunately, in the big data era, the model building and exploration phases of analysis can be time-consuming, especially if constrained by computing power (ie, a typical corporate workstation). To speed up this model development, we propose a novel subsampling scheme to enable rapid model exploration of clustered binary data using flexible yet complex model set-ups (GLMMs with additive smoothing splines). By reframing the binary response prospective cohort study into a case-control-type design, and using our knowledge of sampling fractions, we show one can approximate the model estimates as would be calculated from a full cohort analysis. This idea is extended to derive cluster-specific sampling fractions and thereby incorporate cluster variation into an analysis. Importantly, we demonstrate that previously computationally prohibitive analyses can be conducted in a timely manner on a typical workstation. The approach is applied to analysing risk factors associated with adverse reactions relating to blood donation.
Read full abstract