Abstract
Under cohort sampling designs, additional covariate data are collected on cases of a specific type and a randomly selected subset of noncases, primarily for the purpose of studying associations with a time-to-event response of interest. Once such data are available, interest may arise in reusing them to study associations between the additional covariate data and a secondary non-time-to-event response variable, usually collected for the whole study cohort at the outset of the study. Following earlier literature, we refer to such a situation as secondary analysis. We outline a general conditional likelihood approach for secondary analysis under cohort sampling designs and discuss the specific situations of case-cohort and nested case-control designs. We also review alternative methods based on full likelihood and inverse probability weighting. We compare the alternative methods for secondary analysis in two simulated settings and apply them in a real-data example.
Highlights
Cohort sampling designs are two-phase epidemiological study designs. In the first phase, information on time-to-event outcomes of interest over a follow-up period and some basic covariate data are collected on the whole study group, referred to as a cohort. In the second phase, additional covariate data that are more expensive or difficult to obtain are collected only on a subset of the study cohort.
Examples are the case-cohort [1–3] and nested case-control [4, 5] designs. Such designs are applied for the purpose of studying associations between the time-to-event outcomes and the covariates collected in the second phase.
Conditional likelihood inference under cohort sampling designs has been studied previously for the analysis of the primary time-to-event outcome by Langholz and Goldstein and by Saarela and Kulathinal; here, we extend these methods to the secondary analysis setting.
Summary
Cohort sampling designs are two-phase epidemiological study designs. In the first phase, information on time-to-event outcomes of interest over a follow-up period and some basic covariate data are collected on the whole study group, referred to as a cohort. In the second phase, additional covariate data that are more expensive or difficult to obtain are collected only on a subset of the study cohort. Conditional likelihood inference under cohort sampling designs has been studied previously for the analysis of the primary time-to-event outcome by Langholz and Goldstein and by Saarela and Kulathinal; here, we extend these methods to the secondary analysis setting. Additional covariate data (here, the lactase persistence genotype Z_i) are collected only on the second-phase study group O ≡ {i : R_i = 1} ⊆ C, specified by the inclusion indicators R_i ∈ {0, 1}, analogously to the survey response/nonresponse setting of Rubin [21]. Observed data likelihoods may become sensitive to misspecification of the model for the response variable: the missing data can act as extra parameters, and the actual model parameters may lose their intended interpretation. This is a real problem especially in cohort sampling designs with a rare event of interest, since the proportion of uncollected covariate data in the study cohort may be very high.
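To make the notation concrete, the following minimal Python sketch simulates a case-cohort style second phase: it draws the inclusion indicators R_i, forms the second-phase group O, and reports the share of the cohort with uncollected covariate data when the event is rare. The cohort size, event risk, subcohort fraction, Bernoulli subcohort sampling, and the function names are all assumptions made for illustration; they are not taken from the paper. The design weights at the end show only the generic idea behind inverse probability weighting, not the specific estimator reviewed by the authors.

```python
import random


def case_cohort_sample(events, subcohort_fraction, rng):
    """Return inclusion indicators R_i for a simplified case-cohort design.

    Every case (events[i] == 1) is included. Each cohort member is also
    drawn into the subcohort independently with probability
    subcohort_fraction (Bernoulli sampling is an assumption of this sketch).
    """
    indicators = []
    for d in events:
        in_subcohort = rng.random() < subcohort_fraction
        indicators.append(1 if (d == 1 or in_subcohort) else 0)
    return indicators


def inclusion_probability(event, subcohort_fraction):
    """Second-phase inclusion probability under the sampling scheme above."""
    return 1.0 if event == 1 else subcohort_fraction


# Hypothetical settings: a rare event and a small subcohort.
rng = random.Random(2024)
n = 10_000                 # size of the cohort C
event_risk = 0.02          # probability of being a case
subcohort_fraction = 0.05  # subcohort sampling probability

events = [1 if rng.random() < event_risk else 0 for _ in range(n)]
R = case_cohort_sample(events, subcohort_fraction, rng)

# Second-phase study group O = {i : R_i = 1}.
O = [i for i in range(n) if R[i] == 1]

# Share of the cohort for which the covariate Z_i was never collected.
missing_fraction = 1 - len(O) / n

# Design weights 1 / P(R_i = 1) for the members of O.
weights = {i: 1.0 / inclusion_probability(events[i], subcohort_fraction) for i in O}

print(f"second-phase group size: {len(O)} of {n}")
print(f"fraction with uncollected covariate data: {missing_fraction:.3f}")
```

Under these assumed settings, the expected share of cohort members with uncollected covariate data is 1 − (0.02 + 0.98 × 0.05) ≈ 0.93, which illustrates why a rare event of interest leaves most of the cohort without second-phase data.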