Introduction to Survival Analysis

Kristin L. Sainani

doi:10.1016/j.pmrj.2016.04.003

Abstract

Survival analysis is a set of statistical tools for analyzing time-to-event outcomes. Time-to-event variables record both whether participants had a binary outcome (eg, died/survived) and when events occurred, thus allowing researchers to calculate rates rather than just proportions. By focusing on rates, survival analysis methods (1) increase statistical power, (2) handle unequal follow-up times, and (3) permit the use of time-changing predictor variables. When planning cohort studies or randomized trials, researchers should consider collecting data on the timing of events to make survival analysis possible. This article reviews common statistical tests and graphics from the survival analysis toolkit. For the purposes of illustration, I generated a mock data set on stress fracture occurrence in 100 male and 100 female runners followed prospectively for 2 years. The data set contains 2 predictors: gender and bone mineral density (BMD) Z scores. Fifty-six runners had a stress fracture during the study. The remaining 144 participants are said to be “censored,” meaning they were fracture free at the time they exited the study. Time-to-event outcomes involve 2 variables: (1) a binary variable that records whether or not a participant had the event of interest; and (2) a continuous variable that records the length of time the participant was in the study until he/she had the event or was censored. Censoring: Participants are said to be “right censored” if they exit the study prior to having the event of interest. Although “left censoring” is also possible—meaning participants enter the study after having the event of interest—this type of censoring is rarely encountered in practice and will not be addressed in this article. Figure 1 shows example observations from the mock data set. The time-to-event variable is stored in 2 columns: months and fracture. For example, the runner with ID No. 64 has months = 1 and fracture = 0, which means she dropped out of the study after 1 month and was fracture free up until this point. As with most prospective studies, the mock data set contains losses to follow-up—26 participants dropped out before 1 year, and an additional 19 participants dropped out before 2 years. Survival analysis adeptly handles these varying follow-up times: for example, a participant who is followed up for only 1 month contributes just 1 month of follow-up to the analysis. Sample data from the mock data set showing the 10 participants with the shortest follow-up times. The data set contains 5 variables: participant ID number (id), gender (1 = female, 0 = male), bone mineral density Z score (bmdz), and the 2 components of the time-to-event outcome: months in the study and fracture (1 = fractured, 0 = censored). The simplest statistics for time-to-event data are incidence rates and rate ratios. Incidence rates give the average rate at which events occur, whereas rate ratios compare the rates in different groups. To calculate a rate, we divide the number of events (eg, fractures) by the total number of “person-years.” One person-year could be 1 person followed up for 1 year, 2 people followed up each for a half year, or several people followed up for varying amounts of time that sum up to 1 year. In this study, the average time until fracture or censoring for the 200 participants was 17.1 months, or 1.425 years, and thus the total person-years was 200 * 1.425 = 285 person-years. The incidence rate of stress fracture was thus 56/285 = 0.196 fractures per 1 person-year, or equivalently, 1.96 fractures per 10 person-years. Person-year: A person-year is the denominator for an incidence rate. It is calculated by multiplying the total number of people at risk for an event by the average time they were at risk. The rate ratio indicates that women were sustaining fractures at triple the rate of men. Because incidence rates explicitly incorporate time, they provide more information than proportions. For example, 15% of men sustained fractures (Table 1), but this proportion would have been higher had the men been followed up longer (because more fractures would have had time to occur). Also, proportions can only be compared fairly when groups have identical follow-up times, which is rarely the case. For example, men accumulated more person-years than did women in this study, and thus dividing the proportion of fractures in women versus men—41%/15% = 2.73—provides an underestimation of the relative risk of fractures. The best-known method for visualizing time-to-event data is the Kaplan-Meier survival curve. Kaplan-Meier curves are fairly intuitive. The curves start at 100% and go down as events accumulate. Figure 2 shows the Kaplan-Meier curves for men and women (left panel) and the Kaplan-Meier curves when participants are divided into tertiles by bone density (right panel). Without knowing much about the underlying mathematics, it is fairly easy to discern that women sustain fractures more quickly than do men and that persons with low bone density sustain fractures more quickly than do persons with higher bone density. Left panel: Kaplan-Meier survival curves for men (blue) and women (red). The 2 curves are statistically distinct according to a log-rank test (P < .0001). Right panel: Kaplan-Meier survival curves by tertile of bone mineral density (red = lowest third, blue = middle third, and green = highest third). The P value from the log-rank test is .07, indicating that at least one curve may differ at a borderline level of significance. Survival curve: A survival curve is a curve that displays the probability of “surviving” event-free as time passes. The curve starts at 100% (everyone is event-free at baseline) and decreases as events occur. Specifically, the graphs display the probability of “surviving”—that is, avoiding the event of interest—in a given period. For example, the survival curve for men is 88% at 12 months, meaning that men had an 88% chance of remaining fracture free in the first year, or, equivalently, a 12% risk of sustaining a fracture in the first year. The log-rank test is used to formally compare the survival curves in different groups. According to the log-rank test, the curves for men and women are statistically distinct (P < .0001). If there was no censoring, the Kaplan-Meier survival probabilities would be trivial to calculate. For example, if the study starts with 100 men and 2 sustain a fracture in the first month, 10 in the first year, and 15 in the first 2 years, then the 1-month, 1-year, and 2-year survival probabilities would be 98%, 90%, and 85%, respectively. What makes the math trickier is the need to account for losses to follow-up. Table 2 shows the math for men for the first 10 months of the study, and Figure 3 shows the corresponding Kaplan-Meier curve. Men sustained fractures at 1, 2, 7, 8, 9, and 10 months, and thus the survival curve steps down at each of these time points. Because we calculate a new probability each time an event occurs, we have the opportunity to update our numbers to reflect dropouts. For example, the survival probability for month 7 has a denominator of 87 rather than 100 (prior to this month, 3 men sustained a fracture and 10 dropped out, and thus only 87 are still “at risk” of fracture). Once a man is censored, he is no longer included in the calculation of any subsequent survival probabilities; in this way, he contributes to the analysis only for the time for which he was followed up. Kaplan-Meier curve for the first 10 months for men. See Table 2 for details about how the survival probabilities are calculated. Kaplan-Meier methods are limited because they cannot handle continuous predictors (such as bone density measured as a continuous variable) or account for multiple predictors at once. Thus multivariate regression techniques are needed. Multivariate regression models for time-to-event outcomes estimate the incidence rate of the outcome for a given combination of predictors. To be precise, regression models estimate the instantaneous incidence rate, also known as the hazard rate. By modeling the hazard rate, we are recognizing that the incidence rate may vary over time. Hazard rate: The hazard rate is an instantaneous incidence rate. The hazard rate gives the probability of having the event in the exact next instant; it can change over time. If the hazard rate changes in a mathematically predictable way over time (eg, stays the same, strictly increases, or strictly decreases), then we can model it fully. The simplest choice is to assume that the hazard rate stays constant. For our mock data set, the model would be: ln (hazard rate of fracture) = intercept + βbmd * (BMD Z score) + βisFemale * (1 if female, 0 if male). For mathematical reasons, we model the natural log of the hazard rate, or ln(hazard rate). When I fit this model to the data, I get: ln (hazard rate) = −2.39 − 0.3 * (BMD Z score) + 1.08 * (1 if female, 0 if male). The model gives predicted hazard rates for various groups. For example, men with average bone density (Z = 0) have a predicted ln(hazard rate) of −2.39, and, thus, predicted hazard rate of exp−2.39 = 0.91 events per 10 person-years. Similarly, women with average bone density (Z = 0) have a predicted hazard rate of exp−2.39 + 1.08 = 2.69 events per 10 person years. Accordingly, the hazard ratio for women versus men is 2.69/0.91 = 2.94. (Equivalently, we can get the hazard ratio by exponentiating the β coefficient for gender: exp1.08 = 2.94.) These estimates are slightly different from the incidence rates and rate ratio we calculated earlier (see Table 1) because they have been adjusted for bone density. Hazard ratio: The hazard ratio is a measure of relative risk calculated by dividing 2 hazard rates. The hazard ratio is similar to the rate ratio but reflects instantaneous rather than average rates. Figure 4 displays our model graphically. The left panel shows a graph of the hazard rates for women and men over time. When the hazard rates are constant, the corresponding survival curves will follow an exponential function, as shown in the right panel. The survival curves look like a smoothed version of the Kaplan-Meier curves (compare with Figure 2, left panel). They are smooth because they are fit to a particular mathematical function (in this case, an exponential). We call models that assume a particular mathematical function “parametric.” Left panel: Hazard rates for men (blue) and women (red), with average bone density (Z = 0) based on an exponential regression model that includes both gender and bone mineral density as predictors. Right panel: Corresponding survival curves for men (blue) and women (red). When the hazard rate is constant, the survival curves have an exponential shape. Parametric regression models require us to make assumptions about the behavior of the hazard rate over time. Another type of regression, Cox regression, cleverly avoids any assumptions about the hazard rate by avoiding estimating it altogether. Cox regression is the most commonly used regression model for time-to-event data. The basic model is the same as before: ln (hazard rate) = intercept + βbmd * (BMD Z score) + βisFemale (1 if female, 0 if male). The intercept is the baseline hazard rate, which could change as a function of time and thus is potentially complicated. Cox regression simply avoids estimating the intercept. Because the aim of most studies is to estimate the effect of predictors (eg, BMD and gender), estimating the β coefficients is often sufficient. When I fit a Cox regression model to our data, I get: ln (hazard rate) = intercept (not estimated) −0.28 * (BMD Z score) + 1.08 * (1 if female, 0 if male). Exponentiating the β coefficients for bone density and gender gives the adjusted hazard ratios for these predictors. The hazard ratios estimated by Cox regression are very similar to those estimated by exponential regression (Table 3). Female gender is associated with about a tripling of the fracture rate, and a bone density that is 1 standard deviation higher is associated with about a 25% reduction in the fracture rate. Because Cox regression is so commonly used, it is worth discussing some features in more detail. Because hazard rates can change over time, hazard ratios can change too. For example, the relative difference in the fracture rates between women and men could narrow or grow over time. If this is the case, then a single hazard ratio doesn't fully capture the effect of gender. Thus, for each predictor, we need to test whether the hazards for the different groups are proportional over time. Unfortunately, we cannot reliably graph the hazard rates (because they are complicated and can change every instant). However, the assumption of proportional hazards can be tested in several other ways. A common approach is to examine the log-log survival plots, which are just a transformation of the Kaplan-Meier survival plots and are easily generated in most statistical packages. If these curves are parallel, this guarantees that the underlying hazard rates are proportional. For example, Figure 5 shows the log-log survival plots for gender and bone density tertiles. The lines are reasonably parallel for both predictors, and thus the assumption of proportional hazards is met. A: Log-log-survival plot for gender. B: Log-log survival plot for tertile of bone mineral density. If the log-log survival curves are roughly parallel (as is the case for both graphs here), this indicates that the proportional hazards assumption is met. When proportional hazards is violated, researchers need to consider alternative models—such as models that allow the hazard ratio to change over time—which may require help from a statistician. As with linear regression, Cox regression models the outcome—here the ln(hazard)—as a linear combination of predictors. Thus, we are assuming that continuous predictors have a linear relationship with the ln(hazard). Researchers need to check this assumption. Because it is too complicated to directly graph the hazard rate, residual plots are used. An example of a residual plot for testing linearity is provided in Figure 6. The Martingale residual plot can be used to test the linearity assumption for continuous predictors. Here, I obtained the Martingale residuals from a Cox regression model that omitted bone mineral density Z scores. I then graphed these Martingale residuals against bone mineral density Z score and superimposed a smoothing line. This line appears reasonably straight, which indicates that the bone mineral density has a linear relationship with the outcome. One of the most remarkable features of Cox regression is that it can accommodate time-changing predictors. For example, if we had measured the participants' bone density at baseline, 6 months, 1 year, and 1.5 years in the study, we could incorporate all 4 values when estimating the hazard ratio for bone density. In contrast, if we analyzed our data with logistic regression, we could only incorporate a single value for bone density (such as the baseline value). Logistic regression: Logistic regression is a regression model for binary outcomes. It compares the proportions of events in different groups and yields adjusted odds ratios. Cox regression can handle time-changing predictors because—as in the Kaplan-Meier methods—values for each predictor are updated every time there is an event. Thus, if a fracture occurs at 7 months, we can specify that the 6-month bone density values be used when modeling this event, but if a fracture occurs at 12.5 months, we can specify that the 1-year bone density values be used. The resulting hazard ratio for bone density will then reflect the most recent measurement of bone density. In this way, we increase statistical power/precision and more fully use the available data. Until now, this article has only focused on methods that analyze the time to first event—for example, the first stress fracture. However, Cox regression can handle recurrent events. For example, if some of our runners had multiple stress fractures, several potential approaches might be used to include these repeated fractures in the analysis. For example, in the counting process model, persons who have a first fracture are allowed to re-enter the risk set once their initial fracture has healed. The model corrects for the fact that the same person may then be contributing multiple (correlated) observations to the analysis. For further reading on modeling repeated events, see the article by Guo and colleagues 1. If researchers plan to use survival analysis, they need to collect data on the timing of events. Thus, they should ask about and record the dates on which events occurred. Exact dates are optimal, but survival analysis is still possible if time has been measured more crudely (eg, the month of occurrence or even the year of occurrence for multi-year studies). If researchers want to consider time-changing predictors, they should also plan to measure these variables at regular intervals. Survival analysis methods are readily implemented in most standard statistical packages. However, researchers are advised to consult with a statistician, particularly when dealing with time-changing predictors, predictors that violate proportional hazards, or repeated events. Several other situations that were not discussed in this article also warrant help from a statistician, including the existence of “competing events,” such as death from cancer versus death from other causes. Finally, for studies that span decades, researchers should consider using age rather than time in the study as the time scale. Again, help from a statistician may be required. When collecting longitudinal data on binary outcomes, researchers should consider using survival analysis methods, including generating incidence rates, Kaplan-Meier curves, and adjusted hazard ratios. These results are more informative and incorporate more information than proportions, risk ratios, or odds ratios. When running Cox regression models, researchers should test for proportional hazards, as well as for linearity (for continuous predictors). Extensions of the Cox model include the ability to accommodate time-changing predictors and repeated events, but implementing these aspects may require help from a statistician.

Full Text