Abstract

A clear understanding of linear regression analysis is of fundamental importance to quantitative research. In this editorial, I briefly discuss some of the key concepts; a comprehensive treatment is available in many textbooks, such as that by Kutner and associates (Kutner MH, Nachtsheim CJ, Neter J. Applied Linear Regression Models. 4th ed. New York: McGraw-Hill/Irwin; 2004). Linear regression is used to describe the relationship of a continuous outcome measure to 1 or more explanatory or predictor measures. Consider a hypothetical research study on patients with an ophthalmologic disease comparing visual acuity outcomes (measured by logarithm of the minimal angle of resolution [logMAR]) for a novel therapy versus standard care. The simplest analysis of the data from this study is a 2-sample t test: the 100 novel therapy subjects (0.119 ± 0.162, mean ± standard deviation) had better visual acuity outcomes (lower mean logMAR) than the 100 subjects receiving usual care (0.179 ± 0.167), and the difference was statistically significant (difference in means ± standard error of −0.060 ± 0.023, P = .010 from the 2-sample t test).

Equivalently, a linear regression can be used for this analysis. The individual values of logMAR (Y) are regressed on an indicator for group membership (X = 1 for subjects receiving the novel therapy and X = 0 for those receiving usual care) to obtain the best-fit linear regression model

Y = b0 + b1 X

where b0 is the constant or intercept of the regression line (ie, the mean of Y when X is equal to 0) and b1 is the average increment in Y associated with a 1-unit increase in X. The results of the linear regression are shown in Table 1.

TABLE 1. Parameter Estimates, Standard Errors, and P Values From the Linear Regression of Visual Acuity (logMAR) on Treatment Group

Parameter          Estimate (SE)     P Value
b0 (intercept)     0.179 (0.016)     <.001
b1 (treatment*)    −0.060 (0.023)    .010

*1 for novel therapy, 0 for usual care.

To interpret these results in the context of the visual acuity study, the mean logMAR for usual care was 0.179 (b0), and novel therapy subjects had logMAR values that averaged 0.060 less (b1 = −0.060) than those of usual care subjects (P = .010). Inference for this parameter (ie, the standard error of the difference and the associated P value) is exactly the same as that produced by the 2-sample t test.
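As a concrete illustration (not part of the original editorial), the sketch below uses Python with NumPy, SciPy, and statsmodels to simulate data with roughly the summary statistics reported above and to verify that the 2-sample t test and the simple linear regression on a group indicator yield the same inference. The variable names and simulated values are assumptions made purely for illustration; they are not the study data.

```python
# Minimal sketch: 2-sample t test versus simple linear regression on an indicator.
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated logMAR outcomes: 100 novel-therapy and 100 usual-care subjects,
# drawn from normal distributions with the summary statistics quoted above.
logmar_therapy = rng.normal(loc=0.119, scale=0.162, size=100)
logmar_usual = rng.normal(loc=0.179, scale=0.167, size=100)

# Classical 2-sample t test (pooled variance).
t_stat, p_val = stats.ttest_ind(logmar_therapy, logmar_usual, equal_var=True)
print(f"t test: t = {t_stat:.3f}, P = {p_val:.3f}")

# Equivalent linear regression: Y = b0 + b1*X, with X = 1 for novel therapy.
y = np.concatenate([logmar_therapy, logmar_usual])
x = np.concatenate([np.ones(100), np.zeros(100)])
X = sm.add_constant(x)          # adds the intercept column (b0)
fit = sm.OLS(y, X).fit()
print(fit.summary())            # b1, its SE, and its P value match the t test
```

Because the indicator takes only the values 0 and 1, the fitted slope b1 is exactly the difference in group means, which is why the two procedures agree.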
The linear regression presentation of this simple analysis does not add information to the t test results, so why go to the extra trouble? The answer is that other factors (eg, age and duration of disease) are likely to be associated with visual acuity; if these factors are not evenly distributed between the treatment groups, then differences in these factors may be driving the apparent difference between the treatments. Indeed, for the example data, the novel therapy subjects were on average younger and had shorter disease durations than the usual care subjects. Linear regression can be used to obtain an estimate of the treatment group difference after adjusting for age and disease duration. The logMAR values are now regressed on the group indicator (X = 1 for novel therapy and X = 0 for usual care), age (subject age at the time of the study, in years), and DxDur (disease duration in years). The model is

Y = b0 + b1 X + b2 Age + b3 DxDur

where b1 is now the difference in mean logMAR for novel therapy compared with usual care, all other things being equal; put another way, b1 is the average difference in logMAR between novel therapy and usual care subjects of comparable age and disease duration. The results of this regression are shown in Table 2.

TABLE 2. Results From the Linear Regression of Visual Acuity (logMAR) on Treatment Group, Age, and Disease Duration

Parameter            Estimate (SE)     P Value
b0 (intercept)       0.175 (0.015)     <.001
b1 (treatment*)      −0.053 (0.022)    .015
b2 (age†)            0.004 (0.001)     <.001
b3 (Dx duration†)    0.004 (0.002)     .022

*1 for novel therapy, 0 for usual care.
†Predictor was centered (ie, the mean was subtracted off) before entry into the regression model.

From Table 2, it can be seen that age and disease duration both have significant associations with visual acuity. However, the treatment difference remains very similar: b1 is estimated to be −0.053, so even after adjusting for age and disease duration, novel therapy patients had logMAR values that were lower by 0.053 on average (P = .015) compared with usual care patients. Interpretation of the coefficients for age and disease duration is also straightforward. The estimates of b2 and b3 are both 0.004, which means that an additional year of age or of disease duration is associated with an increase in logMAR of 0.004. It is often more natural to describe the results in terms of increments of more than 1 unit: eg, a 10-year increase in age is associated with an increase in logMAR of 10 × 0.004 = 0.04, and a 5-year increase in disease duration is associated with an increase of 0.02 in logMAR. One final detail of interest is that both the age and disease duration variables were centered before analysis (ie, a new age variable was created by subtracting the mean age from each subject's age, and similarly for disease duration). Mathematically, the regression model has the same fit and the same parameter estimates for b1, b2, and b3 whether or not the predictors are centered. However, centering allows the parameter b0 to have a natural interpretation: without centering, it represents the average logMAR for subjects in the usual care arm (X = 0) with an age of 0 years and a disease duration of 0 years (clearly a meaningless quantity); with centering, it represents the average logMAR for usual care subjects with age and disease duration equal to their overall averages in the sample.
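The adjusted analysis described above can be sketched as follows, again on simulated data: the covariate distributions (a mean age of 65 years, gamma-distributed disease duration) and all variable names are arbitrary assumptions for illustration, and the model is fit with the statsmodels formula interface. The centering step mirrors the footnote to Table 2.

```python
# Minimal sketch: adjusted model Y = b0 + b1*X + b2*Age + b3*DxDur with centered covariates.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "treatment": np.repeat([1, 0], n // 2),            # 1 = novel therapy, 0 = usual care
    "age": rng.normal(65, 10, n),                      # hypothetical ages in years
    "dxdur": rng.gamma(shape=2.0, scale=3.0, size=n),  # hypothetical disease duration in years
})
# Simulate logMAR with a treatment effect plus age and duration effects.
df["logmar"] = (0.175 - 0.053 * df["treatment"]
                + 0.004 * (df["age"] - df["age"].mean())
                + 0.004 * (df["dxdur"] - df["dxdur"].mean())
                + rng.normal(0, 0.15, n))

# Center the continuous predictors so that the intercept refers to a usual care
# subject of average age and average disease duration.
df["age_c"] = df["age"] - df["age"].mean()
df["dxdur_c"] = df["dxdur"] - df["dxdur"].mean()

fit = smf.ols("logmar ~ treatment + age_c + dxdur_c", data=df).fit()
print(fit.params)     # b0, b1, b2, b3
print(fit.pvalues)
```

The coefficients on age_c and dxdur_c are per year, so multiplying them by 10 or 5 reproduces the kind of increments discussed in the text.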
Having fit a linear regression model, an experienced analyst will proceed to examine a variety of diagnostic measures and model fit statistics. Some of the more important techniques include computation of leverage (ie, outliers in the X variables), residual (ie, outliers in the Y variable), and influence (ie, a combination of leverage and residual) diagnostics; calculation of multicollinearity measures (ie, intercorrelation of the X variables); and graphical examination of a residual plot to assess nonlinearity and heteroscedasticity (eg, the spread of the residuals increasing for higher values of X). Although space limitations prohibit detailed discussion of these techniques, in brief, multicollinearity might be a concern here because age and duration of disease are moderately correlated. Computation of various diagnostics (not shown) indicates that the multicollinearity is not nearly severe enough to call the estimation results into question.
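The diagnostics listed above can be computed as sketched below, continuing from the hypothetical `fit` object and data frame of the previous sketch; this is illustrative only and assumes statsmodels and matplotlib are available.

```python
# Minimal sketch of leverage, residual, influence, multicollinearity, and residual-plot diagnostics.
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor

influence = fit.get_influence()
leverage = influence.hat_matrix_diag                    # leverage: outliers in the X variables
student_resid = influence.resid_studentized_external    # outliers in the Y variable
cooks_d = influence.cooks_distance[0]                   # influence: combines leverage and residual

# Multicollinearity: variance inflation factors for each predictor column.
X = fit.model.exog                                      # design matrix (includes the intercept)
vif = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]
print("VIF (treatment, age_c, dxdur_c):", vif)

# Residual plot to look for nonlinearity and heteroscedasticity.
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted logMAR")
plt.ylabel("Residual")
plt.show()
```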
Linear regression modeling is not used as frequently in medical research as logistic regression, because clinicians often prefer to dichotomize continuous outcomes. It can still be quite informative, though, to run a linear regression on the continuous outcome as a supplementary analysis. Furthermore, many of the important considerations in a regression analysis, including the leverage of individual observations (that is, the degree to which an observation is an outlier in the predictor variables) and the degree of multicollinearity of the predictors (that is, the extent to which the information contained in 1 predictor is available from the remaining predictors), are the same for the linear and the logistic regression model. Thus, if a particular logistic regression software package does not provide this information, the analyst can obtain it by running a linear regression with the same set of predictor variables.
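A brief sketch of that workaround, using the same hypothetical data frame as the earlier sketches: the continuous outcome is dichotomized at an arbitrary cutoff (0.3 logMAR, chosen purely for illustration), a logistic regression is fit for the primary analysis, and leverage and variance inflation factors are borrowed from a linear regression on the same predictors, since these diagnostics depend only on the predictor variables.

```python
# Minimal sketch: diagnostics for a logistic model obtained from a linear fit on the same predictors.
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

df["poor_va"] = (df["logmar"] >= 0.3).astype(int)   # dichotomized outcome (illustrative cutoff)

# Logistic regression for the primary (dichotomized) analysis.
logit_fit = smf.logit("poor_va ~ treatment + age_c + dxdur_c", data=df).fit()
print(logit_fit.params)

# Linear regression with the same predictors, used only for leverage and VIF.
ols_fit = smf.ols("logmar ~ treatment + age_c + dxdur_c", data=df).fit()
leverage = ols_fit.get_influence().hat_matrix_diag
X = ols_fit.model.exog
vif = [variance_inflation_factor(X, j) for j in range(1, X.shape[1])]
print("max leverage:", leverage.max(), "VIF:", vif)
```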
Many statistical methods can be viewed as special cases or extensions of linear regression analysis. The 2-sample t test, 1- and 2-way analysis of variance, and analysis of covariance all correspond to linear regression models. Most other regression techniques, including logistic, Poisson, and Cox regression, use the same type of additive modeling to include predictors in the model (ie, b0 + b1 X1 + b2 X2 + b3 X3 + …). Finally, longitudinal and clustered data modeling extends the linear regression model to allow for intrasubject correlation of multiple observations (eg, data on the left and right eyes of each subject). Understanding the linear regression model greatly facilitates the interpretation of these more complex techniques.

The author indicates no financial support. The author was involved in the design and conduct of the study; data collection, analysis, and interpretation; and preparation and review of the manuscript. The author would like to thank Fei Yu, UCLA Departments of Biostatistics and Ophthalmology, for helpful discussions related to the article.
