Regression analysis circa 1970 consisted of crude (by today's standards) computations, primarily hand plotting (if any) of results, and limited study of any alternatives to a single least squares fit to one's data. Enter one misspelled control word or punch one incorrect value and one's entire box of punched cards would have to be resubmitted to the batch-processing window of the central lab. After each submission, it took an hour to a day, depending on the priority assigned to the job, to receive fixed-formatted, newspaper-sized printed output that would be rubber-banded to the box of cards at the output table. Standard output included only one regression fit with regression coefficient estimates, F or t values for each, standard errors, an analysis of variance table, and possibly residuals. The alternative to this state-of-the-art computing was brutally tedious hand calculation using the few computational shortcuts available at the time, shortcuts that still required laborious work.

This setting provides a basis for appreciating the monumental contributions of the regression classics selected for inclusion in this special issue of Technometrics. Each has dramatically changed the practice of regression analysis as it is advocated today. These four articles are introduced chronologically.

Hoerl and Kennard's (1970a,b) companion articles, "Ridge Regression: Biased Estimation for Nonorthogonal Problems" and "Ridge Regression: Applications to Nonorthogonal Problems," are among the most controversial written on the practice of regression analysis. Hoerl (1962) initially sought an alternative to least squares because least squares estimates "often do not make sense when put into the context of the physics, chemistry, and engineering of the process which is generating the data" (Hoerl and Kennard 1970a, p. 55). Thus, ridge regression was not first formulated as an alternative to least squares with stated theoretically optimal properties. It was formulated to provide stable estimates when the normal equations were ill-conditioned because of collinear predictors (Hoerl 1962). The ridge-regression estimator, β̂(k) = (X'X + kI)⁻¹X'y, naturally evolved from the application of the Marquardt-Levenberg solution to ill-conditioned systems of linear equations, as was then being applied to the ridge analysis of response surfaces. Marquardt (1970) nicely drew all these connections together in the broader context of generalized inverse solutions of the normal equations.

Ridge-regression theory, as espoused in these first articles, is built around existence theorems proving that a properly chosen ridge parameter k guarantees smaller expected squared error than least squares (see also Theobald 1974). The authors also presented likelihood-based and Bayesian justifications for the ridge estimator. Beyond the existence theorems, simulation studies of methods for estimating the ridge parameter were reported in virtually every statistics journal (e.g., Hoerl, Kennard, and Baldwin 1975; Dempster, Schatzoff, and Wermuth 1977). In these simulations there were always some ridge estimators that performed best or comparably to the best of the estimators investigated.
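To make the estimator concrete, the following is a minimal computational sketch, not taken from the original articles; the function name, the use of Python/NumPy, and the assumption of a centered, scaled X and centered y are illustrative choices, with k = 0 recovering the least squares fit.

```python
import numpy as np

def ridge_estimate(X, y, k):
    """Ridge estimator beta_hat(k) = (X'X + kI)^{-1} X'y.

    Illustrative sketch: assumes the columns of X are centered and scaled
    and y is centered; k = 0 gives the ordinary least squares estimates.
    """
    p = X.shape[1]
    # Solve (X'X + kI) b = X'y directly; adding kI to X'X stabilizes the
    # ill-conditioned system that arises with collinear predictors.
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# Estimates are typically examined over a grid of k values to study stability:
# betas = [ridge_estimate(X, y, k) for k in np.linspace(0.0, 1.0, 21)]
```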
The major criticisms of ridge regression generally focus on two issues. First, the existence theorems used one form of squared error loss function and required nonstochastic ridge parameters; thus, they do not apply to the usual setting in which ridge parameters must be estimated from the data (e.g., Nelder 1972; Conniffe and Stone 1973). Second, the Bayesian or other assumptions needed for the ridge estimator to be optimal in a well-defined theoretical sense are unrealistic in practice, yet simulations often inadvertently impose these very assumptions (e.g., Draper and Van Nostrand 1979). Although ridge regression is widely used in the application of regression methods today, it remains as controversial as when it was first introduced (e.g., Draper and Smith 1998, chap. 17).

Mallows's (1973) derivation of the theoretical properties of the Cp statistic greatly strengthened the basis for applying this graphical method to the selection of better subsets in regression analyses where there are many possible predictor variables. Discussion of Mallows's Cp was first published by Gorman and Toman (1966), although Mallows had earlier discussed it twice in oral presentations. Gorman and Toman were primarily concerned with reducing the number of regression fits that had to be computed to identify better subsets (best subset algorithms were not yet available) because "when k exceeds 7 or 8 the number of computations becomes large even for an electronic computer" (p. 28). They advocated computing fractions of the 2ᵏ − 1 possible fits based on the magnitudes of t statistics from a fit to the complete set of predictors. This approach was one of several that attempted to identify the better subsets without