A Comment on the Utility of Recursive Partitioning

Brent A Blumenstein

doi:10.1200/jco.2005.12.908

Abstract

In clinical medicine, situations arise in which it is desirable to investigate relationships between an outcome and a set of putatively explanatory variables. At one extreme, the investigation can come from a well-designed study targeted specifically at the relationship between the outcome and a single variable, and at the other extreme, it can come from a convenience data set in which no single explanatory variable is of dominating interest. Optimally, the outcome of interest represents something directly related to patient benefit (such as survival time) or modification of the configuration of intervention, but limitations of study design or interpretation can lead to a focus on an indirect outcome measure or a surrogate of patient benefit. Of special interest are studies in which the analysis leads to a greater understanding of prognosis or diagnosis. Analyses of prognosis can lead to greater understanding of the applicability of treatment options, the stratification of patients into risk groups, or the estimation of expected outcome. Diagnosis is related to prognosis but is focused on a discrete outcome representing a patient status resulting from a diagnostic procedure and, therefore, is related to the configuration of intervention.
There are numerous useful analytic methods for data of this type, and often, multiple methods are applicable to the same data. The most common classes of methods are regression models (such as logistic or proportional hazards regression), artificial neural networks, and recursive partitioning. In the most general terms, regression methods are most appropriate when the objective is to quantify the relative contribution of the explanatory variables. Artificial neural networks are preferred by some when the goal is to predict or classify. Recursive partitioning methods provide insights into the data structure or useful data partitioning schemes.
Garzotto et al analyzed prostate cancer diagnosis data using a recursive partitioning methodology, specifically classification and regression tree analysis. Data from 1,173 men who had undergone a prostate biopsy were analyzed. The outcome was the existence of cancer in the biopsy (an outcome affecting intervention trajectory), and the putatively explanatory variables included prostate-specific antigen (PSA), PSA density, transrectal ultrasound features, age, prostate volume, family history, race, type of referral, history of vasectomy, and digital rectal examination results. Their goal was to build a decision tree, or algorithm, for deciding which patients should undergo prostate biopsy and to study how the algorithm suggested by their data differs from others. Prior data sets of this nature (and there are many) have been analyzed primarily using logistic regression and artificial neural networks. Garzotto et al also analyzed their data using logistic regression and found similar diagnostic sensitivity and specificity performance. The biggest difference between classification and regression tree analysis and logistic regression is in the manner of presentation of results and in the insights gained from the analysis.
This article provides an excellent example of the application of methods based on recursive partitioning. The results show how men in the common position of having a suspicion of prostate cancer can be partitioned into groups that are likely to have a positive biopsy and groups that are not. The analysis shows the relative importance of the explanatory variables, and the putative explanatory variables found not to contribute to the partitioning are identified. The reader is also shown how special difficulties are handled, such as skewness in the distribution of PSA values.
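To make the idea of recursive partitioning concrete, the following is a minimal sketch in Python, not the procedure used by Garzotto et al. It assumes a binary outcome, two hypothetical numeric variables, synthetic data generated for illustration only, the Gini impurity as the splitting criterion, and an arbitrary 0.5 cutoff on the observed positive rate in each terminal group. All function and variable names are illustrative.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 outcomes."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(rows, labels):
    """Return the (variable index, threshold) that most reduces the
    weighted Gini impurity, or None if no split improves on the parent."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0  # midpoint between adjacent observed values
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score - 1e-12:
                best, best_score = (j, t), score
    return best

def grow(rows, labels, depth=0, max_depth=3, min_size=20):
    """Recursively partition the data; each terminal group stores its
    size and its observed fraction of positive outcomes."""
    split = None
    if depth < max_depth and len(rows) >= min_size:
        split = best_split(rows, labels)
    if split is None:
        return {"leaf": True, "rate": sum(labels) / len(labels), "n": len(labels)}
    j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return {
        "leaf": False, "feature": j, "threshold": t,
        "left": grow([r for r, _ in left], [y for _, y in left],
                     depth + 1, max_depth, min_size),
        "right": grow([r for r, _ in right], [y for _, y in right],
                      depth + 1, max_depth, min_size),
    }

def predict_rate(tree, row):
    """Follow the splits down to a terminal group and return its positive rate."""
    while not tree["leaf"]:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["rate"]

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity and specificity; assumes both outcome classes are present."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Synthetic data (an assumption for illustration, not clinical data):
# variable 0 drives the outcome, variable 1 is pure noise.
rng = random.Random(0)
rows = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(600)]
labels = [1 if rng.random() < (0.8 if x0 > 5 else 0.1) else 0 for x0, _ in rows]

tree = grow(rows, labels)
predicted = [1 if predict_rate(tree, r) >= 0.5 else 0 for r in rows]
sens, spec = sensitivity_specificity(labels, predicted)
print("root split on variable", tree["feature"], "at", round(tree["threshold"], 2))
print("sensitivity", round(sens, 3), "specificity", round(spec, 3))
```

Because the sensitivity and specificity here are computed on the same data used to grow the tree, they are optimistic. In a diagnostic setting, the cutoff applied to each terminal group would normally be chosen to preserve sensitivity rather than fixed at 0.5.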
Finally, the authors illustrate the necessary step of model validation, but the type of validation performed is of the weakest variety because it used a 10% holdout of cases from the original data they collected. A stronger type, external validation (as opposed to the internal validation used here), uses data collected from other settings. The authors claim that the algorithm based on their partitioning would have reduced the number of unnecessary biopsies by 31.3% while maintaining high sensitivity. The direction of changes in the manner in which biopsy decisions are typically made

Journal of Clinical Oncology, Editorial, Volume 23, Number 19, July 1, 2005
