Reconciling Conflicting Evidence on the Performance of Propensity-Score Matching Methods

Jeffrey A. Smith, Petra E. Todd

doi:10.1257/aer.91.2.112

Abstract

There is a long-standing debate in the literature over whether social programs can be reliably evaluated without a randomized experiment. This paper summarizes results from a larger paper (Smith and Todd, 2001) that uses experimental data combined with nonexperimental data to evaluate the performance of alternative nonexperimental estimators. The impact estimates based on experimental data provide a benchmark against which to judge the performance of nonexperimental estimators. Our experimental data come from the National Supported Work (NSW) Demonstration and the nonexperimental data from the Current Population Survey (CPS) and the Panel Study of Income Dynamics (PSID). These same data were used in influential papers by Robert LaLonde (1986), James Heckman and Joseph Hotz (1989), and Rajeev Dehejia and Sadek Wahba (1998, 1999).

We focus on a class of estimators called propensity-score matching estimators, which were introduced in the statistics literature by Paul Rosenbaum and Donald Rubin (1983). Traditional propensity-score matching methods pair each program participant with a single nonparticipant, where pairs are chosen based on the degree of similarity in the estimated probabilities of participating in the program (the propensity scores). More recently developed nonparametric matching estimators described in Heckman et al. (1997, 1998a, b) use weighted averages over multiple observations to construct matches. We apply both kinds of estimators in this paper.

Heckman et al. (1997, 1998a, b) evaluate the performance of matching estimators using experimental data from the U.S. National Job Training Partnership Act (JTPA) Study combined with comparison group samples drawn from three sources. They show that data quality is a crucial ingredient to any reliable estimation strategy.
Specifically, the estimators they examine are found to perform well in replicating the results of the experiment only when they are applied to comparison group data satisfying the following criteria: (i) the same data sources (i.e., the same surveys or the same type of administrative data or both) are used for participants and nonparticipants, so that earnings and other characteristics are measured in an analogous way, (ii) participants and nonparticipants reside in the same local labor markets, and (iii) the data contain a rich set of variables relevant to modeling the program-participation decision. If the comparison group data fail to satisfy these criteria, the performance of the estimators diminishes greatly.

More recently, Dehejia and Wahba (1998, 1999) have used the NSW data (also used by LaLonde) to evaluate the performance of propensity-score matching methods, including pairwise matching and caliper matching. They find that these simple matching estimators succeed in closely replicating the experimental NSW results, even though the comparison group data do not satisfy any of the criteria found to be important in Heckman et al. (1997, 1998a). From this evidence, they conclude that matching approaches are generally more reliable than traditional econometric estimators.

In this paper, we reanalyze the NSW data in an attempt to reconcile the conflicting findings.

* Smith: Department of Economics, University of Western Ontario, Social Science Centre, London, ON, N6A 5C2, Canada, and NBER; Todd: Department of Economics, University of Pennsylvania, 3718 Locust Walk, Philadelphia, PA 19104, and NBER. Versions of the longer paper on which this paper is based have been presented at a 2000 meeting of the Institute for Research on Poverty in Madison, Wisconsin, at the Western Research Network on Employment and Training summer workshop (August 2000), at the Canadian International Labour Network meetings (September 2000), and at the University of North Carolina.
We thank James Heckman for comments on this paper, and we thank Robert LaLonde for comments and for providing us with the data from his 1986 study. We thank Rajeev Dehejia for providing us with information helpful in reconstructing the samples used in the Dehejia and Wahba (1998, 1999) studies. Jingjing Hsee and Miana Plesca provided excellent research assistance.
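
The two families of matching estimators contrasted in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Gaussian kernel, the default bandwidth, and the assumption that propensity scores have already been estimated are all illustrative choices. Each unit is represented as a hypothetical `(propensity_score, outcome)` pair.

```python
import math


def nearest_neighbor_att(treated, comparison):
    """Single nearest-neighbor matching (with replacement) on the score.

    `treated` and `comparison` are lists of (propensity_score, outcome)
    pairs. Returns the mean of participant outcome minus matched
    nonparticipant outcome. Illustrative only.
    """
    diffs = []
    for p_t, y_t in treated:
        # Pair the participant with the single closest nonparticipant.
        _, y_match = min(comparison, key=lambda unit: abs(unit[0] - p_t))
        diffs.append(y_t - y_match)
    return sum(diffs) / len(diffs)


def kernel_att(treated, comparison, bandwidth=0.1):
    """Kernel matching: each participant is compared with a weighted
    average of many nonparticipants, with weights that decline in the
    propensity-score distance. The Gaussian kernel and the bandwidth
    value are assumptions made for this sketch.
    """
    diffs = []
    for p_t, y_t in treated:
        weights = [
            math.exp(-0.5 * ((p_c - p_t) / bandwidth) ** 2)
            for p_c, _ in comparison
        ]
        total = sum(weights)
        if total == 0.0:
            # No comparison unit close enough to receive weight; skip.
            continue
        y_hat = sum(w * y_c for w, (_, y_c) in zip(weights, comparison)) / total
        diffs.append(y_t - y_hat)
    if not diffs:
        raise ValueError("no participant has comparison units within reach")
    return sum(diffs) / len(diffs)
```

With an experimental impact estimate in hand, the bias of either nonexperimental estimate is simply its difference from that benchmark, which is the comparison the abstract describes. Shrinking the bandwidth makes the kernel estimator behave increasingly like single nearest-neighbor matching.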

Full Text