Pooling stated and revealed preference data in the presence of RP endogeneity

John Paul Helveston, Elea McDonnell Feit, Jeremy J Michalek

doi:10.1016/j.trb.2018.01.010

Open Access

Abstract

Pooled discrete choice models combine revealed preference (RP) data and stated preference (SP) data to exploit advantages of each. SP data are often treated with suspicion because consumers may respond differently in a hypothetical survey context than they do in the marketplace. However, models built on RP data can suffer from endogeneity bias when attributes that drive consumer choices are unobserved by the modeler and correlated with observed variables. Using a synthetic data experiment, we test the performance of pooled RP–SP models in recovering the preference parameters that generated the market data under conditions that choice modelers are likely to face, including (1) when there is potential for endogeneity problems in the RP data, such as omitted variable bias, and (2) when consumer willingness to pay for attributes may differ from the survey context to the market context. We identify situations where pooling RP and SP data does and does not mitigate each data source’s respective weaknesses. We also show that the likelihood ratio test, which has been widely used to determine whether pooling is statistically justifiable, (1) can fail to identify the case where SP context preference differences and RP endogeneity bias shift the parameter estimates of both models in the same direction and magnitude and (2) is unreliable when the product attributes are fixed within a small number of choice sets, which is typical of automotive RP data. Our findings offer new insights into when pooling data sources may or may not be advisable for accurately estimating market preference parameters, including consideration of the conditions and context under which the data were generated as well as the relative balance of information between data sources.
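For readers unfamiliar with the likelihood ratio test mentioned above, the following is a minimal sketch of its commonly used form for pooling decisions, not the authors' implementation. It assumes the RP-only, SP-only, and pooled models have already been estimated, and that the test's degrees of freedom equal the number of parameters in the two separate models minus the number in the pooled model. The function names and all log-likelihood values and parameter counts are hypothetical, chosen only for illustration.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with df degrees of freedom.

    Uses the power series for the regularized lower incomplete gamma function;
    adequate for a sketch, not tuned for extreme arguments.
    """
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= half_x / (a + n)
        total += term
    log_prefix = a * math.log(half_x) - half_x - math.lgamma(a)
    return max(0.0, 1.0 - total * math.exp(log_prefix))


def pooling_lr_test(ll_rp, ll_sp, ll_pooled, k_rp, k_sp, k_pooled, alpha=0.05):
    """Likelihood ratio test of the restrictions imposed by pooling.

    Null hypothesis: the parameters shared in the pooled model are equal
    across the RP and SP data. Rejecting it argues against pooling.
    """
    lr = -2.0 * (ll_pooled - (ll_rp + ll_sp))
    df = k_rp + k_sp - k_pooled
    p_value = chi2_sf(lr, df)
    return lr, df, p_value, p_value < alpha


# Hypothetical inputs: maximized log-likelihoods and parameter counts.
lr, df, p_value, reject = pooling_lr_test(
    ll_rp=-1200.0, ll_sp=-2300.0, ll_pooled=-3506.0,
    k_rp=5, k_sp=5, k_pooled=6,
)
print(f"LR = {lr:.2f}, df = {df}, p = {p_value:.4f}, reject pooling: {reject}")
```

As the abstract cautions, a failure to reject in such a test does not by itself show that the pooled estimates recover market preferences, since biases in the two data sources can offset each other.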

Full Text