The extant literature using household scanner data to estimate consumer choice models has identified two key sources of bias in estimated mean responses to marketing variables. Omitted heterogeneity may bias mean responses towards zero. At the same time, omitted time-varying characteristics of alternatives that influence consumer choices may also bias mean responses towards zero if these characteristics are correlated with observed factors such as price (the endogeneity bias). Both issues have been well recognized, and methods have been proposed to address them using household scanner panel data. However, when estimating a choice model with these data at the SKU or UPC level, one may not observe choices for each item in each of the time periods under consideration. Without such information, one cannot control for item- and time-period-specific unmeasured characteristics, as the data contain no information on alternatives during periods in which none of the panelists purchases them. In general, when a product category has many alternatives, each with a fairly small share, the household sample may not contain sufficient choices for each alternative, which impairs the ability to control for endogeneity with household data. In contrast, because aggregate store-level data are the true aggregation of purchases by all households visiting the store, they contain the time-period-specific, item-level information required to account for endogeneity, as long as each item has some sales in each time period. Given the relative merits of household data for estimating the distribution of heterogeneity and of store-level data for addressing the endogeneity problem, we propose an integrated estimation procedure that uses the information in both sources.
Our approach provides consistent estimates of the mean responses to marketing variables and the heterogeneity distribution and also controls for potential endogeneity due to correlation between unmeasured item-level characteristics and prices.