Open Access 29-12-2022

Shrinkage priors for high-dimensional demand estimation

Authors: Adam N. Smith, Jim E. Griffin

Published in: Quantitative Marketing and Economics | Issue 1/2023

Abstract

Estimating demand for large assortments of differentiated goods requires the specification of a demand system that is sufficiently flexible. However, flexible models are highly parameterized so estimation requires appropriate forms of regularization to avoid overfitting. In this paper, we study the specification of Bayesian shrinkage priors for pairwise product substitution parameters. We use a log-linear demand system as a leading example. Log-linear models are parameterized by own and cross-price elasticities, and the total number of elasticities grows quadratically in the number of goods. Traditional regularized estimators shrink regression coefficients towards zero which can be at odds with many economic properties of price effects. We propose a hierarchical extension of the class of global-local priors commonly used in regression modeling to allow the direction and rate of shrinkage to depend on a product classification tree. We use both simulated data and retail scanner data to show that, in the absence of a strong signal in the data, estimates of price elasticities and demand predictions can be improved by imposing shrinkage to higher-level group elasticities rather than zero.
Notes
We thank IRI for making the data available. All estimates and analysis in this paper based on data provided by IRI are by the authors and not by IRI. Accompanying R code is available at https://github.com/adam-n-smith/hierarchical-shrinkage.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

Measuring price and promotional effects from store-level transaction data is a mainstay of economics and marketing research. However, many challenges arise in model specification and inference when the product space is large. For example, demand models for many goods should be flexible so as to not impose strong prior restrictions on cross-price effects, but flexible models contain many parameters so some form of dimension reduction or regularization is needed to avoid overfitting. While there is a large literature on regularized estimation, existing methods typically assume that the underlying parameter vector is sparse and so shrinkage points are fixed at zero. In this paper, we demonstrate the value of non-sparse, data-driven shrinkage in high-dimensional demand models.
Our primary contribution is to develop Bayesian shrinkage priors for pairwise product substitution parameters that may not be sparse but instead conform to a low-dimensional structure. We propose a hierarchical extension of the class of global-local priors (Polson & Scott, 2010) that is parameterized using information from a product classification tree. Typical retail scanner panel data sets, such as those provided by IRI and Nielsen, include product classification tables where goods are partitioned into broad categories (e.g., “Salty Snacks”), subcategories (“Potato Chips”), brands (“Lay’s”), and ultimately UPCs (“Lay’s Classic Potato Chips 20 oz.”). We explicitly use this tree structure to parameterize: (i) prior means, so that substitution between two goods will depend on the nature of substitution between their respective subcategories and categories; (ii) prior variances, so that shrinkage can propagate down through the tree. We also consider different forms of shrinkage at each level of the tree—e.g., ridge (Hoerl & Kennard, 1970) vs. horseshoe (Carvalho et al., 2010)—and provide novel results on the shrinkage behavior of the induced marginal priors.
We apply the proposed hierarchical shrinkage priors to a log-linear demand system with hundreds of goods and thousands of cross-price elasticity parameters. Log-linear models remain widely used in elasticity-based pricing applications because of their simple yet flexible functional form. In practice, however, researchers typically only allow a small subset of “within category” cross-price effects to enter the demand equation for any good (e.g., Hausman et al., 1994; Montgomery, 1997; Hitsch et al., 2021; Semenova et al., 2021) or even omit cross effects altogether (e.g., DellaVigna & Gentzkow, 2019). We contribute to this literature by allowing a more complete, high-dimensional set of prices to enter each demand equation and then use the proposed shrinkage priors to do data-driven regularization on price elasticity parameters. We also propose a posterior computational strategy that partially mitigates the curse of dimensionality arising from estimating log-linear demand systems at scale.
We use both simulated and actual retail scanner data to highlight the value of hierarchical non-sparse shrinkage. In our simulation experiments, we show that the proposed priors are especially useful in settings where the true elasticity matrix is either: (i) dense, meaning all pairs of goods have a non-zero cross-price elasticity; (ii) dense but becomes sparse after subtracting off the prior mean, meaning all pairs of product groups have a non-zero cross-price elasticity and many product-level elasticities are exactly equal to the group-level elasticity; or (iii) group-wise sparse, meaning many group-level cross elasticities are zero and this sparsity is inherited by product-level elasticities. In our empirical application, we use store-level data to estimate a high-dimensional log-linear demand system with 275 products that span 28 subcategories and nine categories. We show improvements in demand predictions and estimated elasticities from imposing non-sparse hierarchical shrinkage. Relative to the best-performing sparse prior, the best-performing hierarchical prior leads to a 4.5% improvement in predictions across all products, a 5.7% improvement for products with extrapolated price levels in the holdout sample, and a 6.8% improvement for products with limited price variation. In addition, we find that imposing hierarchical shrinkage leads to larger and more dispersed cross-price elasticities as well as more negative and precisely estimated own-price elasticities. We believe that being able to produce more economically reasonable elasticity estimates by simply imposing shrinkage to higher-level category elasticities is a strength of our approach. Finally, we produce competitive maps from each estimated 275 × 275 price elasticity matrix and show that hierarchical shrinkage priors lead to a more interpretable analysis of product competition and market structure.
Methodologically, our work relates to the literature on Bayesian shrinkage priors for high-dimensional regression models (Polson & Scott, 2012a; Bhadra et al., 2019). This literature has largely focused on the problem of sparse signal recovery in which the parameter vector is assumed to be zero except for a few possibly large components. Our goal is to illustrate how many existing shrinkage technologies can still be used for detecting deviations from some non-zero mean structure. Because our prior is hierarchical in nature, it also relates to the hierarchical penalties of Bien et al. (2013) and hierarchical priors of Griffin and Brown (2017) developed to control shrinkage in linear regression models with interaction terms. These penalties/priors control the rate of shrinkage of regression effects at different levels of the hierarchy, allowing for higher-order interactions to be present only when the main effects are present. In contrast, our priors control both the rate and direction of shrinkage and have applications beyond regression models with many interactions.
We also contribute to a rich literature on shrinkage estimation of price and promotional elasticities. For example, Blattberg and George (1991), Montgomery (1997), and Boatwright et al. (1999) estimate Bayesian hierarchical models that pool information across stores to improve elasticity estimates. Rather than exploiting a hierarchical structure over stores, we impose a hierarchical structure over products in order to estimate high-dimensional regression-based demand systems. Moreover, Montgomery and Rossi (1999), Fessler and Kasy (2019), and Smith et al. (2019) demonstrate the benefits of constructing priors that shrink towards economic theory. Relative to Smith et al. (2019), who also consider group-wise restrictions on price elasticity parameters in a log-linear demand system, we explicitly focus on a high-dimensional setting in which there are more prices and cross-price effects than there are observations. For example, while Smith et al. (2019) estimate demand for up to 32 goods spanning eight subcategories and one category, we estimate demand for 275 goods spanning 28 subcategories and nine categories. Smith et al. (2019) also impose group-wise equality restrictions on elasticity parameters, which is a very dogmatic form of shrinkage. Ensuring shrinkage points are correctly specified is a first-order concern and motivates their approach of estimating the grouping structure from the data. While this offers the ability to uncover boundaries in substitution, it comes at a severe cost of scalability and would not be feasible in our high-dimensional setting. Although the mean structure of our hierarchical prior is not explicitly derived from microeconomic theory, the idea of group-wise shrinkage is similar in spirit to the dimension reduction offered by the theory of separable preferences and multi-stage budgeting (Goldman and Uzawa, 1964; Gorman, 1971; Deaton & Muellbauer, 1980; Hausman et al., 1994).
Lastly, we contribute to a growing literature on large-scale demand estimation. One strand of literature has focused on modeling consumer choice among a high-dimensional set of substitutes from one product category. A variety of large-scale discrete choice models have been proposed and made tractable through regularization (Bajari et al., 2015), random projection (Chiong & Shum, 2018), and the use of auxiliary data on consumer consideration sets and search (Amano et al., 2019; Morozov, 2020). A second strand has focused on modeling demand for wider assortments spanning multiple categories and has exploited the flexibility of deep learning (Gabel & Timoshenko, 2021) and representation learning methods such as word embeddings (Ruiz et al., 2020) and matrix factorization (Chen et al., 2020; Donnelly et al., 2021).1 There is also a well-established literature on economic models of multi-category demand (e.g., Manchanda et al., 1999; Song & Chintagunta, 2006; Mehta, 2007; Thomassen et al., 2017; Ershov et al., 2021). However, with the exception of Ershov et al. (2021), microfounded multi-category models become intractable at scale and have only been estimated on data with relatively small choice sets spanning a few categories.
There are a few ways in which the methods in this paper differ from, and potentially complement, existing large-scale methods. The first difference is in the data required. Our estimation framework—comprising a log-linear demand system with hierarchical shrinkage priors on price elasticity parameters—uses aggregate store-level data while most papers discussed above use household-level purchase basket data. Although granular purchase data sets are becoming more ubiquitous, many marketing researchers continue to rely on store-level data to estimate price and promotional effects (Hitsch et al., 2021). Our framework thus adds to the toolkit for anyone wanting to forecast demand and estimate large cross-price elasticity matrices with aggregate data.
The second difference is in the assumptions about functional forms and the consumer choice process. Log-linear models do not require explicit assumptions about the functional form of consumer utility and can instead be viewed as a first-order approximation to some Marshallian demand system (Diewert, 1974; Pollak and Wales, 1992).2 Log-linear models are also directly parameterized by price elasticities, which is convenient when elasticities are focal objects of interest. That said, log-linear models do not allow for inferences about preference heterogeneity, are not guaranteed to satisfy global regularity conditions, and are less scalable—at least when modeled in a joint, seemingly unrelated regression system—than many existing machine learning methods.
Finally, we note that the novel contribution of this paper is not in the specification of demand, but instead in the approach to regularize high-dimensional demand parameters within a given functional form. We see at least two ways in which this paper can complement existing methods. First, estimating substitution patterns requires sufficient variation in key marketing variables such as price, which can be challenging in many retail settings. One common solution in empirical work is to exclude goods from the analysis whose prices either do not vary or perfectly covary with other prices (e.g., Donnelly et al., 2021). In contrast, our hierarchical shrinkage approach pools information across goods to produce reasonable estimates of cross-price elasticities for all products, even those with little to no price variation. Second, while we use log-linear demand as a leading case, we believe that hierarchical shrinkage priors can facilitate the estimation of many other demand systems at scale. Examples include models based on quadratic utility functions (Wales and Woodland, 1983; Thomassen et al., 2017) and discrete choice over product bundles (Gentzkow, 2007; Song & Chintagunta, 2006; Ershov et al., 2021), both of which contain sets of pairwise substitution parameters that grow quadratically in the number of goods.
The remainder of this paper is organized as follows. Section 2 defines the log-linear demand system and outlines existing approaches for imposing sparse shrinkage, including global-local priors. Section 3 develops hierarchical global-local priors. Section 4 discusses posterior computation strategies. Section 5 presents results from a set of simulation experiments and Section 6 presents results of an empirical application to store-level scanner data. Section 7 concludes.

2 Background: Regularizing high-dimensional demand

In this paper we propose a non-sparse, hierarchical shrinkage approach to improve the estimation of high-dimensional demand models. Our priors apply to demand models characterized by a set of pairwise product substitution parameters whose dimension grows non-linearly in the number of goods. Throughout the paper, we use log-linear demand as a leading example. Before introducing our shrinkage framework, we first outline the specification of a log-linear demand system and standard existing approaches for regularization.

2.1 Demand specification

Consider an assortment of products indexed by \(i=1,\dotsc ,p\). For each product, we observe its unit sales qit and price pit across weeks \(t=1,\dotsc ,n\). In a log-linear demand system, the log of unit sales of product i at time t is regressed on its own log price, the log prices of all other products, and a vector of controls which can include product intercepts, seasonal trends, and promotional activity:
$$ \log q_{it} = \beta_{ii}\log p_{it} + \sum\limits_{j\ne i}\beta_{ij}\log p_{jt} + c_{it}^{\prime}\phi_{i} + \varepsilon_{it}. $$
(1)
The p × 1 error vector \(\varepsilon _{t}=(\varepsilon _{1t},\dotsc ,\varepsilon _{pt})\) is assumed to follow a N(0,Σ) distribution where the covariance matrix Σ captures any unobserved, contemporaneous correlations between goods.
Log-linear demand systems are attractive for a few reasons. First, the price coefficients represent own and cross-price elasticities, which are often focal economic objects in the analysis of market structure and pricing/promotion schedules. Second, the functional form is simple and admits tractable estimators of all model parameters. Third, the model is flexible in that there are no restrictions imposed on the βij's so we can accommodate both substitutes (βij > 0) and complements (βij < 0). Log-linear models are also flexible in the sense that they possess enough parameters to serve as a first-order approximation to a valid Marshallian demand system (Diewert, 1974; Pollak & Wales, 1992). However, because there are p² elasticities in a system with p goods, regularization is needed at scale.
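To fix ideas, the following sketch simulates data and fits a single demand equation of the form in Eq. 1 by ordinary least squares. It is a minimal illustration with placeholder objects (logP, controls, logQ1) and arbitrary dimensions; the paper's actual estimation uses the Bayesian samplers described in Section 4.

```r
# Minimal sketch of Eq. 1 for one product (placeholder data and dimensions).
set.seed(1)
n <- 104   # weeks
p <- 10    # products
logP <- matrix(rnorm(n * p, sd = 0.1), n, p)                  # log prices
controls <- cbind(intercept = 1, promo = rbinom(n, 1, 0.2))   # controls c_it
beta1 <- c(-2, runif(p - 1, -0.1, 0.3))                       # own + cross elasticities
logQ1 <- as.vector(logP %*% beta1 + controls %*% c(3, 0.5)) + rnorm(n, sd = 0.2)

# Unregularized least-squares fit: every product's log price enters the
# equation, so the design already has p + 2 columns for a single product.
fit <- lm(logQ1 ~ 0 + logP + controls)
coef(fit)[1:3]   # first entries are estimates of beta_{1j}
```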

2.2 Sparse shrinkage and global-local priors

We now outline a standard sparse approach to regularizing parameters in the log-linear demand system in Eq. 1 which will both lay the groundwork for, and also serve as a benchmark against, the hierarchical priors we construct in the following section. We take a Bayesian approach to estimation and so regularization will be imparted by way of the prior. There are now many prior specifications that can be used for elasticity parameters βij, each having different tail behavior and thus inducing different forms of shrinkage (see, e.g., Bhadra et al., 2019). We specifically focus on global-local priors (Polson & Scott, 2010), which are scale mixtures of normals:
$$ \begin{array}{@{}rcl@{}} \beta_{ij} |\lambda_{ij}^{2},\tau^{2}\!&\!\sim& \text{N}(0,\tau^{2}\lambda_{ij}^{2}),\\ \lambda_{ij}^{2}\!&\!\sim& G. \end{array} $$
(2)
Here τ2 is the “global” variance that controls the overall amount of shrinkage across all elasticity parameters βij while \(\lambda _{ij}^{2}\) is a “local” variance that allows for component-wise deviations from the shrinkage imposed by τ2. The local variances are distributed according to a mixing distribution G, which is often assumed to be absolutely continuous and will thus admit an associated density g(⋅).3
One of the reasons that global-local priors are attractive is that, given an appropriate choice of mixing density, they can exhibit two useful properties for sparse signal recovery: (i) a spike of mass at zero to encourage sparsity for small noisy observations; and (ii) heavy tails to leave large signals unshrunk (Polson & Scott, 2010). Together, (i) and (ii) induce shrinkage behavior that mimics the hard selection rules of Bayesian two-group, spike-and-slab priors (Mitchell & Beauchamp, 1988; George & McCulloch, 1993; Ishwaran & Rao, 2005) while admitting much more efficient posterior computation strategies.
The global-local structure is also very general and nests many common Bayesian shrinkage priors. For example, ridge regression (Hoerl & Kennard, 1970) arises when \(\lambda _{ij}^{2}=1\), the Bayesian lasso (Park & Casella, 2008; Hans, 2009) arises when \(\lambda _{ij}^{2}\) follows an exponential distribution, and the horseshoe (Carvalho et al., 2010) arises when λij follows a half-Cauchy distribution. The priors can be compared using their shrinkage profiles, which measure the amount by which the posterior mean shrinks the least squares estimate of a regression coefficient to zero, and is related to the shape of the mixing density g(⋅). For example, the tails of an exponential mixing density are lighter than the polynomial tails of the half-Cauchy, suggesting that the Bayesian lasso may tend to over-shrink large regression coefficients and under-shrink small ones relative to the horseshoe (Polson & Scott, 2010; Datta & Ghosh, 2013; Bhadra et al., 2016).
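These shrinkage profiles can be visualized by simulation. The sketch below is illustrative only: it draws local variances from each of the three mixing distributions (a point mass at one for ridge, an exponential for the Bayesian lasso, and a squared half-Cauchy for the horseshoe, with the global variance fixed to one) and compares the tails of the implied marginal priors for a coefficient.

```r
# Illustrative draws from the marginal priors implied by three mixing
# distributions in the global-local family (global variance tau^2 = 1).
set.seed(1)
m <- 1e5
lambda2 <- list(
  ridge     = rep(1, m),            # degenerate mixing density
  lasso     = rexp(m, rate = 1),    # exponential mixing density
  horseshoe = abs(rcauchy(m))^2     # half-Cauchy prior on lambda
)
beta_draws <- lapply(lambda2, function(l2) rnorm(m, mean = 0, sd = sqrt(l2)))

# Heavier-tailed mixing densities leave more prior mass on large coefficients.
sapply(beta_draws, function(b) mean(abs(b) > 5))
```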
Although the tail behavior and associated shrinkage properties of global-local priors have now been studied extensively, existing work typically considers the problem of sparse signal recovery and so the prior means in Eq. 2 are assumed to be fixed at zero. However, economic theory offers many reasons for why sparse shrinkage may not be appropriate for price elasticity parameters. First, own-price effects should be negative by the law of demand and cross-price effects in a Marshallian demand system need not be zero even if true substitution effects are zero. Second, the property of Cournot aggregation (or “adding-up”) implies that demand must become more elastic as the set of available substitutes increases. If cross-effects are arbitrarily shrunk towards zero, then the magnitude of the own effects must fall, which would lead to estimates of own-price effects that are biased downward in magnitude. Third, line pricing can produce highly correlated prices among substitutable goods, which would lead to heavy shrinkage on the associated cross-price effects in estimation. However, fixing the shrinkage points to zero would suggest that varieties within the product line are unrelated even though the very nature of price correlation reflects a high degree of substitution. Together, these examples suggest that care must be taken when regularizing price effect parameters, which motivates our development of non-sparse shrinkage priors.

3 Hierarchical global-local priors

In this section, we develop hierarchical global-local priors with two goals in mind. The first is to allow own and cross-price elasticities to be shrunk towards higher-level group elasticities (rather than zero), where the hierarchy of groups is defined using a pre-specified product classification tree. The second is to allow shrinkage to propagate down through the tree. To this end, we extend the standard global-local specification by adding a hierarchical mean structure to connect shrinkage points across levels in the tree, as well as a hierarchical product structure on the local variances to connect the degree of shrinkage across levels in the tree.

3.1 Notation

We first define a minimal amount of notation. We assume that the researcher has access to an L-level product classification tree where levels are indexed by \(\ell =0,1,\dotsc ,L-1\) and are ordered such that ℓ = 0 is the lowest level corresponding to products, ℓ = 1 corresponds to the most granular definition of product groups, and ℓ = L − 1 corresponds to the highest level and most coarse definition of product groups. Further define nℓ to be the number of nodes (groups) on level ℓ where n0 > n1 > ⋯ > nL−1.
The indexing notation we introduce is slightly more complicated than the usual indexing for graphical models because, in our case, the target parameters are always defined with respect to two nodes (e.g., the price elasticity between two products), not one. So instead of considering the parent of node i, we must consider the parent of (i,j), which itself will be a pair of nodes. We now define the indexing function that will be used throughout.
Definition 1
Let (i,j) denote a pair of nodes at the lowest level (level 0) of the classification tree. Then let π(ij|ℓ) denote the parent index function which produces the level-ℓ parent nodes of (i,j).
To build intuition, consider the following two products: (i) Lay’s potato chips, which belongs to the “Potato Chip” subcategory (level 1) and “Salty Snack” category (level 2); and (ii) Heineken beer, which belongs to the “Imported Beer” subcategory (level 1) and “Beer” category (level 2). Then the parents of (Lay’s, Heineken) on levels 1 and 2 are as follows.
$$ \begin{array}{@{}rcl@{}} \pi(\text{Lay's},\text{Heineken}|1)&=&(\text{Potato Chips},\text{Imported Beer})\\ \pi(\text{Lay's},\text{Heineken}|2)&=&(\text{Salty Snacks},\text{Beer}) \end{array} $$
Given a product-level elasticity βij, we let 𝜃π(ij|m) represent the relationship between the level-m ancestors of i and j. An example of this lineage of parameters is shown in Fig. 1. The darkest square in the left-hand-side grid denotes βij. The level-1 parent of (i,j) is π(ij|1) = (4,2) and the level-2 parent of (i,j) is π(ij|2) = (2,1). The idea is to then direct shrinkage of βij towards 𝜃π(ij|1) (grid in the middle), which is in turn shrunk towards 𝜃π(ij|2) (grid on the right).
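In practice the parent index function is just a lookup into the product classification table. The sketch below assumes a hypothetical data frame `tree` with one row per product and one column per level; the product and group names mirror the example above but are otherwise arbitrary.

```r
# Hypothetical classification table: one row per product, one column per level.
tree <- data.frame(
  product     = c("Lay's", "Ruffles", "Heineken", "Corona"),
  subcategory = c("Potato Chips", "Potato Chips", "Imported Beer", "Imported Beer"),
  category    = c("Salty Snacks", "Salty Snacks", "Beer", "Beer"),
  stringsAsFactors = FALSE
)

# pi(ij | level): returns the level-l parent nodes of the product pair (i, j).
parent <- function(i, j, level, tree) {
  stopifnot(level >= 1, level <= ncol(tree) - 1)
  col <- level + 1   # column 1 holds the products themselves (level 0)
  c(tree[[col]][tree$product == i], tree[[col]][tree$product == j])
}

parent("Lay's", "Heineken", level = 1, tree)  # "Potato Chips"  "Imported Beer"
parent("Lay's", "Heineken", level = 2, tree)  # "Salty Snacks"  "Beer"
```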

3.2 Prior construction

Using the indexing notation above, we now introduce a sequence of prior distributions on the elasticity parameters of the log-linear demand model in Eq. 1. Starting at the lowest level of the tree, define the prior on the (i,j) product pair as:
$$ \beta_{ij} \sim \text{N}\left( \theta_{\pi(ij|1)},\tau_{\beta}^{2}{\Psi}_{ij}\right) $$
(3)
which is a global-local prior with global variance \(\tau _{\beta }^{2}\), local variance Ψij, and prior mean 𝜃π(ij|1). Note that this notation holds for any (i,j) pair including the own elasticities where i = j. However, in order to account for differences in the expected signs of own and cross-price elasticities, we will ultimately build up two separate hierarchical structures: one for the βii’s and one for the βij’s. The former will shrink the product-level own elasticities towards higher-level own-price elasticities whereas the latter will shrink product-level cross elasticities towards higher-level cross-price elasticities. For ease of exposition, we will focus our prior construction on the case where ij and the βij’s are cross elasticities.
Hierarchy of means
We specify global-local priors on the higher-level group elasticities:
$$ \theta_{\pi(ij|\ell)} \sim \text{N}\left( \theta_{\pi(ij|\ell+1)},\tau_{\ell}^{2}{\Psi}_{\pi(ij|\ell)}\right), \qquad \ell=1,\dotsc,L-1, $$
(4)
where \(\tau _{\ell }^{2}\) is the global variance across all level-ℓ elasticities 𝜃π(ij|ℓ), Ψπ(ij|ℓ) is a local variance, and 𝜃π(ij|ℓ+1) is the parent cross-group elasticity to 𝜃π(ij|ℓ). This hierarchical mean structure allows the direction of shrinkage to be governed by the classification tree. That is, in the absence of a strong signal in the data, elasticities will be shrunk towards higher-level elasticity parameters instead of zero.
Hierarchy of variances
We also impose a hierarchical structure on the local variance parameters. At the product level, we have:
$$ {\Psi}_{ij}=\lambda_{ij}^{2} \prod\limits_{k=\ell+1}^{L-1} \lambda_{\pi(ij|k)}^{2}, \quad \lambda_{ij}^{2} \sim G_{\beta}, $$
(5)
and at higher levels in the tree, we have:
$$ {\Psi}_{\pi(ij|\ell)}=\lambda_{\pi(ij|\ell)}^{2} \prod\limits_{k=\ell+1}^{L-1} \lambda_{\pi(ij|k)}^{2}, \quad \lambda_{\pi(ij|\ell)}^{2} \sim G_{\ell}, \quad \ell=1,\dotsc,L-1. $$
(6)
Here \(\lambda _{ij}^{2}\) and \(\lambda _{\pi (ij|\ell )}^{2}\) are variances associated with βij and 𝜃π(ij|ℓ), respectively, and represent local variances absent any hierarchical structure connecting variances across levels. That is, without a hierarchy of variances we would simply have \({\Psi }_{ij}=\lambda _{ij}^{2}\) and \({\Psi }_{\pi (ij|\ell )}=\lambda _{\pi (ij|\ell )}^{2}\). With the product hierarchy of variances, the induced local variances Ψπ(ij|ℓ) will be small whenever either \(\lambda _{\pi (ij|\ell )}^{2}\) is small or any \(\lambda _{\pi (ij|s)}^{2}\) is small for s > ℓ (i.e., higher levels in the tree), which allows shrinkage to propagate down through the tree. Taken together with the hierarchical mean structure, the hierarchical variance structure implies that, in the absence of a strong signal in the data, price elasticities will be strongly shrunk towards higher-level elasticities and these higher-level group elasticities will be strongly shrunk towards each other.
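To see how the mean and variance hierarchies interact, the sketch below simulates a cross-elasticity matrix from a three-level version of the prior (the structure shown in example (c) below), using half-Cauchy local variances at every level and all global variances set to one. The group sizes are arbitrary, and for simplicity the sketch does not separate own from cross elasticities as the paper does.

```r
# Illustrative simulation from a three-level hierarchical prior with
# horseshoe mixing at every level and all global variances equal to one.
set.seed(1)
n2 <- 3; n1 <- 6; p <- 30                  # groups at levels 2 and 1, then products
g1 <- rep(1:n2, length.out = n1)           # level-1 group -> level-2 parent
g0 <- rep(1:n1, length.out = p)            # product      -> level-1 parent
rhc2 <- function(k) matrix(abs(rcauchy(k^2))^2, k, k)   # half-Cauchy^2 variances

theta_bar <- 0
lam2 <- rhc2(n2); lam1 <- rhc2(n1); lam0 <- rhc2(p)

# Level 2: shrink towards the common mean theta_bar.
theta2 <- matrix(rnorm(n2^2, theta_bar, sqrt(lam2)), n2, n2)
# Level 1: shrink towards level-2 parents; local variances multiply down the tree.
theta1 <- matrix(rnorm(n1^2, theta2[g1, g1], sqrt(lam1 * lam2[g1, g1])), n1, n1)
# Level 0: product-level elasticities shrink towards their level-1 parents.
beta <- matrix(rnorm(p^2, theta1[g0, g0],
                     sqrt(lam0 * lam1[g0, g0] * lam2[g1[g0], g1[g0]])), p, p)
```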
Examples
A summary of model parameters is provided in Table 1. Below we provide examples of the complete hierarchical prior specification for trees with different numbers of levels. A single-level tree corresponds to a prior for βij that has a fixed mean \(\bar {\theta }\) and does not encode any information about product groups. If the tree has at least two levels, then βij will be shrunk towards its parent elasticity and shrinkage will be allowed to propagate. Note that the level at which shrinkage begins to propagate is also a choice of the researcher. In the examples below, shrinkage starts to propagate at the top level (ℓ = L − 1) of the tree.
(a)
One-level prior
$$ \beta_{ij} \sim \text{N}\left( \bar{\theta},\tau_{\beta}^{2}{\Psi}_{ij}\right), \quad {\Psi}_{ij}=\lambda_{ij}^{2} $$
 
(b)
Two-level prior
$$ \begin{array}{@{}rcl@{}} \beta_{ij} \sim \text{N}\left( \theta_{\pi(ij|1)},\tau_{\beta}^{2}{\Psi}_{ij}\right), &&\quad {\Psi}_{ij}=\lambda_{ij}^{2}\lambda_{\pi(ij|1)}^{2}\\ \theta_{\pi(ij|1)} \sim \text{N}\left( \bar{\theta},{\tau_{1}^{2}}{\Psi}_{\pi(ij|1)}\right), &&\quad {\Psi}_{\pi(ij|1)}=\lambda_{\pi(ij|1)}^{2} \end{array} $$
 
(c)
Three-level prior
$$ \begin{array}{@{}rcl@{}} \beta_{ij} \sim \text{N}\left( \theta_{\pi(ij|1)},\tau_{\beta}^{2}{\Psi}_{ij}\right), &&\quad {\Psi}_{ij}=\lambda_{ij}^{2}\lambda_{\pi(ij|1)}^{2}\lambda_{\pi(ij|2)}^{2}\\ \theta_{\pi(ij|1)} \sim \text{N}\left( \theta_{\pi(ij|2)},{\tau_{1}^{2}}{\Psi}_{\pi(ij|1)}\right), &&\quad {\Psi}_{\pi(ij|1)}=\lambda_{\pi(ij|1)}^{2}\lambda_{\pi(ij|2)}^{2}\\ \theta_{\pi(ij|2)} \sim \text{N}\left( \bar{\theta},{\tau_{2}^{2}}{\Psi}_{\pi(ij|2)}\right), &&\quad {\Psi}_{\pi(ij|2)}=\lambda_{\pi(ij|2)}^{2} \end{array} $$
 
Table 1
Summary of notation

Parameter | Level | Description
Elasticities
  βij | ℓ = 0 | Price elasticity of demand for product i with respect to the price of product j
  𝜃π(ij|ℓ) | ℓ ≥ 1 | Level-ℓ ancestor elasticity of the (i,j) product pair
Local variances
  \(\lambda _{ij}^{2}\) | ℓ = 0 | Local variance for βij
  \(\lambda _{\pi (ij|\ell )}^{2}\) | ℓ ≥ 1 | Local variance for 𝜃π(ij|ℓ)
  Ψij | ℓ = 0 | Product of all higher-level local variances including \(\lambda _{ij}^{2}\)
  Ψπ(ij|ℓ) | ℓ ≥ 1 | Product of all higher-level local variances including \(\lambda _{\pi (ij|\ell )}^{2}\)
Global variances
  \(\tau _{\beta }^{2}\) | ℓ = 0 | Global variance across all product elasticities βij
  \(\tau _{\ell }^{2}\) | ℓ ≥ 1 | Global variance across all level-ℓ elasticities 𝜃π(ij|ℓ)

3.3 Choice of mixing densities

The hierarchical priors outlined above endow each product-level elasticity and group-level elasticity with a local variance \(\lambda _{\pi (ij|\ell )}^{2}\) which is distributed according to a level-specific mixing distribution Gℓ. As discussed in Section 2.2, the tails of the associated densities g(⋅) will play a key role in shaping the shrinkage imposed by the prior. Although there are now many possible choices for g(⋅), we confine our attention to three forms of shrinkage: (i) ridge, where the mixing density is a degenerate distribution and local variances are fixed to one; (ii) lasso, with an exponential mixing density; and (iii) horseshoe, with a half-Cauchy mixing density. These three types of priors remain common choices in empirical work and have very different shrinkage profiles (Bhadra et al., 2019).
Our hierarchical global-local priors also require the specification of a mixing density for each level of the tree. We can therefore conceive of using different forms of shrinkage across different levels (e.g., ridge at the subcategory level and horseshoe at the product level). However, this also raises questions of whether properties of g(⋅) at higher levels in the tree affect shrinkage behavior at the product level, which we address in the following section.

3.4 Some theory on shrinkage properties

The typical strategy for characterizing shrinkage in global-local models is to examine the shape—specifically, the tails—of the marginal prior for regression coefficients βij. Because global-local priors are scale mixtures of normals, the heaviness of the tails of this marginal prior will be determined by the tails of the mixing density (Barndorff-Nielsen et al., 1982). However, in our setting this analysis is complicated by the fact that the marginal prior for βij will depend on multiple mixing densities in the hierarchical global-local structure.
In this section, we provide three results to characterize shrinkage properties of hierarchical global-local priors. First, we show that the marginal prior for βij still retains a scale mixture of normals representation and so the mixing densities will continue to play a key role in shaping the shrinkage profile for βij. Second, we show that if the same heavy-tailed mixing density is specified at each level of the tree, then its heaviness will be preserved under the hierarchical product structure that we impose on local variances. Finally, we show that even if a combination of heavy and light-tailed mixing densities is specified across different levels, the heavy-tailed mixing densities will ultimately dominate and shape the product-level shrinkage profile.
We focus on two specific forms of shrinkage. The first is the shrinkage of βij to 𝜃π(ij|m), which is the shrinkage of the product-level elasticity to its level-m parent elasticity. This type of “vertical” shrinkage allows us to assess how quickly product-level elasticities can be pulled towards higher-level elasticities. Here we can write βij as a function of its parent mean and error term:
$$ \beta_{ij} = \theta_{\pi(ij|1)} + \xi_{ij}, \qquad \xi_{ij} \sim \text{N}\left( 0,\tau_{\beta}^{2}{\Psi}_{ij}\right) $$
(7)
where, again, for any \(\ell =1,\dotsc ,L-1\):
$$ \theta_{\pi(ij|\ell)} = \theta_{\pi(ij|\ell+1)}+ \xi_{\pi(ij|\ell)}, \qquad \xi_{\pi(ij|\ell)} \sim \text{N}\left( 0,\tau_{\ell}^{2}{\Psi}_{\pi(ij|\ell)}\right). $$
(8)
We can therefore write βij as a function of its level-m parent elasticity and a sum of errors across levels:
$$ \beta_{ij} = \theta_{\pi(ij|m)} + \sum\limits_{\ell=1}^{m-1}\xi_{\pi(ij|\ell)} + \xi_{ij}, $$
(9)
which yields the prior distribution:
$$ \beta_{ij}-\theta_{\pi(ij|m)}\sim \text{N}\left( 0, \sum\limits_{\ell=1}^{m-1} \tau_{\ell}^{2}{\Psi}_{\pi(ij|\ell)} + \tau_{\beta}^{2}{\Psi}_{ij}\right). $$
(10)
This prior is a scale mixture of normals where the scales are sums over each level's local variance Ψπ(ij|ℓ), which is in turn a product of \(\lambda _{\pi (ij|\ell )}^{2}, \dots , \lambda _{\pi (ij|L-1)}^{2}\). Thus, the local variances will be a sum of products and if there is a value of s (for s ≥ m) for which \(\lambda _{\pi (ij|s)}^{2}\) is “small” then βij will tend to be very close to 𝜃π(ij|m).
We also consider the shrinkage of βij to \(\beta _{i^{\prime }j^{\prime }}\) (for \(i\ne j\), \(i^{\prime }\ne j^{\prime }\), and \(i\ne i^{\prime }\)), which is the shrinkage between two product-level elasticities. This type of “horizontal” shrinkage allows us to assess the extent to which elasticities become more similar as they become closer in the tree. Formally define \(m^{\star }=\min \limits \{m:\pi (ij|m)=\pi (i^{\prime }j^{\prime }|m)\}\) to be the lowest level in the tree such that all four products (\(i,j,i^{\prime },j^{\prime }\)) share a common ancestor (i.e., belong to the same group at some level in the tree), where m⋆ = L if no common parent node exists within the tree. Then we can write
$$ \beta_{ij}-\beta_{i^{\prime}j^{\prime}} = {\sum}_{s=\ell}^{m^{\star}-1}\Big(\xi_{\pi(ij|s)}-\xi_{\pi(i^{\prime}j^{\prime}|s)}\Big) $$
(11)
so the prior distribution of the difference is
$$ \beta_{ij}-\beta_{i^{\prime}j^{\prime}} \sim \text{N}\left( 0,\sum\limits_{s=\ell}^{m^{\star}-1} \Big({\Psi}_{\pi(ij|s)}+ {\Psi}_{\pi(i^{\prime}j^{\prime}|s)} \Big) {\tau_{s}^{2}}\right). $$
(12)
Like the marginal distribution of the cross elasticities, the marginal prior distribution of the difference will be a scale mixture of normals, where the scales are sums of products of \(\lambda _{\pi (ij|s)}^{2}\) and \(\lambda _{\pi (i^{\prime }j^{\prime }|s)}^{2}\). The number of terms in the sum is m⋆ − 1 − ℓ and so the variance of the differences will tend to be larger if m⋆ is larger (i.e., the products are less similar). The form of the variances in Eq. 12 implies that if Ψπ(ij|m) is “small” on level m then Ψπ(ij|s) and \({\Psi }_{\pi (i^{\prime }j^{\prime }|s)}\) will tend to be small for s < m and so the variance of the difference will tend to be smaller further down the tree. This allows shrinkage to propagate down the tree with subsequent sub-categorizations of products tending to have similar cross-elasticities.
The results above show that the priors on βij and the differences \((\beta _{ij} - \beta _{i^{\prime }j^{\prime }})\) can be expressed as normal scale mixtures and so, like in sparse signal detection settings, the shape of the marginal prior will again be determined by the mixing density (Barndorff-Nielsen et al., 1982). However, while there is only one mixing density in traditional regression priors, the marginal priors for βij and \((\beta _{ij} - \beta _{i^{\prime }j^{\prime }})\) involve a “scaled sum of products” transformation over many mixing densities. It is therefore not clear whether the heaviness of the mixing density specified at level ℓ is: (i) preserved under the scaled sum of products transformation; or (ii) tarnished by mixing densities with lighter tails at higher levels in the tree. We clarify both points in the following two propositions.
Definition 2
The random variable ζ has an L-sum of scaled products distribution if it can be written as \(\zeta = {\sum }_{\ell =1}^{L} \tau _{\ell }^{2} {\prod }_{s=1}^{\ell } {\lambda _{s}^{2}}\) with fixed \({\tau _{1}^{2}},\dots ,{\tau _{L}^{2}}\).
Proposition 1
Suppose ζ has an L-sum of scaled products distribution where \(\lambda _{\ell }^{2}\stackrel {iid}{\sim } G\) for \(\ell =1,\dotsc ,L\) and G is a regularly varying distribution with index α. Then ζ is regularly varying with index α.
Proof
Because the \({\lambda _{s}^{2}}\)’s are all independent and regularly varying with index α, then \({\Psi }_{\ell }={\prod }_{s=1}^{\ell }{\lambda _{s}^{2}}\) is also regularly varying with index α (Cline, 1987). Then the closure property of regularly varying functions guarantees that \(\zeta = {\sum }_{\ell =1}^{L} \tau _{\ell }^{2} {\prod }_{s=1}^{\ell } {\lambda _{s}^{2}}\) is also regularly varying with index α. □
This first result shows that the heaviness of the mixing density at level is preserved under the scaled sum of products transformation. For example, if every λs has a half-Cauchy prior then each \({\lambda _{s}^{2}}\) is an inverted-beta random variable with density g(λ2) ∝ (λ2)− 1/2(1 + λ2)− 1 (Polson & Scott, 2012b), which is regularly varying with index -3/2 and so \({\lambda _{s}^{2}}\) is regularly varying with index 1/2 (Bingham et al., 1987). Then by Proposition 1, the sum of products would also have regularly varying tails, and the different forms of shrinkage in Eqs. 10 and 12 will all have tails of the same heaviness as a standard horseshoe prior.
Proposition 2
Suppose ζ has an L-sum of scaled products distribution with \(\lambda _{\ell }^{2}\sim G_{\ell }\) for \(\ell =1,\dotsc ,L\) and let RV = {ℓ : Gℓ is regularly varying with index α} be non-empty. If there exists an 𝜖 > 0 such that Gs has a finite α + 𝜖 moment for all s ∉ RV, then ζ is regularly varying with index α.
Proof
If at level ℓ at least one of \({\lambda _{1}^{2}}, \dots , \lambda _{\ell }^{2}\) is regularly varying with index α then \({\Psi }_{\ell }={\prod }_{s=1}^{\ell } {\lambda _{s}^{2}}\) is regularly varying with index α. If none of \({\lambda _{1}^{2}}, \dots , \lambda _{\ell }^{2}\) are regularly varying, then by assumption Ψℓ has a finite α + 𝜖 moment. Since, by assumption, at least one of \({\lambda _{1}^{2}},\dots , {\lambda _{L}^{2}}\) is regularly varying, at least one of \({\Psi }_{1},\dotsc ,{\Psi }_{L}\) is regularly varying while the others are guaranteed to have a finite α + 𝜖 moment. Therefore, the closure properties of regularly varying random variables imply that ζ is also regularly varying with index α (Bingham et al., 1987). □
Proposition 2 shows that the sum of products has regular variation if at least one element is regularly varying. This suggests that sparsity shrinkage of the cross-elasticities at the lowest level of the tree does not require sparsity shrinkage at all levels. For example, we know by Proposition 1 that if \(\lambda _{ij}^{2},\lambda _{\pi (ij|1)}^{2},\dotsc ,\lambda _{\pi (ij|L-1)}^{2}\) are all regularly varying with index α, then their product is also regularly varying with index α. Now by Proposition 2, we know that even if \(\lambda _{\pi (ij|s)}^{2}\) is not regularly varying for s > ℓ but has a finite α + 𝜖 moment for some 𝜖 > 0, then the product is again regularly varying. Examples of non-regularly varying distributions with a finite α + 𝜖 moment are a degenerate distribution (i.e., ridge shrinkage with λ2 = 1) and an exponential distribution (i.e., lasso shrinkage). Therefore, heavy tails at any level of the tree are all that is required to get sparsity shrinkage for the product-level elasticities. We explore different combinations of shrinkage in the simulations and empirical applications below.

4 Posterior computation

Given the log-linear demand system defined in Section 2.1 and hierarchical priors outlined in Section 3, we now turn to our posterior sampling strategy. Note that the presence of product-specific control variables and general error structure in Eq. 1 leads to a seemingly unrelated regression (SUR) model:
$$ \begin{pmatrix} y_{1}\\ y_{2}\\\vdots\\y_{p} \end{pmatrix} = \begin{pmatrix} X & 0 & {\cdots} & 0\\ 0 & X & {\cdots} & 0\\ {\vdots} & {\vdots} & {\ddots} & \vdots\\ 0 & 0 & {\cdots} & X\\ \end{pmatrix} \begin{pmatrix} \beta_{1}\\\beta_{2}\\\vdots\\\beta_{p} \end{pmatrix} + \begin{pmatrix} C_{1} & 0 & {\cdots} & 0\\ 0 & C_{2} & {\cdots} & 0\\ {\vdots} & {\vdots} & {\ddots} & \vdots\\ 0 & 0 & {\cdots} & C_{p}\\ \end{pmatrix} \begin{pmatrix} \phi_{1}\\\phi_{2}\\\vdots\\\phi_{p} \end{pmatrix} + \begin{pmatrix} \varepsilon_{1}\\ \varepsilon_{2}\\\vdots\\ \varepsilon_{p} \end{pmatrix} $$
(13)
where yi is the n × 1 vector of log sales for product i, X is the n × p matrix of log prices, βi is the p × 1 vector of own and cross-price elasticities associated with product i, and Ci is an n × d matrix of control variables with coefficients ϕi. In vector form, we have
$$ y = \text{X}\beta + \text{C}\phi + \varepsilon, \qquad \varepsilon \sim\text{N}(0,{\Sigma}\otimes I_{n}) $$
(14)
where \(y=(y_{1}^{\prime }, y_{2}^{\prime }, \dots , y_{p}^{\prime })'\), \(\text {X}=\text {diag}(X,X,\dotsc ,X)\), \(\beta = (\beta _{1}^{\prime }, \beta _{2}^{\prime }, \dots , \beta _{p}^{\prime })'\), \(\text {C}=\text {diag}(C_{1},C_{2},\dotsc ,C_{p})\), \(\phi = (\phi _{1}^{\prime }, \phi _{2}^{\prime }, \dots , \phi _{p}^{\prime })'\) and \(\varepsilon = (\varepsilon _{1}^{\prime }, \varepsilon _{2}^{\prime }, \dots , \varepsilon _{p}^{\prime })'\).
Sampling from the posterior of a SUR model first requires transforming the model in Eq. 14 into one with homogeneous unit errors. Let U denote the upper triangular Cholesky root of Σ and define \(\tilde {y} = (U^{-1}\otimes I_{n})y\), \(\tilde {\text {X}} = (U^{-1}\otimes I_{n})\text {X}\), and \(\tilde {\text {C}} = (U^{-1}\otimes I_{n})\text {C}\). Then the following transformed system represents a standard normal linear regression model.
$$ \tilde{y} = \tilde{\text{X}}\beta + \tilde{\text{C}}\phi + \tilde{\varepsilon}, \qquad \tilde{\varepsilon}\sim \text{N}(0,I_{np}). $$
(15)
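A minimal sketch of this whitening step is given below, assuming the diagonal restriction on Σ adopted later (so that U⁻¹ ⊗ Iₙ simply rescales each product's equations by the inverse error standard deviation) and using placeholder dimensions and data.

```r
# Sketch of the SUR whitening transformation in Eq. 15 (placeholder data,
# diagonal Sigma so the Cholesky root is simply the diagonal of std. devs).
set.seed(1)
n <- 20; p <- 3
Sigma <- diag(runif(p, 0.5, 2))           # diagonal error covariance
U     <- chol(Sigma)                      # upper-triangular Cholesky root
Tmat  <- kronecker(solve(U), diag(n))     # (U^{-1} %x% I_n)

X    <- matrix(rnorm(n * p), n, p)        # n x p matrix of log prices
bigX <- kronecker(diag(p), X)             # block-diagonal stacked design
y    <- rnorm(n * p)                      # stacked log sales (placeholder)

y_tilde <- as.vector(Tmat %*% y)          # transformed system has unit errors
X_tilde <- Tmat %*% bigX
```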
The full set of model parameters includes the elasticities β, the coefficients on controls ϕ, the error variance Σ, and the set of hierarchical prior parameters \({\Omega }=\big (\{\theta _{\pi (ij|\ell )}\},\{\lambda _{\pi (ij|\ell )}^{2}\},\{\tau _{\ell }^{2}\}\big )\).
Priors
The priors for β and all hierarchical hyperparameters are given in Section 3.2. We write the p² × p² prior covariance matrix for β as \({\Lambda }_{*}=\tau _{\beta }^{2}\text {diag}(\text {vec}({\Lambda }))\), where Λ is a p × p matrix of local variances Ψij as defined in Eq. 5. Note that for a standard global-local prior, the (i,j)th element of Λ would be \(\lambda _{ij}^{2}\). We place \(\text {N}(\bar {\phi },A_{\phi }^{-1})\) priors on the control variable coefficients, which are conditionally conjugate to the normal likelihood given Σ. Inverse Wishart priors are commonly used for covariance matrices in Bayesian SUR models; however, if p > n then Σ will be rank deficient. One approach would be to also regularize Σ (Li et al., 2019; Li et al., 2021). We instead impose a diagonal restriction \({\Sigma }=\text {diag}({\sigma _{1}^{2}},\dotsc ,{\sigma _{p}^{2}})\) and place independent IG(a,b) priors on each \({\sigma _{j}^{2}}\).
Full Conditionals
We construct a Gibbs sampler that cycles between the following full conditional distributions.
$$ \begin{array}{@{}rcl@{}} &&{\Omega}|\beta,\text{data} \end{array} $$
(16)
$$ \begin{array}{@{}rcl@{}} &&{\Sigma}|\beta,\phi,{\Omega},\text{data} \end{array} $$
(17)
$$ \begin{array}{@{}rcl@{}} &&\beta,\phi|{\Sigma},{\Omega},\text{data} \end{array} $$
(18)
The first full conditional represents the posterior of all global/local variances and higher-level group elasticities. Sampling from these distributions is computationally inexpensive. The elements of 𝜃 each have independent normal posteriors conditional on all variance parameters. Both local variances λ2 and global variances τ2 can also be sampled independently, but the form of their respective posteriors will depend on the choice of prior. Under ridge shrinkage, each λ2 = 1 so no posterior sampling is necessary. Under lasso shrinkage, each λ2 follows an independent exponential distribution and so the full conditionals of 1/λ2 have independent inverse Gaussian distributions (Park & Casella, 2008). Under horseshoe shrinkage, each λ follows an independent half-Cauchy distribution. We follow Makalic and Schmidt (2015) and represent the half-Cauchy as a scale mixture of inverse gammas, which is conjugate to the normal density so the target full conditional can be sampled from directly. Details are provided in Appendix A.1.
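As an illustration of the horseshoe step, the sketch below writes out the standard Makalic and Schmidt (2015) auxiliary-variable updates for a single local variance with a half-Cauchy(0, 1) prior, applied to the deviation r = β − θ from a possibly nonzero prior mean. This is a generic sketch under those assumptions; the exact conditionals used in the paper's sampler are in Appendix A.1.

```r
# One Gibbs update of a horseshoe local variance lambda^2 using the
# inverse-gamma mixture representation of the half-Cauchy (Makalic &
# Schmidt, 2015). Here r = beta - theta and tau2 is the global variance.
rinvgamma <- function(n, shape, scale) 1 / rgamma(n, shape = shape, rate = scale)

update_local_variance <- function(r, tau2, nu) {
  lambda2 <- rinvgamma(1, shape = 1, scale = 1 / nu + r^2 / (2 * tau2))
  nu      <- rinvgamma(1, shape = 1, scale = 1 + 1 / lambda2)
  list(lambda2 = lambda2, nu = nu)
}

# Toy usage: a few sweeps for one coefficient with r = 0.3 and tau2 = 1.
state <- list(lambda2 = 1, nu = 1)
for (s in 1:5) state <- update_local_variance(r = 0.3, tau2 = 1, nu = state$nu)
state$lambda2
```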
The second full conditional represents the posterior of the observational error covariance matrix. Assuming \({\Sigma }=\text {diag}({\sigma _{1}^{2}},\dotsc ,{\sigma _{p}^{2}})\) with independent IG(a,b) priors yields the following posterior:
$$ {\sigma_{j}^{2}}|\beta,\phi,{\Omega},\text{data} \sim\text{IG}\Big(a + n/2,b + (y_{j} - X\beta_{j} - C_{j}\phi_{j})'(y_{j} - X\beta_{j} - C_{j}\phi_{j})/2\Big) $$
(19)
where the j subscript denotes all elements of the given vector or matrix associated with product j. Note that the case of p > n calls for a judicious choice of (a,b) given that diffuse priors will yield barely proper posteriors. If n > p and Σ is unrestricted, typical inverse Wishart priors can be used.
The last full conditional represents the joint posterior of the regression coefficients (β,ϕ). One approach to sampling from the joint posterior is to iterate between each full conditional. For example, the posterior of β conditional on ϕ is:
$$ \beta|\phi,{\Sigma},{\Omega},\text{data} \sim \text{N}\Big((\tilde{\text{X}}^{\prime}\tilde{\text{X}}+{\Lambda}_{*}^{-1})^{-1}(\tilde{\text{X}}^{\prime}\tilde{y}_{\phi}^{*} + {\Lambda}_{*}^{-1}\bar{\beta}(\theta)),(\tilde{\text{X}}^{\prime}\tilde{\text{X}}+{\Lambda}_{*}^{-1})^{-1}\Big) $$
(20)
where \(\tilde {y}_{\phi }^{*}= \tilde {y} - \tilde {\text {C}}\phi \). Similarly, the posterior of ϕ conditional on β is:
$$ \phi|\beta,{\Sigma},{\Omega},\text{data}\sim \text{N}\Big(\big(\tilde{\text{C}}^{\prime}\tilde{\text{C}}+A_{\phi}\big)^{-1}\big(\tilde{\text{C}}^{\prime}\tilde{y}_{\beta}^{*}+A_{\phi}\bar{\phi}\big),\big(\tilde{\text{C}}^{\prime}\tilde{\text{C}}+A_{\phi}\big)^{-1}\Big) $$
(21)
where \(\tilde {y}_{\beta }^{*}=\tilde {y}-\tilde {\text {X}}\beta \). However, note that these two Gibbs steps can be improved through blocking. For example, β can be integrated out of the conditional posterior of ϕ:
$$ \phi|{\Sigma},{\Omega},\text{data}\sim \text{N}\Big(\big(\tilde{\text{C}}^{\prime}\text{P}\tilde{\text{C}}+A_{\phi}\big)^{-1}\big(\tilde{\text{C}}^{\prime}\text{P}\tilde{y}+A_{\phi}\bar{\phi}\big),\big(\tilde{\text{C}}^{\prime}\text{P}\tilde{\text{C}}+A_{\phi}\big)^{-1}\Big) $$
(22)
where \(\text {P}=I_{np}-\tilde {\text {X}}(\tilde {\text {X}}^{\prime }\tilde {\text {X}}+{\Lambda }_{*}^{-1})^{-1}\tilde {\text {X}}^{\prime }\) is an orthogonal projection matrix. Marginalizing over β will yield improvements in convergence and mixing, and comes at virtually no additional cost since the inverse contained in the projection matrix must be computed to sample from Eq. 20. It should also be noted that the posterior precision matrix in Eq. 20 requires the inversion of a p² × p² matrix which is computationally expensive when p is large. We therefore present two strategies to facilitate scalability in the following subsections.

4.1 Diagonal restriction on Σ

As with all Bayesian regression models, a computational bottleneck arises in inverting the posterior precision matrix \((\tilde {\text {X}}^{\prime }\tilde {\text {X}}+{\Lambda }_{*}^{-1})\). This is especially true for Bayesian SUR models since the design matrix X contains p stacked copies of the multivariate regression design matrix. If Σ is unrestricted, then \(\tilde {\text {X}}^{\prime }\tilde {\text {X}}\) is a dense p² × p² matrix and any sampler that directly inverts this matrix will be hopeless for large p. For example, even a sampler that calculates the inverse using Cholesky decompositions has complexity \(\mathcal {O}(p^{6})\). If instead Σ is assumed to be diagonal then both \(\tilde {\text {X}}^{\prime }\tilde {\text {X}}\) and \((\tilde {\text {X}}^{\prime }\tilde {\text {X}}+{\Lambda }_{*}^{-1})\) will have block diagonal structures, with each of the p blocks containing a p × p matrix. Computing the inverse of \((\tilde {\text {X}}^{\prime }\tilde {\text {X}}+{\Lambda }_{*}^{-1})\) then amounts to inverting each p × p block, which has computational complexity \(\mathcal {O}(p^{4})\) using Cholesky decompositions. While this is better than inverting \((\tilde {\text {X}}^{\prime }\tilde {\text {X}}+{\Lambda }_{*}^{-1})\) directly, it can still be prohibitively expensive for large p.
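With diagonal Σ and a diagonal prior covariance, the draw of β can therefore be done one product at a time. The sketch below shows a single block update for product i under those assumptions; the prior mean θᵢ and local variances ψᵢ are taken as given, and yᵢ is assumed to already have the controls' contribution removed.

```r
# Draw one product's elasticity vector beta_i from its conditional posterior
# when Sigma is diagonal (so the p^2-dimensional system factors into p blocks).
draw_beta_block <- function(X, y_i, sigma2_i, tau2_beta, psi_i, theta_i) {
  prior_prec <- diag(1 / (tau2_beta * psi_i))             # block of Lambda_*^{-1}
  Q <- crossprod(X) / sigma2_i + prior_prec               # p x p posterior precision
  b <- crossprod(X, y_i) / sigma2_i + prior_prec %*% theta_i
  R <- chol(Q)                                            # O(p^3) per block
  mu <- backsolve(R, backsolve(R, b, transpose = TRUE))   # Q^{-1} b
  as.vector(mu + backsolve(R, rnorm(length(theta_i))))    # add N(0, Q^{-1}) noise
}
# Example call: beta_i <- draw_beta_block(X, y_i, 1, 1, psi_i, theta_i)
```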

4.2 Fast sampling normal scale mixtures

Bhattacharya et al. (2016) present an alternative approach for sampling from the posteriors of linear regression models with normal scale mixture priors. The idea is to use data augmentation and a series of linear transformations to avoid the inversion of \((\tilde {\text {X}}^{\prime }\tilde {\text {X}}+{\Lambda }_{*}^{-1})\). Instead, their algorithm replaces the inversion of \(\tilde {\text {X}}^{\prime }\tilde {\text {X}}\) with the inversion of \(\tilde {\text {X}}\tilde {\text {X}}^{\prime }\). For a multiple regression model, this means the matrix being inverted is n × n instead of p × p and the proposed algorithm has complexity that is linear in p. In the context of our SUR model, the fast sampling algorithm has complexity \(\mathcal {O}(n^{2}p^{2})\) if Σ is diagonal or \(\mathcal {O}(n^{2}p^{4})\) if Σ is unrestricted.
Since the original algorithm was also developed for typical shrinkage priors centered at zero, we present a modified algorithm to allow for the nonzero mean structure, which we denote as \(\bar {\beta }(\theta )\), in the proposed hierarchical priors:
1.
Sample \(u\sim \text {N}(\bar {\beta }(\theta ),{\Lambda }_{*})\) and \(\delta \sim \text {N}(0,I_{np})\);
 
2.
Set \(v=\tilde {\text {X}}u + \delta \);
 
3.
Compute \(w=(\tilde {\text {X}}{\Lambda }_{*}\tilde {\text {X}}^{\prime }+I_{np})^{-1}(\tilde {y}-v)\);
 
4.
Set \(\beta =u+{\Lambda }_{*} \tilde {\text {X}}^{\prime }w\).
 
A constructive proof that β retains the posterior in Eq. 20 is provided in Appendix A.2. Note that the computational gains come from the third step, which requires inverting the np × np matrix \((\tilde {\text {X}}{\Lambda }_{*}\tilde {\text {X}}^{\prime }+I_{np})\) rather than the original p² × p² precision matrix \((\tilde {\text {X}}^{\prime }\tilde {\text {X}}+{\Lambda }_{*}^{-1})\). This also shows that the computational gains are largest when p is much larger than n.
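A compact R sketch of the four steps for a generic transformed design is given below. It assumes a diagonal prior covariance Λ* (stored as a vector of diagonal entries) and placeholder inputs; it illustrates the algorithm rather than reproducing the paper's Rcpp implementation.

```r
# One draw of beta from its conditional posterior via the modified
# Bhattacharya et al. (2016) algorithm: prior beta ~ N(beta_bar, Lambda_star)
# with Lambda_star diagonal, likelihood y_tilde ~ N(X_tilde %*% beta, I).
fast_draw_beta <- function(X_tilde, y_tilde, lambda_star, beta_bar) {
  N <- nrow(X_tilde); P <- ncol(X_tilde)
  u     <- beta_bar + sqrt(lambda_star) * rnorm(P)   # step 1: u ~ N(beta_bar, Lambda_star)
  delta <- rnorm(N)                                  #         delta ~ N(0, I)
  v     <- as.vector(X_tilde %*% u) + delta          # step 2
  M     <- X_tilde %*% (lambda_star * t(X_tilde)) + diag(N)
  w     <- solve(M, y_tilde - v)                     # step 3: invert N x N, not P x P
  as.vector(u + lambda_star * as.vector(t(X_tilde) %*% w))  # step 4
}

# Toy usage with placeholder dimensions (P >> N, as in the demand application).
set.seed(1)
N <- 50; P <- 200
X_tilde <- matrix(rnorm(N * P), N, P)
y_tilde <- as.vector(X_tilde %*% rnorm(P, sd = 0.5)) + rnorm(N)
beta_draw <- fast_draw_beta(X_tilde, y_tilde, lambda_star = rep(1, P), beta_bar = rep(0, P))
```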

4.3 Scalability

To provide practical insights into the computational gains afforded by the fast sampling algorithm above, we draw from the posterior of the elasticity vector β using data generated with n = 100 and \(p\in \{100,200,300,\dotsc ,1000\}\). In addition to the fast sampler of Bhattacharya et al. (2016), we also provide results for a “standard” sampler that inverts the p² × p² precision matrix \((\tilde {\text {X}}^{\prime }\tilde {\text {X}}+{\Lambda }_{*}^{-1})\) via Cholesky decompositions (see, e.g., chapters 2.12 and 3.5 of Rossi et al., 2005). In both cases we assume Σ is diagonal. The samplers are coded in Rcpp (Eddelbuettel and François, 2011) and run on a MacBook Pro laptop with 32GB of RAM and an Apple M1 Max processor. Figure 2 plots the computation time in log seconds against the number of products p. We find that the fast sampler offers significant computational savings: it is roughly two times faster when p = 200, 10 times faster when p = 500, and 30 times faster when p = 1000.

5 Simulation experiments

We first explore the performance of hierarchical global-local priors using simulated data. We generate data from price elasticity matrices that have different underlying structures in order to illustrate differences between sparse and non-sparse shrinkage, as well as differences in the heaviness of shrinkage used across levels in the tree. Each of these structures is described in detail below.
(I)
Hierarchical + Dense: A dense vector of elasticities βij is generated from a three-level hierarchical prior. The top-level coefficients 𝜃π(ij|2) are sampled from a uniform distribution over [-3,-1] ∪ [1,3] and we fix \(\lambda _{\pi (ij|\ell )}^{2}=1\) and \(\tau _{\ell }^{2}=1\) for all (i,j) and across all levels. The middle-level coefficients 𝜃π(ij|1) and product elasticities βij are generated through the model outlined in Section 3.2. In this specification, all pairs of goods have a non-zero cross-price elasticity.
 
(II)
Hierarchical + Sparse After Transformation: A dense vector of elasticities βij is generated from a three-level hierarchical prior where 75% of the product-level local variances Ψij are set to zero so that the corresponding product-level elasticity βij is exactly equal to its prior mean. Thus, all pairs of product groups have a non-zero cross-price elasticity and many product-level elasticities are exactly equal to the group-level elasticity. This creates a structure where β appears dense, but is sparse after subtracting off the prior mean parameters 𝜃π(ij|1).
 
(III)
Hierarchical + Group-Wise Sparse: A “group-wise” sparse vector of elasticities βij is generated from a three-level hierarchical prior where 75% of the level-1 coefficients 𝜃π(ij|1) and local variances Ψπ(ij|1) are both set to zero. This allows the product-level elasticities to inherit their sparsity from higher levels in the tree, which in turn produces blocks of cross elasticities that are either all dense or exactly equal to zero.
 
(IV)
Sparse: A sparse vector of elasticities βij is generated from a non-hierarchical prior with 95% of all elasticities set to zero.
 
For each specification above, we generate data from the regression model in Eq. 1 with dimensions (n = 50, p = 100) and (n = 100, p = 300). For specifications (I)–(III), we use a three-level classification tree with five equal-sized groups on the top level and 10 equal-sized groups in the middle level. Across all specifications, we simulate the own elasticities from a Unif(-5,-1) distribution, fix ϕi = 0, define \({\Sigma }=\text {diag}({\sigma _{1}^{2}},\dotsc ,{\sigma _{p}^{2}})\) with \({\sigma _{j}^{2}}=1\), and generate elements of data matrices X and Cj from a N(0,1) distribution. We generate 25 data sets from each specification. Examples of the resulting elasticity matrices are shown in Fig. 3.
We take a total of nine models to the data. The first three models impose standard shrinkage with fixed shrinkage points set to zero. The second three models include a hierarchical prior on β with ridge shrinkage for the upper-level parameters 𝜃. The final three models include a hierarchical prior on β with horseshoe shrinkage at the upper level. Within each batch, we specify three types of mixing distributions on \(\lambda _{ij}^{2}\) which in turn induce three types of shrinkage for β: ridge, lasso, and horseshoe. All models are fit using the MCMC sampling algorithms outlined in Section 4. In Appendix B.1, we plot the posterior means and 95% credible intervals for elasticity parameters in the (𝜃-Ridge, β-Ridge) model to demonstrate that parameters can be accurately recovered.
Model fit statistics are reported in Table 2. The top panel reports results from the (n = 50, p = 100) data sets and the bottom panel reports results from the (n = 100, p = 300) data sets. The first four columns report the root mean squared error (RMSE) associated with the estimation of β in each specification, averaged across the 25 data replicates. The results illustrate the relative strengths of each type of shrinkage prior. In specification (I) where β is purely dense, the hierarchical prior with ridge shrinkage across all levels of the tree fits best. Not surprisingly, a ridge prior is most appropriate when the true parameter is dense. In specification (II) where β appears dense but is sparse after subtracting off the prior mean, the hierarchical prior with horseshoe shrinkage for β and ridge shrinkage for 𝜃 is best. This shows how heavy horseshoe shrinkage can still be desirable for parameters that are sparse under some transformation of the data. In specification (III) where β is “group-wise” sparse, a hierarchical prior with horseshoe shrinkage for 𝜃 and ridge shrinkage for β is best. By forcing β to inherit its heaviness from 𝜃, the prior for β still exhibits heavy shrinkage—reaffirming some of the theory results in Section 3.4—and behaves like group-wise variable selection. In all three specifications, all hierarchical priors substantially outperform the standard shrinkage priors. In specification (III), unlike specifications (I) and (II), all 𝜃-Horseshoe priors outperform all 𝜃-Ridge priors. Finally, in specification (IV) where β is sparse in the traditional sense, then the standard horseshoe prior with fixed shrinkage at zero is best. However, either of the hierarchical priors with horseshoe shrinkage for β also offer competitive performance.
Table 2
Simulation results

Data: n = 50, p = 100

                                Estimation RMSE                    Correct Signs
                                (I)     (II)    (III)   (IV)       (I)    (II)   (III)  (IV)
Standard Shrinkage
  β-Ridge                       2.541   2.375   1.219   0.492      0.77   0.78   0.87   1.00
  β-Lasso                       2.622   2.473   1.079   0.314      0.76   0.77   0.89   1.00
  β-Horseshoe                   2.879   2.745   0.969   0.098      0.73   0.74   0.88   1.00
Hierarchical Shrinkage
  𝜃-Ridge, β-Ridge              1.033   0.530   0.515   0.490      0.94   0.97   0.94   1.00
  𝜃-Ridge, β-Lasso              1.047   0.461   0.451   0.313      0.93   0.98   0.95   1.00
  𝜃-Ridge, β-Horseshoe          1.126   0.385   0.407   0.102      0.92   0.98   0.95   1.00
  𝜃-Horseshoe, β-Ridge          1.044   0.526   0.190   0.472      0.93   0.98   0.98   1.00
  𝜃-Horseshoe, β-Lasso          1.053   0.461   0.192   0.310      0.93   0.98   0.98   1.00
  𝜃-Horseshoe, β-Horseshoe      1.130   0.386   0.270   0.100      0.92   0.98   0.97   1.00

Data: n = 100, p = 300

                                Estimation RMSE                    Correct Signs
                                (I)     (II)    (III)   (IV)       (I)    (II)   (III)  (IV)
Standard Shrinkage
  β-Ridge                       2.938   2.743   1.275   0.545      0.72   0.73   0.83   0.99
  β-Lasso                       2.972   2.790   1.232   0.427      0.72   0.73   0.84   1.00
  β-Horseshoe                   3.225   3.045   1.185   0.068      0.69   0.70   0.84   1.00
Hierarchical Shrinkage
  𝜃-Ridge, β-Ridge              1.162   0.585   0.530   0.544      0.94   0.98   0.93   0.99
  𝜃-Ridge, β-Lasso              1.170   0.547   0.488   0.427      0.93   0.98   0.94   1.00
  𝜃-Ridge, β-Horseshoe          1.253   0.503   0.471   0.069      0.93   0.99   0.94   1.00
  𝜃-Horseshoe, β-Ridge          1.169   0.586   0.170   0.543      0.93   0.98   0.98   0.99
  𝜃-Horseshoe, β-Lasso          1.172   0.548   0.174   0.426      0.93   0.98   0.98   1.00
  𝜃-Horseshoe, β-Horseshoe      1.255   0.508   0.233   0.068      0.93   0.99   0.97   1.00

Fit statistics are averaged across 25 simulated data sets. Columns denote the four different true parameter specifications: (I) hierarchical + dense; (II) hierarchical + sparse after transformation; (III) hierarchical + group-wise sparse; and (IV) sparse. Standard shrinkage priors have a fixed mean of zero and do not incorporate any hierarchical structure. Hierarchical shrinkage priors are based on a three-level tree.
The last four columns of Table 2 report the share of signs that are correctly estimated, averaging across data replicates. Producing incorrectly signed elasticity estimates—e.g., negative cross elasticity estimates for goods that are obviously substitutes—is one of the long-standing empirical problems with unrestricted log-linear models (Allenby, 1989; Blattberg & George, 1991; Boatwright et al., 1999). We find that hierarchical priors are largely able to mitigate this issue. For example, while the standard shrinkage priors correctly estimate 73–78% of signs in specifications (I) and (II), hierarchical priors correctly estimate 92–98%.
So far we have assumed that the product classification tree fit to the data matches that of the data generating process. In reality, we do not know whether there is a true hierarchical structure governing cross elasticities, let alone what that structure looks like. In this paper we take the view that the category definitions of the retailer or data provider can be used as a reasonable proxy for boundaries in substitution, though this may ultimately be an empirical question (see, e.g., Smith et al., 2019). In Appendix B we provide additional simulations that explore robustness to tree misspecification. We conduct three sets of simulations that vary in the degree of tree misspecification. We find that hierarchical shrinkage can still provide gains (and is no worse than standard shrinkage) in the presence of misspecification, although the magnitude of these gains vanishes as the degree of misspecification grows. This suggests that hierarchical priors are fairly robust to misspecification of the tree, but performance gains over standard sparse approaches depend on the quality of the tree.

6 Empirical application

6.1 Data

We apply log-linear demand models with standard and hierarchical shrinkage priors to IRI store-level transaction data (Bronnenberg et al., 2008). We use data from the largest grocery retail store in Pittsfield, Massachusetts for the two-year period spanning 2011–2012. We use the first 78 weeks as training data and the last 26 weeks as holdout data to evaluate model fit. Although there is the potential to add data from other chains and markets, we take the perspective of the retailer who wants to estimate store-level elasticities to allow for more granular customization of marketing instruments (Montgomery, 1997).
The scope of products included in a large-scale demand analysis usually takes one of two forms: (i) narrow and deep, where products come from only one category or subcategory but are defined at a very granular level like the UPC; or (ii) wide and shallow, where products span many categories but are defined at a higher level of aggregation. Here we take the latter approach, in part because a UPC-level analysis often creates challenges for log-linear models due to the potentially high incidence of zero quantities. Estimating demand for wider multi-category assortments also highlights the general flexibility of log-linear systems—e.g., the ability to accommodate a mix of substitutes and complements.
Our procedure for selecting products is as follows. We first choose nine product categories that cover a large portion of total expenditure. Within each category, we remove any subcategories that account for less than 2% of within-category revenue, resulting in a total of 28 subcategories. We then aggregate UPCs to the category-subcategory-brand level using total volume (in ounces) and revenue-weighted prices (per ounce). We sort brands by their within-subcategory revenue and then keep top brands that together capture 85% of subcategory revenue and are also present in at least 95% of weeks in the data. This results in p = 275 total products, comprising over 170,000 UPCs, that make up 81.5% of total revenue within the nine chosen categories. A list of the categories and subcategories used in our analysis is provided in Table 3.
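A rough sketch of this aggregation and screening logic in dplyr is shown below. The table and column names (upc_weekly, ounces, dollars, and so on) are hypothetical placeholders, and the order in which the revenue-coverage and weeks-present filters are applied may differ from our implementation.
```r
library(dplyr)

# 'upc_weekly' is a hypothetical UPC-week table with columns:
# category, subcategory, brand, week, ounces, dollars
n_weeks <- n_distinct(upc_weekly$week)

brands <- upc_weekly %>%
  group_by(category, subcategory, brand, week) %>%
  summarise(volume = sum(ounces),
            price  = sum(dollars) / sum(ounces),   # revenue-weighted price per ounce
            .groups = "drop")

keep <- brands %>%
  group_by(category, subcategory, brand) %>%
  summarise(revenue = sum(price * volume), nweeks = n_distinct(week),
            .groups = "drop_last") %>%
  arrange(desc(revenue), .by_group = TRUE) %>%
  mutate(cumshare = cumsum(revenue) / sum(revenue)) %>%          # within-subcategory share
  filter(lag(cumshare, default = 0) < 0.85, nweeks >= 0.95 * n_weeks)
```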
Table 3
Summary of product categories

BEER/ALE (62 products, 16.1% of revenue): Domestic Beer/Ale; Imported Beer/Ale
CARBONATED BEVERAGES (30 products, 19.1%): Low Calorie Soft Drinks; Regular Soft Drinks; Seltzer/Tonic/Club Soda
COFFEE (27 products, 8.5%): Ground Coffee; Ground Decaffeinated Coffee; Instant Coffee; Single Cup Coffee; Whole Coffee Beans
COLD CEREAL (53 products, 9.3%): Ready-to-Eat Cereal
FZ DINNERS/ENTREES (36 products, 7.8%): Fz Handheld Entrees; Multi-Serve Fz Dinners; Single-Serve Fz Dinners
FZ PIZZA (7 products, 5.0%): Fz Pizza
MILK (10 products, 12.9%): Rfg Almond Milk; Rfg Flavored Milk/Eggnog/Buttermilk; Rfg Skim/Lowfat Milk; Rfg Soy Milk; Rfg Whole Milk
SALTY SNACKS (35 products, 13.4%): Cheese Snacks; Corn Snacks; Other Salted Snacks; Potato Chips; Pretzels; Ready to Eat Popcorn/Caramel Corn; Tortilla/Tostada Chips
YOGURT (15 products, 7.9%): Rfg Yogurt
Total: 9 categories, 28 subcategories, 275 products, 100% of revenue

In our analysis, we aggregate over flavors and pack sizes and define the product at the brand level. In this product classification tree, the category is the top level (ℓ = 2), the subcategory is the middle level (ℓ = 1), and the product is the bottom level (ℓ = 0). Here there are 9² = 81 category elasticities 𝜃π(ij|2), 28² = 784 subcategory elasticities 𝜃π(ij|1), and 275² = 75,625 product elasticities βij. See Table 1 for a complete glossary of parameters associated with each level of the tree.

6.2 Models

We estimate log-linear demand models of the form in Eq. 1 assuming a diagonal error variance matrix Σ. In addition to the high-dimensional vector of prices, we also include product intercepts, summer and holiday dummy variables, and feature and display promotion variables as controls. We consider three different types of shrinkage priors for the price elasticity parameters: standard (where the prior mean is fixed to zero), hierarchical with ridge shrinkage at the upper levels, and hierarchical with horseshoe shrinkage at the upper levels. At the product level we consider ridge, lasso, and horseshoe shrinkage. Together, this results in a total of nine models.
Across all models, we allow the own elasticities to have a different prior than the cross elasticities to account for differences in their expected sign and magnitude. Specifically, in the sparse models we allow the own elasticities to have a separate global variance parameter. In the hierarchical models, we specify separate hierarchical priors for the own and cross elasticities. Each is based on a three-level tree (products at ℓ = 0, subcategories at ℓ = 1, categories at ℓ = 2) where shrinkage starts to propagate at the subcategory level. A complete list of prior and hyperparameter specifications is provided in Appendix C.1. All models are estimated using the MCMC algorithms outlined in Section 4, which are run for 100,000 iterations and thinned by keeping every 100th draw. The standard shrinkage models, hierarchical 𝜃-Ridge models, and hierarchical 𝜃-Horseshoe models take roughly two minutes, four minutes, and six minutes per 1,000 iterations, respectively. Mixing diagnostics are reported in Appendix C.

6.3 Results

6.3.1 Predictive fit

Table 4 reports the out-of-sample RMSEs associated with three sets of demand predictions. The first two columns in Table 4 report the mean and standard deviation of RMSEs across all 275 products. The middle two columns report RMSEs for a subset of 152 products with extrapolated price levels in the test sample. A product is included in this set if at least one price point in the test sample falls outside of its range of prices in the training sample. The final two columns report RMSEs for a subset of 69 products with limited price variation. A product j is included in this set if \(\text {SD}_{j} = \text {SD}(\log p_{j1},\dotsc ,\log p_{jn})\) falls in the lower quartile of the distribution of \(\text {SD}_{1},\dotsc ,\text {SD}_{275}\). Together, these three types of predictions allow us to examine model performance across a range of forecasting scenarios that retailers may face: predicting demand for an entire assortment, predicting demand at counterfactual price levels, and predicting demand for “low signal” goods.
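As an illustration, the three product sets can be constructed from week-by-product log-price matrices along the following lines. This is a sketch only; train_logp and test_logp are assumed objects, not taken from our code.
```r
# log-price matrices for the 78 training and 26 holdout weeks (names are illustrative)
sd_j <- apply(train_logp, 2, sd)                        # SD_j = SD(log p_j) in training data

extrapolated <- apply(test_logp, 2, max) > apply(train_logp, 2, max) |
  apply(test_logp, 2, min) < apply(train_logp, 2, min)  # any holdout price outside training range

limited <- sd_j <= quantile(sd_j, 0.25)                 # lower quartile of price variation
```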
Table 4
Out-of-sample fit statistics

                                All Products         Extrapolated          Limited
                                                     Price Levels          Price Variation
                                Mean    SD           Mean    SD            Mean    SD
Standard Shrinkage
  β-Ridge                       0.846   (0.126)      0.871   (0.190)       0.968   (0.347)
  β-Lasso                       0.847   (0.117)      0.865   (0.181)       0.960   (0.331)
  β-Horseshoe                   0.895   (0.170)      0.937   (0.258)       1.085   (0.449)
Hierarchical Shrinkage
  𝜃-Ridge, β-Ridge              0.808   (0.104)      0.816   (0.148)       0.895   (0.276)
  𝜃-Ridge, β-Lasso              0.814   (0.113)      0.824   (0.170)       0.909   (0.314)
  𝜃-Ridge, β-Horseshoe          0.899   (0.131)      0.889   (0.194)       1.004   (0.347)
  𝜃-Horseshoe, β-Ridge          0.842   (0.119)      0.825   (0.168)       0.919   (0.307)
  𝜃-Horseshoe, β-Lasso          0.823   (0.117)      0.821   (0.169)       0.908   (0.308)
  𝜃-Horseshoe, β-Horseshoe      0.993   (0.159)      0.845   (0.162)       0.902   (0.305)

Out-of-sample RMSEs are reported for: (i) all 275 products; (ii) the 152 products with extrapolated price levels in the test sample; (iii) the 69 products in the lower quartile of the distribution of price variation. Standard shrinkage priors have a fixed mean of zero and do not incorporate any hierarchical structure. Hierarchical shrinkage priors are based on a three-level tree with means estimated at each level.
We find that the (𝜃-Ridge, β-Ridge) hierarchical prior provides the best predictions across all three tasks. Relative to the best-performing standard shrinkage prior, the (𝜃-Ridge, β-Ridge) prior generates a 4.5% improvement in predictions across all products, a 5.7% improvement for products with extrapolated price levels, and a 6.8% improvement for products with limited price variation. We also find that horseshoe shrinkage at the product level tends to perform worst, regardless of whether and how the prior mean is parameterized. One possible explanation is that the true elasticities are actually dense, rather than sparse, in which case ridge shrinkage should perform well (as shown in the simulations in Section 5). A second possibility is that the true magnitude of the elasticities may not be large enough to warrant a heavy-tailed prior like the horseshoe. In reality, it may be a combination of both; we will revisit this question as we analyze estimates from each model below.

6.3.2 Product-level elasticities

Next, we compare the estimated product-level elasticities from each model. Table 5 reports the overall average and the 10th, 50th, and 90th percentiles of the distribution of posterior means of βii and βij. We also report the share of own elasticities that are negative and the share of own elasticities that are significant at the 5% level. Complete distributions of price elasticity and promotion effect estimates are shown in Appendix C.3. Elasticity estimates are markedly different across prior specifications. Starting with own elasticities, we find that standard priors produce distributions of own elasticities where roughly 84% of estimates are negative, 21% of estimates are significant, and the average (median) own elasticity is around -1.2 (-0.9). In contrast, hierarchical priors produce distributions of own elasticity estimates where 93% are negative, 50% are significantly away from zero, and the average (median) own elasticity is around -1.5 (-1.4). We believe that being able to produce more economically reasonable and precise own elasticity estimates by simply shrinking to higher-level elasticities rather than zero is a strength of our approach.
Table 5
Summary of product-level price elasticity estimates

                                Own Elasticities βii                              Cross Elasticities βij
                                Neg    Sig    Mean    10th    50th    90th       Mean     10th     50th     90th
Standard Shrinkage
  β-Ridge                       84.0   25.8   -1.20   -2.97   -1.09    0.41      -0.000   -0.000   -0.000    0.000
  β-Lasso                       83.6   21.5   -1.13   -2.95   -0.86    0.28      -0.000   -0.000   -0.000    0.000
  β-Horseshoe                   85.1   16.7   -1.23   -3.38   -0.82    0.26      -0.001   -0.004   -0.000    0.003
Hierarchical Shrinkage
  𝜃-Ridge, β-Ridge              93.5   41.5   -1.58   -3.01   -1.53   -0.28      -0.004   -0.023   -0.004    0.012
  𝜃-Ridge, β-Lasso              94.2   42.9   -1.55   -2.87   -1.47   -0.27      -0.004   -0.024   -0.004    0.016
  𝜃-Ridge, β-Horseshoe          89.5   49.5   -1.42   -2.64   -1.40    0.09      -0.001   -0.121   -0.002    0.114
  𝜃-Horseshoe, β-Ridge          97.5   54.9   -1.63   -2.88   -1.48   -0.59      -0.004   -0.020   -0.001    0.013
  𝜃-Horseshoe, β-Lasso          97.1   61.1   -1.57   -2.77   -1.46   -0.64      -0.004   -0.019   -0.003    0.010
  𝜃-Horseshoe, β-Horseshoe      87.6   49.5   -1.30   -2.56   -1.14    0.05      -0.008   -0.086   -0.004    0.071

Percentiles of the distribution of own and cross-price elasticity estimates are reported for each model. We also report the share of negative own elasticities (“Neg”) and the share of own elasticities whose 95% credible intervals do not contain zero (“Sig”). Standard shrinkage priors have a fixed mean of zero and do not incorporate any hierarchical structure. Hierarchical shrinkage priors are based on a three-level tree.
The distribution of cross elasticity estimates also differs across prior specifications. We are estimating 75,350 cross elasticities using less than 100 weeks of training data, and so it is not surprising that the prior imposes heavy regularization. Because typical sparse priors shrink estimates towards zero, the associated distribution of estimates is highly concentrated around zero. Hierarchical priors produce distributions of estimates that are also centered around zero, but spread mass more widely in the (-0.1, 0.1) interval. The exact shape of the distribution depends on the type of shrinkage imposed at each level of the tree. Interestingly, the (𝜃-Horseshoe, β-Horseshoe) prior generates a more widely dispersed distribution of elasticities than the (𝜃-Ridge, β-Ridge) prior. This is a consequence of hierarchical priors: endowing βij with a potentially non-zero prior mean leads to a build-up of mass away from zero when shrinkage is strong. Differences in the shape of cross-elasticity estimates can also be seen in the histograms reported in Appendix C.3.

6.3.3 Higher-level elasticities

In addition to product-level elasticities, we can also examine higher-level group elasticities in any of the hierarchical priors. For the sake of brevity, we focus our discussion on estimates from the (𝜃-Ridge, β-Ridge) prior. Figure 4 plots the distributions of higher-level elasticities at the category and subcategory level. The figure on the left plots own elasticities and the figure on the right plots cross elasticities, where the within-group elasticities (green lines) are separated from the across-group elasticities (red lines). Here, the within-group elasticities represent the diagonal elements of the category or subcategory-level elasticity matrix. For example, the BEER within-category elasticity represents the average cross-price elasticity of all products within the BEER category. Similarly, the Domestic Beer/Ale within-subcategory elasticity represents the average cross-price elasticity of all products within the Domestic Beer/Ale subcategory.
Figure 4 illustrates two economic properties of elasticities that are not imposed on the model but are a consequence of hierarchical shrinkage. First, own elasticities tend to be negative and are slightly more negative at the subcategory level than at the category level. The fact that product-level own elasticities are shrunk towards these negative values rather than zero is one reason why own elasticities tend to be more negative under hierarchical priors (as shown in Table 5). Second, cross elasticities within each category/subcategory are mostly positive and shifted to the right of the distribution of elasticities across categories/subcategories. This suggests that substitution and product competition tend to be strongest within each category/subcategory.
Table 6 provides further insight into the nature of category and subcategory-level substitution. For each category/subcategory, we report the associated largest (most positive) elasticity and smallest (most negative) elasticity. We find that for six of the nine categories, demand changes most in response to price changes within the same category (i.e., the within elasticities shown in Fig. 4). For example, the largest elasticity associated with the demand of BEER/ALE is the price of BEER/ALE. This is to be expected if competition is stronger within categories than across categories. Many of the cross-category relationships also appear reasonable. For example, BEER/ALE and SALTY SNACKS, CARBONATED BEVERAGES and SALTY SNACKS, and MILK and COLD CEREAL are all pairs of strong complements. Results at the subcategory level are similar. For 15 of the 28 subcategories, the largest elasticity corresponds to either the within-subcategory elasticity or the cross elasticity with another subcategory in the same category (e.g., Domestic vs. Imported Beer/Ale). Many of the strong cross-subcategory complements, such as Imported Beer/Ale and Tortilla/Tostada Chips or Rfg Skim/Lowfat Milk and Ready-to-Eat Cereal, appear to be inherited from their respective cross-category parent elasticities.
Table 6
Summaries of higher-level cross-price elasticities from the (𝜃-Ridge, β-Ridge) Hierarchical Prior

Category                                 Largest (Most Positive) Elasticity           Smallest (Most Negative) Elasticity
  1  BEER/ALE                            BEER/ALE                       0.014          SALTY SNACKS                  -0.017
  2  CARBONATED BEVERAGES                CARBONATED BEVERAGES           0.015          SALTY SNACKS                  -0.020
  3  COFFEE                              COFFEE                         0.009          BEER/ALE                      -0.008
  4  COLD CEREAL                         CARBONATED BEVERAGES           0.007          MILK                          -0.023
  5  FZ DINNERS/ENTREES                  FZ DINNERS/ENTREES             0.004          CARBONATED BEVERAGES          -0.019
  6  FZ PIZZA                            FZ PIZZA                       0.002          COFFEE                        -0.006
  7  MILK                                MILK                           0.005          COLD CEREAL                   -0.010
  8  SALTY SNACKS                        FZ DINNERS/ENTREES             0.002          MILK                          -0.015
  9  YOGURT                              FZ PIZZA                       0.003          SALTY SNACKS                  -0.010

Subcategory                              Largest (Most Positive) Elasticity           Smallest (Most Negative) Elasticity
  1  Domestic Beer/Ale                   Domestic Beer/Ale                    0.040    Ground Decaffeinated Coffee          -0.046
  2  Imported Beer/Ale                   Domestic Beer/Ale                    0.050    Tortilla/Tostada Chips               -0.041
  3  Low Calorie Soft Drinks             Regular Soft Drinks                  0.034    Tortilla/Tostada Chips               -0.033
  4  Regular Soft Drinks                 Regular Soft Drinks                  0.024    Tortilla/Tostada Chips               -0.033
  5  Seltzer/Tonic Water/Club Soda       Low Calorie Soft Drinks              0.022    Corn Snacks                          -0.021
  6  Ground Coffee                       Ground Coffee                        0.030    Domestic Beer/Ale                    -0.016
  7  Ground Decaffeinated Coffee         Single Serve Fz Dinners/Entrees      0.013    Domestic Beer/Ale                    -0.020
  8  Instant Coffee                      Whole Coffee Beans                   0.008    Rfg Yogurt                           -0.019
  9  Single Cup Coffee                   Instant Coffee                       0.010    Domestic Beer/Ale                    -0.011
 10  Whole Coffee Beans                  Single Serve Fz Dinners/Entrees      0.014    Ready-to-Eat Cereal                  -0.017
 11  Ready-to-Eat Cereal                 Regular Soft Drinks                  0.017    Rfg Flavored Milk/Eggnog/Buttermilk  -0.049
 12  Fz Handheld Entrees                 Domestic Beer/Ale                    0.009    Regular Soft Drinks                  -0.031
 13  Multi Serve Fz Dinners/Entrees      Single Serve Fz Dinners/Entrees      0.023    Low Calorie Soft Drinks              -0.028
 14  Single Serve Fz Dinners/Entrees     Single Serve Fz Dinners/Entrees      0.041    Regular Soft Drinks                  -0.030
 15  Fz Pizza                            Ready-to-Eat Popcorn/Caramel Corn    0.006    Single Serve Fz Dinners/Entrees      -0.018
 16  Rfg Almond Milk                     Rfg Yogurt                           0.015    Ready-to-Eat Cereal                  -0.010
 17  Rfg Flavored Milk/Eggnog/Buttermilk Rfg Flavored Milk/Eggnog/Buttermilk  0.008    Ready-to-Eat Cereal                  -0.016
 18  Rfg Skim/Lowfat Milk                Rfg Whole Milk                       0.008    Ready-to-Eat Cereal                  -0.019
 19  Rfg Soy Milk                        Rfg Flavored Milk/Eggnog/Buttermilk  0.007    Ready-to-Eat Cereal                  -0.013
 20  Rfg Whole Milk                      Rfg Whole Milk                       0.007    Domestic Beer/Ale                    -0.014
 21  Cheese Snacks                       Domestic Beer/Ale                    0.002    Rfg Almond Milk                      -0.017
 22  Corn Snacks                         Single Serve Fz Dinners/Entrees      0.007    Rfg Yogurt                           -0.015
 23  Other Salted Snacks                 Single Serve Fz Dinners/Entrees      0.012    Rfg Flavored Milk/Eggnog/Buttermilk  -0.019
 24  Potato Chips                        Potato Chips                         0.010    Rfg Flavored Milk/Eggnog/Buttermilk  -0.019
 25  Pretzels                            Fz Handheld Entrees                  0.010    Rfg Whole Milk                       -0.015
 26  Ready-to-Eat Popcorn/Caramel Corn   Single Serve Fz Dinners/Entrees      0.014    Rfg Soy Milk                         -0.016
 27  Tortilla/Tostada Chips              Single Serve Fz Dinners/Entrees      0.008    Rfg Whole Milk                       -0.020
 28  Rfg Yogurt                          Potato Chips                         0.011    Ready-to-Eat Cereal                  -0.022

The top section of the table reports posterior means of selected category elasticities 𝜃π(ij|2). The bottom section of the table reports posterior means of selected subcategory elasticities 𝜃π(ij|1). See Table 1 for a glossary of parameters associated with each level of the tree.

6.3.4 Shrinkage factors

In addition to the summary of elasticity parameters presented above, we can also summarize the variance parameters to learn about the strength of shrinkage imposed by the prior. To this end, we explore the posterior distribution of product-level shrinkage factors, which are common summary measures used to convey the strength of shrinkage in global-local models (Carvalho et al., 2010; Bhadra et al., 2019). Shrinkage factors measure the extent to which the posterior mean is pulled towards the prior. Formally, the posterior mean of the regression coefficient β takes the form:
$$ \mathbb{E}(\beta|\text{data})= (\tilde{\text{X}}^{\prime}\tilde{\text{X}}+{\Lambda}_{*}^{-1})^{-1}\tilde{\text{X}}^{\prime}\tilde{\text{X}}\hat{\beta}+(\tilde{\text{X}}^{\prime}\tilde{\text{X}}+{\Lambda}_{*}^{-1})^{-1}{\Lambda}_{*}^{-1}\bar{\beta}(\theta) $$
(24)
which is a weighted average of the maximum likelihood estimator \(\hat {\beta }\) and the prior mean \(\bar {\beta }(\theta )\). The weight on the prior mean, \((\tilde {\text {X}}^{\prime }\tilde {\text {X}}+{\Lambda }_{*}^{-1})^{-1}{\Lambda }_{*}^{-1}\), defines the matrix of shrinkage factors. For simplicity, we use the component-wise approximation:
$$ \mathbb{E}(\beta_{ij}|\text{data}) \approx (1-\kappa_{ij})\hat{\beta}_{ij}+\kappa_{ij}\bar{\beta}(\theta)_{ij} $$
(25)
where
$$ \kappa_{ij}=\frac{1}{1+n{s_{j}^{2}}\sigma_{i}^{-2}\tau_{\beta}^{2}{\Psi}_{ij}} $$
(26)
and \({s_{j}^{2}}=\text {Var}(\log p_{j})\). Notice that \(n{s_{j}^{2}}/{\sigma _{i}^{2}}\) can be interpreted as a signal-to-noise ratio. When the signal dominates the noise, κij → 0 and the posterior mean of βij converges to \(\hat {\beta }_{ij}\); when the noise dominates, then κij → 1 and the posterior mean of βij converges to the prior mean.
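As a sketch, the component-wise shrinkage factor in Eq. 26 can be evaluated draw-by-draw and then summarized by its posterior median. The object names below (train_logp, sigma2_draws, tau2_cross_draws, psi_draws) are illustrative placeholders rather than objects from our code.
```r
# Component-wise shrinkage factor in Eq. 26
kappa_ij <- function(n, s2j, sigma2i, tau2, psi) 1 / (1 + n * s2j / sigma2i * tau2 * psi)

s2 <- apply(train_logp, 2, var)   # s_j^2 = Var(log p_j) over training weeks
# posterior median shrinkage factor for pair (i, j), given MCMC draws:
# median(kappa_ij(n, s2[j], sigma2_draws[, i], tau2_cross_draws, psi_draws[, i, j]))
```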
Table 7 reports summary statistics of global variances and product-level price elasticity shrinkage factors across model specifications. We first compute the posterior median of each κij and then report summary statistics across the distribution of estimates. The first finding is that there is a sizable difference in the strength of shrinkage between own and cross-price elasticities. We find that estimates of \(\tau _{\beta \text {cross}}^{2}\) tend to be four to five orders of magnitude smaller than \(\tau _{\beta \text {own}}^{2}\). Consequently, estimates of κij (for ij) tend to be bunched at one while estimates of κii are more dispersed throughout the unit interval. One explanation for this difference in shrinkage is that retail scanner data tends to exhibit a stronger signal for estimating own elasticities than cross elasticities (Hitsch et al., 2021). Our estimation problem is also high-dimensional as we are estimating 75,350 cross elasticity parameters from 78 weeks of training data.
Table 7
Global variances and shrinkage factors

                                Own Elasticities                              Cross Elasticities
                                                         κii                                            κij
                                \(\tau_{\beta\text{own}}^{2}\)   Min    Mean   Max    \(\tau_{\beta\text{cross}}^{2}\)   Min    Mean   Max
Standard Shrinkage
  β-Ridge                       5.11       0.01   0.15   1.00         2.51E-06   1.00   1.00   1.00
  β-Lasso                       5.07       0.01   0.19   1.00         1.13E-05   1.00   1.00   1.00
  β-Horseshoe                   7.33       0.00   0.19   1.00         6.84E-06   0.01   1.00   1.00
Hierarchical Shrinkage
  𝜃-Ridge, β-Ridge              1.77       0.02   0.29   1.00         9.63E-05   1.00   1.00   1.00
  𝜃-Ridge, β-Lasso              1.82       0.02   0.35   1.00         3.15E-05   1.00   1.00   1.00
  𝜃-Ridge, β-Horseshoe          0.11       0.03   0.76   1.00         1.12E-05   0.01   1.00   1.00
  𝜃-Horseshoe, β-Ridge          0.21       0.02   0.49   1.00         6.79E-06   0.03   0.99   1.00
  𝜃-Horseshoe, β-Lasso          0.42       0.02   0.58   1.00         3.81E-05   0.02   1.00   1.00
  𝜃-Horseshoe, β-Horseshoe      0.08       0.01   0.76   1.00         1.51E-06   0.00   1.00   1.00

Posterior medians of κii and κij are calculated at the product level and then summarized for each model. There is maximum shrinkage when κ = 1 and no shrinkage when κ = 0.
Across models, we find that hierarchical priors impose heavier shrinkage, on average, than standard global-local priors—especially for own elasticities. For example, the average shrinkage factor κii is 0.19 for the β-Horseshoe model but 0.76 for the (𝜃-Ridge, β-Horseshoe) model. If the shrinkage points are misspecified (as appears to be the case for standard priors with a mean fixed at zero), then the prior variances will need to get larger to accommodate deviations from zero. Since the hierarchical priors center the product-level elasticities around more reasonable values, the prior variance can get smaller and shrinkage will “kick in” for noisy estimates. For differences in shrinkage factors across product categories, see Appendix C.4 where we plot the empirical CDF of posterior medians of κii across both categories and models. We find appreciable variation in the strength of category-level shrinkage. While there is variation across models, we find that shrinkage tends to be heaviest in categories like BEER/ALE and CARBONATED BEVERAGES, and lightest in FZ PIZZA and SALTY SNACKS.

6.4 Discussion

So far, we have provided evidence that hierarchical shrinkage priors can lead to more negative and precisely estimated own elasticities, larger and more dispersed cross elasticities, and improvements in out-of-sample demand predictions. We close this section with a discussion of implications for analysts and managers. We first discuss the nature of price variation in retail markets, and show how judicious prior choices can help mitigate challenges of estimating price elasticities with weakly informative data. We then discuss the benefits of imposing hierarchical shrinkage when producing competitive maps used for market structure analysis.

6.4.1 Retail prices and prior regularization

In the era of big data it may be tempting to believe that prior specification is less of a concern as the data will simply overwhelm the prior with a sufficiently large sample size. We argue that this view is misguided in a demand estimation context because more data does not necessarily imply more variation in key demand shifters.4 In many retail markets, for example, prices are notoriously “sticky” (Bils & Klenow, 2004) and exhibit limited variation over time—a feature that need not dissipate as more data are collected. With limited variation in prices, price coefficients will in turn be subject to heavy regularization. Analysts interested in using observational data to estimate price elasticities will almost always face a problem of weakly informative data, calling for more judicious prior choices.
We illustrate the interplay between weakly informative data and regularization in Fig. 5. Each scatter plot compares the own elasticities from a sparse prior (x-axis) to the own elasticities from a hierarchical prior (y-axis). For example, the top left corner compares estimates from the sparse β-Ridge prior to estimates from the hierarchical (𝜃-Ridge, β-Ridge) prior. Each point is one of 275 products colored according to the strength of the signal provided by its price vector. Our specific measure of signal strength for product j is the standard deviation of log prices across weeks in the training data: \(\text {SD}_{j}=\text {SD}(\log p_{j1},\dotsc ,\log p_{jn})\). Red triangles represent products in the bottom quartile of this distribution (i.e., products with relatively limited price variation), green squares represent products in the top quartile of this distribution, and grey circles represent products in the second and third quartiles.
We find that most of the green squares fall along the diagonal line, implying that the own elasticity estimates produced by sparse and hierarchical priors are similar for goods with relatively high price variation. In contrast, most of the red triangles fall below the diagonal line, where estimates from the sparse prior are greater than estimates from the hierarchical prior. This set of products illustrates the value of hierarchical shrinkage: in the absence of a strong signal in the data, we can produce more economically reasonable estimates by imposing shrinkage towards higher-level elasticities rather than zero.

6.4.2 Interpretable market structure

A second implication of hierarchical, non-sparse shrinkage is that it can allow for more interpretable market structure analysis. Marketing researchers have a long history of using competitive maps—i.e., visual representations of product or brand competition—for understanding customer preferences and guiding managerial decisions related to brand positioning (Elrod, 1988; Allenby, 1989; Rutz & Sonnier, 2011), product line management (Chintagunta, 1998), and assortment planning (Sinha et al., 2013). One challenge in producing competitive maps for large assortments is that it is hard to flexibly estimate demand at scale.5 As we have shown, log-linear demand models with appropriate forms of regularization can produce estimates of large elasticity matrices. Do different priors lead to appreciably different inferences about competition and market structure?
Figure 6 plots elasticity-based competitive maps across all nine prior specifications. Each map is made by using a two-dimensional t-SNE projection (van der Maaten & Hinton, 2008) of the estimated price elasticity matrix. Each point on the map is one of 275 products and distances between products reflect degrees of substitution, with more related products appearing closer together. Points are colored based on category labels to help interpret competitive boundaries. We find substantial differences in the interpretability of competitive maps across sparse and hierarchical prior specifications. The maps produced by standard shrinkage priors do not, at first glance, reflect any deeper structure of demand. In contrast, the maps produced by hierarchical shrinkage priors feature a co-clustering of goods belonging to the same or similar categories. For example, in the (𝜃-Ridge, β-Ridge) map, most beverages appear on the right-hand side and many beers and carbonated beverages are located near clusters of salty snacks, frozen dinners, and frozen pizza.
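As a sketch, a map of this kind can be produced by passing the matrix of posterior mean elasticities to the Rtsne package; the perplexity setting and the object names (B, category) are our own illustrative choices rather than the exact settings behind Fig. 6.
```r
library(Rtsne)

# B: 275 x 275 matrix of posterior mean elasticities; 'category' holds product category labels
set.seed(1)
tsne <- Rtsne(B, dims = 2, perplexity = 30, check_duplicates = FALSE)
plot(tsne$Y, col = as.integer(factor(category)), pch = 19,
     xlab = "t-SNE dimension 1", ylab = "t-SNE dimension 2")
```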
In Appendix C.5, we show a subcategory-level version of Fig. 6 for two selected models to show how hierarchical shrinkage leads to more coherent subcategory clustering behavior. We also provide zoomed-in plots for the SALTY SNACKS category to examine the ways in which brands co-cluster. In the hierarchical shrinkage map, we find that category and subcategory labels tend to have more influence over clustering outcomes than brand labels. For instance, we do not find that brands co-cluster across subcategories (as would be expected if umbrella branding or quality tiers induced strong cross-subcategory substitution, for example). Instead, we find that clusters tend to include all brands within the same subcategory. Relative to hierarchical shrinkage maps, standard shrinkage maps exhibit no discernible clustering behavior across brands or subcategories. Overall, we believe that improvements in map interpretability can lead to better inferences about product competition and more valuable insights for decision-making.

7 Conclusion

This paper studies shrinkage priors for high-dimensional demand systems. We propose a hierarchical extension to the class of global-local priors where prior means and variances are parameterized by product classification trees that commonly accompany retail scanner data sets. The principal benefit is that the price elasticity between two goods will be shrunk towards higher-level category elasticities rather than zero. We also formally examine the shrinkage properties of hierarchical global-local priors and show that including a heavy tailed mixing distribution at any level of the tree is sufficient for imposing heavy shrinkage for product-level elasticities.
We apply our hierarchical priors to the elasticity parameters of a log-linear demand model in which store-level sales are regressed on a high-dimensional vector of prices as well as seasonal trends and other product controls. We propose a simple modified version of the fast sampling algorithm in Bhattacharya et al. (2016) to help alleviate the typical computational bottleneck that arises when inverting the posterior precision matrix. We then use both simulated data and retail scanner data to show the value of non-sparse shrinkage. Our simulation experiments shed light on the situations in which different types of shrinkage are most/least appropriate. Hierarchical priors can offer significant gains when the true elasticity matrix is dense and inherits its structure from a product classification tree. Moreover, hierarchical priors are fairly robust to tree misspecification and are still competitive with standard sparsity priors when the parameters are actually sparse.
Our empirical application uses two years of data to estimate 75,625 price elasticities associated with an assortment of 275 goods that span 28 subcategories and nine categories. We find that the hierarchical prior with ridge shrinkage throughout the tree provides the best predictive performance. We also find that hierarchical priors provide much more reasonable estimates of own elasticities than the standard shrinkage priors which impose a fixed shrinkage point at zero. The hierarchical global-local structure also allows us to learn about within and cross-subcategory and category elasticities, which is useful for identifying boundaries of competition. Our results provide evidence that, in the absence of a strong signal in the data, hierarchical shrinkage can lead to improvements in different aspects of demand estimation relative to “off-the-shelf” regularization. More generally, we believe our work highlights the importance of judicious priors in high-dimensional estimation problems.

Declarations

Conflict of Interests

The authors have no competing interests to declare that are relevant to the content of this article.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix

Appendix A: Posterior computation with hierarchical global-local priors

A.1: Full Conditionals

Means
The higher-level elasticity parameters 𝜃π(ij|) are assumed to follow a normal prior and therefore have a normal full conditional distribution: \(\theta _{\pi (ij|\ell )}|\text {else} \sim \text {N}(\tilde {\theta }_{\pi (ij|\ell )},V_{\pi (ij|\ell )})\), where
$$ \begin{array}{@{}rcl@{}} \tilde{\theta}_{\pi(ij|\ell)} &=& V_{\pi(ij|\ell)}\left( {\sum}_{a\in\mathcal{C}_{i}^{\ell}} {\sum}_{b\in\mathcal{C}_{j}^{\ell}}\frac{\theta_{\pi(ab|\ell-1)}}{\tau_{\ell-1}^{2}{\Psi}_{\pi(ab|\ell-1)} } + \frac{\theta_{\pi(ij|\ell+1)}}{\tau_{\ell}^{2}{\Psi}_{\pi(ij|\ell)} }\right),\\ V_{\pi(ij|\ell)}&=&\left( {\sum}_{a\in\mathcal{C}_{i}^{\ell}}{\sum}_{b\in\mathcal{C}_{j}^{\ell}} \frac{1}{\tau_{\ell-1}^{2}{\Psi}_{\pi(ab|\ell-1)}} + \frac{1}{\tau_{\ell}^{2}{\Psi}_{\pi(ij|\ell)}}\right)^{-1}. \end{array} $$
Here \(\mathcal {C}_{i}^{\ell }=\left \{i^{\prime }\in \{1,\dotsc ,n_{\ell }\}:\pi (i^{\prime }j|\ell )=\pi (ij|\ell )\right \}\) and \(\mathcal {C}_{j}^{\ell }=\left \{j^{\prime }\in \{1,\dotsc ,n_{\ell }\}:\pi (ij^{\prime }|\ell )=\pi (ij|\ell )\right \}\) each represent sets of group indices that share a common ancestry with group i or j at level .
Local variances
First rewrite Eq. 6 as \({\Psi }_{\pi (ij|\ell )}=\lambda _{\pi (ij|\ell )}^{2} {\prod }_{s=\ell +1}^{L-1} \lambda _{\pi (ij|s)}^{2}=\lambda _{\pi (ij|\ell )}^{2} {\Psi }_{\pi (ij|\ell +1)}\) and so at level , all \(\lambda _{\pi (ij|\ell )}^{2}\) terms need to be sampled from their full conditional distribution given Ψπ(ij|+ 1). Here we focus on the case where \(\lambda _{\pi (ij|\ell )}\sim \text {C}^{+}(0,1)\), which corresponds to horseshoe shrinkage. We exploit the following scale mixture representation of the half-Cauchy prior (Makalic and Schmidt, 2015): if \(\lambda ^{2}|\zeta \sim \text {IG}(1/2,1/\zeta )\) and \(\zeta \sim \text {IG}(1/2,1)\), then \(\lambda \sim \text {C}^{+}(0,1)\).
$$ \begin{array}{@{}rcl@{}} \lambda_{\pi(ij|\ell)}^{2}|\zeta_{\pi(ij|\ell)},\text{else} &\sim& \text{IG}\left( \frac{1}{2} + \frac{1}{2}{\sum}_{s\leq\ell} {n_{s}^{2}},\frac{1}{\zeta_{\pi(ij|\ell)}}+ \frac{(\theta_{\pi(ij|\ell)}-\theta_{\pi(ij|\ell+1)})^{2}}{2\tau_{\ell}^{2}{\Psi}_{\pi(ij|\ell+1)}}\right.\\ &&\left.+\frac{1}{2} {\sum}_{s<\ell} {\sum}_{a\in\mathcal{C}_{i}^{s}} {\sum}_{b\in\mathcal{C}_{j}^{s}} \frac{(\theta_{\pi(ab|s)}-\theta_{\pi(ab|s+1)})^{2}}{\tau_{s}^{2}{\Psi}_{\pi(ab|s+1)}} \right)\\ \zeta_{\pi(ij|\ell)}|\lambda_{\pi(ij|\ell)}^{2} &\sim& \text{IG}\left( 1,1+\frac{1}{\lambda_{\pi(ij|\ell)}^{2}} \right) \end{array} $$
Global variances
We place a half-Cauchy prior on τ and use the same scale mixture representation outlined above for the local variances.
$$ \begin{array}{@{}rcl@{}} \tau_{\ell}^{2}|\zeta_{\ell},\text{else} &\sim&\text{IG}\left( \frac{n_{\ell}^{2}+1}{2},\frac{1}{\zeta_{\ell}}+\frac{1}{2}{\sum}_{a=1}^{n_{\ell}}{\sum}_{b=1}^{n_{\ell}}\frac{\big(\theta_{\pi(ab|\ell)}-\theta_{\pi(ab|\ell+1)}\big)^{2}}{{\Psi}_{\pi(ab|\ell)}} \right)\\ \zeta_{\ell}|\tau_{\ell}^{2} &\sim& \text{IG}\left( 1,1+\frac{1}{\tau_{\ell}^{2}} \right)\\ \end{array} $$
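The same augmentation gives a simple generic Gibbs step for any half-Cauchy scale in the model. The sketch below shows only the basic building block (a single scale multiplying k squared deviations); it is not the full sampler, and the function names are ours.
```r
# Generic half-Cauchy Gibbs step via the Makalic-Schmidt augmentation (a sketch):
# lambda ~ C+(0,1) is written as lambda^2 | zeta ~ IG(1/2, 1/zeta), zeta ~ IG(1/2, 1).
rinvgamma <- function(shape, rate) 1 / rgamma(1, shape = shape, rate = rate)

update_scale <- function(ss, k, zeta) {
  # ss: sum of squared scaled deviations multiplying lambda^2; k: number of such terms
  lambda2 <- rinvgamma(shape = (k + 1) / 2, rate = 1 / zeta + ss / 2)
  zeta    <- rinvgamma(shape = 1, rate = 1 + 1 / lambda2)
  c(lambda2 = lambda2, zeta = zeta)
}
```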

A.2: Fast sampling algorithm

We show that our version of the Bhattacharya et al. (2016) fast sampling algorithm, as outlined in Section 4.2, produces a draw of β with the correct posterior mean and covariance matrix:
$$ \begin{array}{@{}rcl@{}} \mathbb{E}(\beta|\text{data}) &=&(\tilde{\text{X}}^{\prime}\tilde{\text{X}}+{\Lambda}_{*}^{-1})^{-1}(\tilde{\text{X}}^{\prime}\tilde{y}+{\Lambda}_{*}^{-1}\bar{\beta}(\theta))\\\text{Cov}(\beta|\text{data}) &=&(\tilde{\text{X}}^{\prime}\tilde{\text{X}}+{\Lambda}_{*}^{-1})^{-1} \end{array} $$
where \(\bar{\beta}(\theta)\) denotes the prior mean of β, which depends on the higher-level elasticity parameters 𝜃. Note that our proofs below rely on the Woodbury matrix identity: \((A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + VA^{-1}U)^{-1}VA^{-1}\).
$$ \begin{array}{@{}rcl@{}} \mathbb{E}(\beta|\text{data}) &=& \mathbb{E}(u) + {\Lambda}_{*} \tilde{\text{X}}^{\prime}\mathbb{E}(w)\\ &=& \mathbb{E}(u) + {\Lambda}_{*} \tilde{\text{X}}^{\prime}\mathbb{E}\left( (\tilde{\text{X}}{\Lambda}_{*} \tilde{\text{X}}^{\prime}+I)^{-1}(\tilde{y}-v)\right)\\ &=& \mathbb{E}(u) + {\Lambda}_{*} \tilde{\text{X}}^{\prime}(\tilde{\text{X}}{\Lambda}_{*} \tilde{\text{X}}^{\prime}+I)^{-1}(\tilde{y}-\mathbb{E}(v))\\ &=& \mathbb{E}(u) + {\Lambda}_{*} \tilde{\text{X}}^{\prime}(\tilde{\text{X}}{\Lambda}_{*} \tilde{\text{X}}^{\prime}+I)^{-1}(\tilde{y}-\tilde{\text{X}}\mathbb{E}(u)-\mathbb{E}(\delta))\\ &=& {\Lambda}_{*} \tilde{\text{X}}^{\prime}(\tilde{\text{X}}{\Lambda}_{*} \tilde{\text{X}}^{\prime}+I)^{-1}\tilde{y} + \bar{\beta}(\theta) - {\Lambda}_{*} \tilde{\text{X}}^{\prime}(\tilde{\text{X}}{\Lambda}_{*} \tilde{\text{X}}^{\prime}+I)^{-1}\tilde{\text{X}}\bar{\beta}(\theta) \\ &=& (\tilde{\text{X}}^{\prime}\tilde{\text{X}}+{\Lambda}_{*}^{-1})^{-1}\tilde{\text{X}}^{\prime}\tilde{y} + \left( I - {\Lambda}_{*} \tilde{\text{X}}^{\prime}(\tilde{\text{X}}{\Lambda}_{*} \tilde{\text{X}}^{\prime}+I)^{-1}\tilde{\text{X}}\right)\bar{\beta}(\theta) \ \text{ by Woodbury}\\ &=& (\tilde{\text{X}}^{\prime}\tilde{\text{X}}+{\Lambda}_{*}^{-1})^{-1}\tilde{\text{X}}^{\prime}\tilde{y} + \left( I - {\Lambda}_{*} \tilde{\text{X}}^{\prime}(\tilde{\text{X}}{\Lambda}_{*} \tilde{\text{X}}^{\prime}+I)^{-1}\tilde{\text{X}}\right){\Lambda}_{*}{\Lambda}_{*}^{-1}\bar{\beta}(\theta) \\ &=& (\tilde{\text{X}}^{\prime}\tilde{\text{X}}+{\Lambda}_{*}^{-1})^{-1}\tilde{\text{X}}^{\prime}\tilde{y} + \left( {\Lambda}_{*} - {\Lambda}_{*} \tilde{\text{X}}^{\prime}(\tilde{\text{X}}{\Lambda}_{*} \tilde{\text{X}}^{\prime}+I)^{-1}\tilde{\text{X}}{\Lambda}_{*}\right){\Lambda}_{*}^{-1}\bar{\beta}(\theta)\\ &=& (\tilde{\text{X}}^{\prime}\tilde{\text{X}}+{\Lambda}_{*}^{-1})^{-1}\tilde{\text{X}}^{\prime}\tilde{y} + (\tilde{\text{X}}^{\prime}\tilde{\text{X}}+{\Lambda}_{*}^{-1})^{-1}{\Lambda}_{*}^{-1}\bar{\beta}(\theta) \ \text{ by Woodbury}\\ &=& (\tilde{\text{X}}^{\prime}\tilde{\text{X}}+{\Lambda}_{*}^{-1})^{-1}(\tilde{\text{X}}^{\prime}\tilde{y}+{\Lambda}_{*}^{-1}\bar{\beta}(\theta)) \end{array} $$
For the covariance, write \(M=(\tilde{\text{X}}{\Lambda}_{*}\tilde{\text{X}}^{\prime}+I)^{-1}\) so that \(w=M(\tilde{y}-v)\), and note that \(\text{Var}(u)={\Lambda}_{*}\), \(\text{Var}(v)=\tilde{\text{X}}{\Lambda}_{*}\tilde{\text{X}}^{\prime}+I=M^{-1}\), and \(\text{Cov}(u,v)={\Lambda}_{*}\tilde{\text{X}}^{\prime}\). Then
$$ \begin{array}{@{}rcl@{}} \text{Cov}(\beta|\text{data}) &=& \text{Cov}(u + {\Lambda}_{*}\tilde{\text{X}}^{\prime}w)\\ &=& \text{Cov}\left( u + {\Lambda}_{*}\tilde{\text{X}}^{\prime}M(\tilde{y}-v)\right)\\ &=& \text{Cov}(u - {\Lambda}_{*}\tilde{\text{X}}^{\prime}M v)\\ &=& \text{Var}(u) + {\Lambda}_{*}\tilde{\text{X}}^{\prime}M \text{Var}(v) M \tilde{\text{X}}{\Lambda}_{*} - \text{Cov}(u,v)M \tilde{\text{X}}{\Lambda}_{*} - {\Lambda}_{*}\tilde{\text{X}}^{\prime}M \text{Cov}(v,u)\\ &=& {\Lambda}_{*} + {\Lambda}_{*}\tilde{\text{X}}^{\prime}M M^{-1} M \tilde{\text{X}}{\Lambda}_{*} - {\Lambda}_{*}\tilde{\text{X}}^{\prime}M \tilde{\text{X}}{\Lambda}_{*} - {\Lambda}_{*}\tilde{\text{X}}^{\prime}M \tilde{\text{X}}{\Lambda}_{*} \\ &=& {\Lambda}_{*} - {\Lambda}_{*}\tilde{\text{X}}^{\prime}(\tilde{\text{X}}{\Lambda}_{*} \tilde{\text{X}}^{\prime}+I)^{-1}\tilde{\text{X}}{\Lambda}_{*} \\ &=& (\tilde{\text{X}}^{\prime}\tilde{\text{X}}+{\Lambda}_{*}^{-1})^{-1} \ \text{ by Woodbury} \end{array} $$
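For reference, the construction verified above can be written as a short sketch that draws β without forming or inverting the q × q posterior precision matrix. The function and argument names are ours; the sketch assumes the standardized system with unit error variance, i.e., \(\tilde{y}=\tilde{\text{X}}\beta+\varepsilon\) with \(\varepsilon\sim\text{N}(0,I)\).
```r
# Sketch of the modified fast sampler: draws beta ~ N((X'X + L^-1)^-1 (X'y + L^-1 m),
# (X'X + L^-1)^-1), where L = Lambda_* is diagonal and m = beta_bar(theta).
fast_draw <- function(X, y, L, m) {        # X: n x q, y: length n, L: length-q prior variances
  n <- nrow(X); q <- ncol(X)
  u <- m + sqrt(L) * rnorm(q)              # u ~ N(m, diag(L))
  v <- drop(X %*% u) + rnorm(n)            # v = X u + delta, delta ~ N(0, I_n)
  W <- X %*% (L * t(X)) + diag(n)          # X diag(L) X' + I, only an n x n system
  w <- solve(W, y - v)
  drop(u + L * crossprod(X, w))            # beta = u + diag(L) X' w
}
```
The key point is that the only linear system solved is n x n, which is cheap when n is much smaller than the number of elasticity parameters q.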

Appendix B: Additional simulation experiments

B.1: Parameter recovery

The goal of this section is to show that the MCMC algorithms outlined in Section 4 can accurately recover model parameters at each level of the tree. We focus on one of the 25 data sets generated by specification (I) with n = 50 and p = 100 (for details, see the discussion in Section 5). The tree corresponding to this data set has 5 categories, 10 subcategories, and 100 products. There are then 5² = 25 category elasticities 𝜃π(ij|2), 10² = 100 subcategory elasticities 𝜃π(ij|1), and 100² = 10,000 product elasticities βij.
For simplicity, we report estimates from two models: the (𝜃-Ridge, β-Ridge) hierarchical prior and the standard (β-Ridge) prior. Figure 7 plots the posterior mean of each elasticity parameter against its true value. The vertical lines correspond to 95% credible intervals. The first three figures report estimates from the hierarchical shrinkage prior and the last figure reports estimates from the standard shrinkage prior. We find that all 95% credible intervals cover true values for the category and subcategory-level elasticities, and nearly all credible intervals cover true values for product-level elasticities.
The benefits of hierarchical shrinkage can be seen by comparing the estimates of product-level elasticities βij in the last two figures. With high-dimensional p > n data, the design matrix will be rank deficient and the regularization imposed by the prior will bite. The last figure on the right shows that default prior assumptions can produce very noisy and biased estimates, especially for elasticities that are larger in magnitude. In comparison, our hierarchical priors shrink noisy estimates towards more reasonable values by pooling information across products. Overall, this further emphasizes the importance of judicious prior choices in high-dimensional settings.

B.2: Tree misspecification

The simulation experiments in Section 5 assume that the tree used to construct the hierarchical shrinkage prior matches the “true” tree used to generate the data. When working with observational data, the existence or structure of a true tree is not known so there may be a concern of tree misspecification. In the context of scanner data, the reported product classification tree could be misspecified if it does not reflect consumers’ true boundaries of substitution across product groups. For example, in the IRI scanner data set “Ready-to-Eat Cereal” is the only subcategory within the CEREAL category. This part of the tree may be too coarse if there is a more granular partitioning of cereal brands, flavors, and qualities that better aligns with consumer cross-product substitution.
In this section we use the same 25 (n = 50, p = 100) simulated data sets from specification (I) in Section 5 to explore consequences of tree misspecification. Let Ttrue denote the tree used to simulate the data. Ttrue has five equal-sized categories on the top level (each having two child subcategories), 10 equal-sized subcategories on the middle level (each having 10 child products), and 100 products on the bottom level. In addition to standard shrinkage priors and hierarchical priors based on Ttrue, we also fit hierarchical priors based on the following three misspecified trees.
  • In \(T_{A}^{\prime }\), the ordering of group nodes on the top and middle levels is kept the same, but the bottom-level nodes (the “products”) are permuted. In Ttrue, products fill up one subcategory at a time—i.e., goods \(1,\dotsc ,10\) are assigned to subcategory 1, goods \(11, \dotsc ,20\) are assigned to subcategory 2, and so on. We generate \(T_{A}^{\prime }\) by assigning the first product to subcategory 1, the second product to subcategory 2, and so on until each subcategory has one product. This process repeats until all products have been assigned to a subcategory. The resulting tree shares no common structure with Ttrue.
  • In \(T_{B}^{\prime }\), half of the tree is misspecified (in the same way as \(T_{A}^{\prime }\)) while the other half matches Ttrue.
  • In \(T_{C}^{\prime }\), half of the tree is misspecified as an overly granular version of Ttrue (each group is split into two) while the other half matches Ttrue.
Examples of each form of misspecification are provided in Fig. 8 and simulation results are reported in Table 8. For each misspecified tree, we also report the distance between \(T^{\prime }\) and Ttrue using the adjusted Rand index (ARI). Similarity is maximized when \(ARI(T^{\prime },T_{true})=1\). Our results indicate that hierarchical shrinkage can still provide improved estimates over standard shrinkage methods—and is never worse than standard shrinkage methods—even when the tree is misspecified. However, the magnitude of these gains vanishes as the degree of misspecification grows. If an analyst is concerned with the threat of misspecification, then different tree structures could be fit to the data and compared based on out-of-sample fit metrics.
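The ARI between two candidate trees can be computed from their product-to-group membership vectors at a given level. The sketch below compares the level-1 memberships under Ttrue and the fully permuted tree \(T_{A}^{\prime }\); the package choice is ours.
```r
library(mclust)

# Subcategory memberships of the 100 products under T_true and the permuted tree T_A'
true_sub <- rep(1:10, each = 10)    # products fill one subcategory at a time
perm_sub <- rep(1:10, times = 10)   # products assigned round-robin across subcategories
adjustedRandIndex(true_sub, perm_sub)   # close to zero, as in panel A of Table 8
```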
Table 8
Effects of tree misspecification

                                Estimation RMSE           Correct Signs
                                (I)     (II)    (III)     (I)    (II)   (III)
A) Hierarchical Shrinkage with \(T_{A}^{\prime }\), \(ARI(T_{A}^{\prime },T_{true})=-0.02\)
  𝜃-Ridge, β-Ridge              2.482   2.340   1.207     0.77   0.79   0.88
  𝜃-Ridge, β-Lasso              2.552   2.429   1.066     0.77   0.77   0.89
  𝜃-Ridge, β-Horseshoe          2.800   2.697   0.951     0.73   0.73   0.89
  𝜃-Horseshoe, β-Ridge          2.533   2.399   1.220     0.76   0.77   0.86
  𝜃-Horseshoe, β-Lasso          2.587   2.468   1.088     0.76   0.76   0.88
  𝜃-Horseshoe, β-Horseshoe      2.815   2.713   0.962     0.72   0.72   0.88
B) Hierarchical Shrinkage with \(T_{B}^{\prime }\), \(ARI(T_{B}^{\prime },T_{true})=0.55\)
  𝜃-Ridge, β-Ridge              1.808   1.548   0.950     0.87   0.89   0.91
  𝜃-Ridge, β-Lasso              1.820   1.538   0.866     0.86   0.89   0.92
  𝜃-Ridge, β-Horseshoe          2.016   1.738   0.783     0.84   0.86   0.92
  𝜃-Horseshoe, β-Ridge          1.704   1.340   0.711     0.87   0.91   0.93
  𝜃-Horseshoe, β-Lasso          1.765   1.420   0.678     0.86   0.90   0.94
  𝜃-Horseshoe, β-Horseshoe      1.958   1.645   0.668     0.84   0.87   0.93
C) Hierarchical Shrinkage with \(T_{C}^{\prime }\), \(ARI(T_{C}^{\prime },T_{true})=0.63\)
  𝜃-Ridge, β-Ridge              1.047   0.537   0.522     0.93   0.97   0.94
  𝜃-Ridge, β-Lasso              1.060   0.471   0.459     0.93   0.98   0.95
  𝜃-Ridge, β-Horseshoe          1.138   0.394   0.408     0.92   0.98   0.95
  𝜃-Horseshoe, β-Ridge          1.064   0.529   0.193     0.93   0.97   0.98
  𝜃-Horseshoe, β-Lasso          1.073   0.469   0.194     0.93   0.98   0.98
  𝜃-Horseshoe, β-Horseshoe      1.147   0.394   0.261     0.92   0.98   0.97
D) Hierarchical Shrinkage with Ttrue, ARI(Ttrue,Ttrue) = 1
  𝜃-Ridge, β-Ridge              1.033   0.530   0.515     0.94   0.97   0.94
  𝜃-Ridge, β-Lasso              1.047   0.461   0.451     0.93   0.98   0.95
  𝜃-Ridge, β-Horseshoe          1.126   0.385   0.407     0.92   0.98   0.95
  𝜃-Horseshoe, β-Ridge          1.044   0.526   0.190     0.93   0.98   0.98
  𝜃-Horseshoe, β-Lasso          1.053   0.461   0.192     0.93   0.98   0.98
  𝜃-Horseshoe, β-Horseshoe      1.130   0.386   0.270     0.92   0.98   0.97
E) Standard Shrinkage
  β-Ridge                       2.541   2.375   1.219     0.77   0.78   0.87
  β-Lasso                       2.622   2.473   1.079     0.76   0.77   0.89
  β-Horseshoe                   2.879   2.745   0.969     0.73   0.74   0.88

The table reports average fit statistics across 25 simulated data sets from specification (I) in Section 5 with n = 50 and p = 100. The first three panels show the estimation results from hierarchical priors with misspecified trees. In each case, we use the adjusted Rand index (ARI) to report the similarity between the misspecified tree \(T^{\prime }\) and Ttrue. The two benchmark cases from Section 5 (i.e., hierarchical priors based on Ttrue and standard shrinkage priors) are shown in the last two panels.

Appendix C: Additional empirical details and results

C.1: Summary of Priors and Hyperparameter Specifications

Standard shrinkage priors
  • Cross elasticities
    $$ {\beta}_{ij} \sim \text{N}\left( 0,{\lambda}_{ij}^{2}{\tau}_{\text{cross}}^{2}\right),\quad {\tau}_{\text{cross}}\sim{\text{C}}^{+}(0,1) $$
  • Own elasticities
    $$ {\beta}_{ii} \sim \text{N}\left( 0,{\lambda}_{ii}^{2}{\tau}_{\text{own}}^{2}\right), \quad {\tau}_{\text{own}}\sim{\text{C}}^{+}(0,1) $$
  • Control coefficient vector (product intercepts, seasonal variables, promotional variables)
    $$ {\phi}_{i} \sim \text{N}(0,{10}^{2}I_{d}) $$
  • Observational error variance
    $$ {{\sigma}_{i}^{2}} \sim \text{IG}(5,5) $$
Hierarchical shrinkage priors
  • Cross elasticities
    $$ \begin{array}{@{}rcl@{}} {\beta}_{ij} \sim \text{N}\left( {\theta}_{\pi(ij|1)},{\Psi}_{ij}{\tau}_{\beta\text{cross}}^{2}\right), &\quad& {\tau}_{\beta\text{cross}}\sim{\text{C}}^{+}(0,1)\\ {\theta}_{\pi(ij|1)} \sim \text{N}\left( {\theta}_{\pi(ij|2)},{\Psi}_{\pi(ij|1)}{\tau}_{1\text{cross}}^{2}\right), &\quad& {\tau}_{1\text{cross}}\sim{\text{C}}^{+}(0,1)\\ {\theta}_{\pi(ij|2)} \sim \text{N}\left( {\bar{\theta}}_{\text{cross}},{\Psi}_{\pi(ij|2)}{\tau}_{2\text{cross}}^{2}\right), &\quad& {\tau}_{2\text{cross}}\sim{\text{C}}^{+}(0,1)\\ {\bar{\theta}}_{\text{cross}} \sim \text{N}(0,1)&& \end{array} $$
  • Own elasticities
    $$ \begin{array}{@{}rcl@{}} {\beta}_{ii} \sim \text{N}\left( {\theta}_{\pi(ii|1)},{\Psi}_{ii}{\tau}_{\beta\text{own}}^{2}\right), &\quad& {\tau}_{\beta\text{own}}\sim{\text{C}}^{+}(0,1)\\ {\theta}_{\pi(ii|1)} \sim \text{N}\left( {\theta}_{\pi(ii|2)},{\Psi}_{\pi(ii|1)}{\tau}_{1\text{own}}^{2}\right), &\quad& {\tau}_{1\text{own}}\sim{\text{C}}^{+}(0,1)\\ {\theta}_{\pi(ii|2)} \sim \text{N}\left( {\bar{\theta}}_{\text{own}},{\Psi}_{\pi(ii|2)}{\tau}_{2\text{own}}^{2}\right), &\quad& {\tau}_{2\text{own}}\sim{\text{C}}^{+}(0,1)\\ {\bar{\theta}}_{\text{own}} \sim \text{N}(0,1)&& \end{array} $$
  • Control coefficient vector (product intercepts, seasonal variables, promotional variables)
    $$ {\phi}_{i} \sim \text{N}(0,{10}^{2}I_{d}) $$
  • Observational error variance
    $$ {{\sigma}_{i}^{2}} \sim \text{IG}(5,5) $$

C.2: Mixing diagnostics

C.3: Distributions of Price Elasticity and Promotional Effect Estimates

C.4: Shrinkage Factors by Category

C.5: Competitive maps by category

Footnotes
1
There is a closely related literature on modeling purchase data alone (Jacobs et al., 2016; Kumar et al., 2020; Jacobs et al., 2021) which uses similar methods but abstracts away from demand estimation in the sense that prices do not enter the modeling framework.
 
2
Many existing large-scale methods still rely on assumptions of discrete choice. For example, Donnelly et al. (2021) and Gabel and Timoshenko (2021) model purchases as a two-stage process: first category incidence and then discrete choice among products within each of the chosen categories. Ruiz et al. (2020) propose a behavioral model of sequential discrete choice: a consumer decides which item to add to their basket (or to stop shopping) conditional on previously chosen items.
 
3
Note that some of the earliest examples of Bayesian shrinkage arise under the model in Eq. 2 with G being discrete. For example, spike-and-slab priors (Mitchell & Beauchamp, 1988; George & McCulloch, 1993) arise under a two-point mixture distribution \(G(\lambda ^{2})=w\delta _{\lambda ^{2}=1} + (1-w)\delta _{\lambda ^{2}=0}\).
 
4
We abstract away from the discussion of “clean” variation and price endogeneity. In our empirical application, we only rely on temporal variation rather than cross-market variation in prices which is the typical threat to exogeneity (Rossi, 2014).
 
5
Visualizing market structure for large assortments remains an active area of research. In recent years, several papers have bypassed the challenge of estimating demand at scale and have instead developed visualization methods that either use purchase outcome data alone (France & Ghose, 2016; Gabel et al., 2019), or use auxiliary data such as data on consumer search (Kim et al., 2011; Ringel & Skiera, 2016) or online text reviews (Netzer et al., 2012).
 
Literature
Allenby, G.M. (1989). A unified approach to identifying, estimating and testing demand structures with aggregate scanner data. Marketing Science, 8(3), 265–280.
Amano, T., Rhodes, A., & Seiler, S. (2019). Large-scale demand estimation with search data. Working Paper.
Bajari, P., Nekipelov, D., Ryan, S.P., & Yang, M. (2015). Demand estimation with machine learning and model combination. NBER Working Paper No. 20955.
Barndorff-Nielsen, O.E., Kent, J.T., & Sørensen, M. (1982). Normal variance-mean mixtures and z distributions. International Statistical Review, 50, 145–159.
Bhadra, A., Datta, J., Polson, N.G., & Willard, B. (2016). Default Bayesian analysis with global-local shrinkage priors. Biometrika, 103(4), 955–969.
Bhadra, A., Datta, J., Polson, N.G., & Willard, B.T. (2019). Lasso meets horseshoe: A survey. Statistical Science, 34(3), 405–427.
Bhattacharya, A., Chakraborty, A., & Mallick, B.K. (2016). Fast sampling with Gaussian scale mixture priors in high-dimensional regression. Biometrika, 103(4), 985–991.
Bien, J., Taylor, J., & Tibshirani, R. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41(3), 1111–1141.
Bils, M., & Klenow, P.J. (2004). Some evidence on the importance of sticky prices. Journal of Political Economy, 112(5), 947–985.
Bingham, N.H., Goldie, C.M., & Teugels, J.L. (1987). Regular Variation. Cambridge: Cambridge University Press.
Blattberg, R.C., & George, E.I. (1991). Shrinkage estimation of price and promotional elasticities: Seemingly unrelated equations. Journal of the American Statistical Association, 86(414), 304–315.
Boatwright, P., McCulloch, R., & Rossi, P. (1999). Account-level modeling for trade promotion: An application of a constrained parameter hierarchical model. Journal of the American Statistical Association, 94(448), 1063–1073.
Bronnenberg, B.J., Kruger, M.W., & Mela, C.F. (2008). Database paper—The IRI marketing data set. Marketing Science, 27(4), 745–748.
Carvalho, C.M., Polson, N.G., & Scott, J.G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2), 465–480.
Chen, F., Liu, X., Proserpio, D., & Troncoso, I. (2020). Product2Vec: Understanding product-level competition using representation learning. Working Paper.
Chintagunta, P.K. (1998). Inertia and variety seeking in a model of brand-purchase timing. Marketing Science, 17(3), 253–270.
Chiong, K.X., & Shum, M. (2018). Random projection estimation of discrete-choice models with large choice sets. Management Science, 65(1), 256–271.
Cline, D.B.H. (1987). Convolutions of distributions with exponential and subexponential tails. Journal of the Australian Mathematical Society: Series A, 43, 347–365.
Datta, J., & Ghosh, J.K. (2013). Asymptotic properties of Bayes risk for the horseshoe prior. Bayesian Analysis, 8(1), 111–132.
Deaton, A., & Muellbauer, J. (1980). Economics and consumer behavior. Cambridge: Cambridge University Press.
DellaVigna, S., & Gentzkow, M. (2019). Uniform pricing in US retail chains. The Quarterly Journal of Economics, 134(4), 2011–2084.
Diewert, W.E. (1974). Applications of duality theory. In Frontiers of Quantitative Economics. Amsterdam: North-Holland Publishing Company.
Donnelly, R., Ruiz, F.R., Blei, D., & Athey, S. (2021). Counterfactual inference for consumer choice across many product categories. Quantitative Marketing and Economics, 19(3), 369–407.
Eddelbuettel, D., & François, R. (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8), 1–18.
Elrod, T. (1988). Choice map: Inferring a product-market map from panel data. Marketing Science, 7(1), 21–40.
Ershov, D., Laliberté, J.-W., Marcoux, M., & Orr, S. (2021). Estimating complementarity with large choice sets: An application to mergers. Working Paper.
Fessler, P., & Kasy, M. (2019). How to use economic theory to improve estimators: Shrinking toward theoretical restrictions. Review of Economics and Statistics, 101(4), 681–698.
France, S.L., & Ghose, S. (2016). An analysis and visualization methodology for identifying and testing market structure. Marketing Science, 35(1), 182–197.
Gabel, S., Guhl, D., & Klapper, D. (2019). P2VMAP: Mapping market structures for large retail assortments. Journal of Marketing Research, 56(4), 557–580.
Gabel, S., & Timoshenko, A. (2021). Product choice with large assortments: A scalable deep-learning model. Management Science, 68(3), 1591–2376.
Gentzkow, M. (2007). Valuing new goods in a model with complementarity: Online newspapers. American Economic Review, 97(3), 713–744.
George, E.I., & McCulloch, R.E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423), 881–889.
Goldman, S.M., & Uzawa, H. (1964). A note on separability in demand analysis. Econometrica, 32(3), 387–398.
Gorman, W. (1971). Two stage budgeting. Unpublished Manuscript, London School of Economics.
Griffin, J., & Brown, P. (2017). Hierarchical shrinkage priors for regression models. Bayesian Analysis, 12(1), 135–159.
Hausman, J., Leonard, G., & Zona, J.D. (1994). Competitive analysis with differentiated products. Annales d’Economie et de Statistique, 34, 159–180.
Hitsch, G.J., Hortaçsu, A., & Lin, X. (2021). Prices and promotions in US retail markets. Quantitative Marketing and Economics, 19(3), 289–368.
Hoerl, A.E., & Kennard, R.W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55–67.
Ishwaran, H., & Rao, J.S. (2005). Spike and slab variable selection: Frequentist and Bayesian strategies. The Annals of Statistics, 33(2), 730–773.
Jacobs, B., Donkers, B., & Fok, D. (2016). Model-based purchase predictions for large assortments. Marketing Science, 35(3), 389–404.
Jacobs, B., Fok, D., & Donkers, B. (2021). Understanding large-scale dynamic purchase behavior. Marketing Science, 40(5), 844–870.
Kim, J.B., Albuquerque, P., & Bronnenberg, B.J. (2011). Mapping online consumer search. Journal of Marketing Research, 48(1), 13–27.
Kumar, M., Eckles, D., & Aral, S. (2020). Scalable bundling via dense product embeddings. Working Paper.
Li, Y., Craig, B.A., & Bhadra, A. (2019). The graphical horseshoe estimator for inverse covariance matrices. Journal of Computational and Graphical Statistics, 28(3), 747–757.
Li, Y., Datta, J., Craig, B.A., & Bhadra, A. (2021). Joint mean-covariance estimation via the horseshoe. Journal of Multivariate Analysis, 183, 104716.
Makalic, E., & Schmidt, D.F. (2015). A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters, 23(1), 179–182.
Manchanda, P., Ansari, A., & Gupta, S. (1999). The “shopping basket”: A model for multicategory purchase incidence decisions. Marketing Science, 18(2), 95–114.
Mehta, N. (2007). Investigating consumers’ purchase incidence and brand choice decisions across multiple product categories: A theoretical and empirical analysis. Marketing Science, 26(2), 196–217.
Mitchell, T.J., & Beauchamp, J.J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404), 1023–1032.
Montgomery, A.L. (1997). Creating micro-marketing pricing strategies using supermarket scanner data. Marketing Science, 16(4), 315–337.
Montgomery, A.L., & Rossi, P.E. (1999). Estimating price elasticities with theory-based priors. Journal of Marketing Research, 36(4), 413–423.
Morozov, I. (2020). Measuring benefits from new products in markets with information frictions. Working Paper.
Netzer, O., Feldman, R., Goldenberg, J., & Fresko, M. (2012). Mine your own business: Market-structure surveillance through text mining. Marketing Science, 31(3), 521–543.
Park, T., & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482), 681–686.
Pollak, R.A., & Wales, T.J. (1992). Demand system specification and estimation. Oxford: Oxford University Press, Inc.
Polson, N.G., & Scott, J.G. (2010). Shrink globally, act locally: Sparse Bayesian regularization and prediction. In J.M. Bernardo, M.J. Bayarri, J.O. Berger, A.P. Dawid, D. Heckerman, A.F.M. Smith, & M. West (Eds.), Bayesian Statistics, Vol. 9. Oxford: Oxford University Press.
Polson, N.G., & Scott, J.G. (2012a). Local shrinkage rules, Lévy processes and regularized regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2), 287–311.
Polson, N.G., & Scott, J.G. (2012b). On the half-Cauchy prior for a global scale parameter. Bayesian Analysis, 7(4), 887–902.
Ringel, D.M., & Skiera, B. (2016). Visualizing asymmetric competition among more than 1,000 products using big search data. Marketing Science, 35(3), 511–534.
Rossi, P.E. (2014). Even the rich can make themselves poor: A critical examination of IV methods in marketing applications. Marketing Science, 33(5), 655–672.
Rossi, P.E., Allenby, G.M., & McCulloch, R. (2005). Bayesian statistics and marketing. New York: Wiley.
Ruiz, F.J.R., Athey, S., & Blei, D.M. (2020). SHOPPER: A probabilistic model of consumer choice with substitutes and complements. The Annals of Applied Statistics, 14(1), 1–27.
Rutz, O.J., & Sonnier, G.P. (2011). The evolution of internal market structure. Marketing Science, 30(2), 274–289.
Semenova, V., Goldman, M., Chernozhukov, V., & Taddy, M. (2021). Estimation and inference about heterogeneous treatment effects in high-dimensional dynamic panels. Working Paper.
Sinha, A., Sahgal, A., & Mathur, S.K. (2013). Practice prize paper—Category optimizer: A dynamic-assortment, new-product-introduction, mix-optimization, and demand-planning system. Marketing Science, 32(2), 221–228.
Smith, A.N., Rossi, P.E., & Allenby, G.M. (2019). Inference for product competition and separable demand. Marketing Science, 38(4), 690–710.
Song, I., & Chintagunta, P.K. (2006). Measuring cross-category price effects with aggregate store data. Management Science, 52(10), 1594–1609.
Thomassen, Ø., Smith, H., Seiler, S., & Schiraldi, P. (2017). Multi-category competition and market power: A model of supermarket pricing. American Economic Review, 107(8), 2308–2351.
van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
Wales, T.J., & Woodland, A.D. (1983). Estimation of consumer demand systems with binding non-negativity constraints. Journal of Econometrics, 21(3), 263–285.