## 1 Introduction

### 1.1 Related literature

^{1}There is also a well-established literature on economic models of multi-category demand (e.g., Manchanda et al., 1999; Song & Chintagunta, 2006; Mehta, 2007; Thomassen et al., 2017; Ershov et al., 2021). However, with the exception of Ershov et al. (2021), microfounded multi-category models become intractable at scale and have only been estimated on data with relatively small choice sets spanning a few categories.

^{2}Log-linear models are also directly parameterized by price elasticities, which is convenient when elasticities are focal objects of interest. That said, log-linear models do not allow for inferences about preference heterogeneity, are not guaranteed to satisfy global regularity conditions, and are less scalable (at least when modeled in a joint, seemingly unrelated regression system) than many existing machine learning methods.

## 2 Background: Regularizing high-dimensional demand

### 2.1 Demand specification

Assume the analyst observes unit sales q_{it} and price p_{it} for products \(i=1,\dotsc ,p\) across weeks \(t=1,\dotsc ,n\). In a log-linear demand system, the log of unit sales of product i at time t is regressed on its own log price, the log prices of all other products, and a vector of controls which can include product intercepts, seasonal trends, and promotional activity:

\(\log q_{it}={\sum }_{j=1}^{p}\beta _{ij}\log p_{jt}+c_{it}^{\prime }\phi _{i}+\varepsilon _{it}\)

The coefficient β_{ij} is the elasticity of demand for product i with respect to the price of product j. We do not impose sign restrictions on the β_{ij}’s so we can accommodate both substitutes (β_{ij} > 0) and complements (β_{ij} < 0). Log-linear models are also flexible in the sense that they possess enough parameters to serve as a first-order approximation to a valid Marshallian demand system (Diewert, 1974; Pollak & Wales, 1992). However, because there are p^{2} elasticities in a system with p goods, regularization is needed at scale.
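As a toy illustration (not the paper's data or code), a single own-price elasticity in a log-linear model can be recovered by OLS on simulated data; all numbers below are hypothetical, and with p products the full system would contain p^{2} such coefficients:

```python
import random

# Sketch: recover an own-price elasticity by OLS from data simulated under a
# one-product log-linear demand model, log q_t = beta * log p_t + noise.
# The true elasticity of -1.5 and all scales are hypothetical.
random.seed(7)
beta_true = -1.5
log_p = [random.gauss(0.0, 1.0) for _ in range(400)]              # log prices
log_q = [beta_true * x + random.gauss(0.0, 0.5) for x in log_p]   # log sales

xbar = sum(log_p) / len(log_p)
ybar = sum(log_q) / len(log_q)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(log_p, log_q))
sxx = sum((x - xbar) ** 2 for x in log_p)
beta_hat = sxy / sxx  # OLS slope = estimated own-price elasticity

print(round(beta_hat, 2))
```

With p products the analogous regression for each product has p price coefficients, which is what motivates the regularization discussed next.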

### 2.2 Sparse shrinkage and global-local priors

Many priors have been proposed for regression coefficients like β_{ij}, each having different tail behavior and thus inducing different forms of shrinkage (see, e.g., Bhadra et al., 2019). We specifically focus on global-local priors (Polson & Scott, 2010), which are scale mixtures of normals:

\(\beta _{ij}\mid \lambda _{ij}^{2},\tau ^{2}\sim \text {N}(0,\tau ^{2}\lambda _{ij}^{2}),\qquad \lambda _{ij}^{2}\sim G\)

Here τ^{2} is the “global” variance that controls the overall amount of shrinkage across all elasticity parameters β_{ij}, while \(\lambda _{ij}^{2}\) is a “local” variance that allows for component-wise deviations from the shrinkage imposed by τ^{2}. The local variances are distributed according to a mixing distribution G, which is often assumed to be absolutely continuous and will thus admit an associated density g(⋅).


For example, ridge-type shrinkage arises when \(\lambda _{ij}^{2}=1\), the Bayesian lasso arises when \(\lambda _{ij}^{2}\) follows an exponential distribution, and the horseshoe arises when λ_{ij} follows a half-Cauchy distribution. The priors can be compared using their shrinkage profiles, which measure the amount by which the posterior mean shrinks the least squares estimate of a regression coefficient to zero, and are related to the shape of the mixing density g(⋅). For example, the tails of an exponential mixing density are lighter than the polynomial tails of the half-Cauchy, suggesting that the Bayesian lasso may tend to over-shrink large regression coefficients and under-shrink small ones relative to the horseshoe (Polson & Scott, 2010; Datta & Ghosh, 2013; Bhadra et al., 2016).
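The difference in tail behavior is easy to see by simulation. The sketch below (illustrative threshold and draw count, not from the paper) compares how often lasso-type and horseshoe-type local variances exceed a large value:

```python
import math
import random

# Sketch: compare tail behavior of lasso vs. horseshoe local variances.
# Lasso: lambda^2 ~ Exponential(1); horseshoe: lambda ~ half-Cauchy(0, 1).
# The threshold of 100 and 50,000 draws are arbitrary illustration choices.
random.seed(0)
n_draws, threshold = 50_000, 100.0

lasso_tail = sum(random.expovariate(1.0) > threshold for _ in range(n_draws))
hs_tail = 0
for _ in range(n_draws):
    lam = abs(math.tan(math.pi * (random.random() - 0.5)))  # half-Cauchy draw
    hs_tail += lam * lam > threshold

# The half-Cauchy's polynomial tails put far more mass on very large local
# variances, which is what lets the horseshoe leave large coefficients unshrunk.
print(lasso_tail, hs_tail)
```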

## 3 Hierarchical global-local priors

### 3.1 Notation

We assume products are organized in a classification tree with levels \(\ell =0,\dotsc ,L-1\), where level 0 corresponds to the products themselves. We define n_{ℓ} to be the number of nodes (groups) on level ℓ, where \(n_{0}>n_{1}>\dotsb >n_{L-1}\).

For each product-level elasticity β_{ij}, we let 𝜃_{π(ij|m)} represent the relationship between the level-m ancestors of i and j. An example of this lineage of parameters is shown in Fig. 1. The darkest square in the left-hand-side grid denotes β_{ij}. The level-1 parent of (i,j) is π(ij|1) = (4,2) and the level-2 parent of (i,j) is π(ij|2) = (2,1). The idea is to then direct shrinkage of β_{ij} towards 𝜃_{π(ij|1)} (grid in the middle), which is in turn shrunk towards 𝜃_{π(ij|2)} (grid on the right).
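The parent notation can be made concrete with a small sketch. The group assignments below are hypothetical, chosen only to reproduce the Fig. 1 example:

```python
# Sketch: computing the ancestor map pi(ij|m) from group memberships.
# The group labels are hypothetical, picked so that the Fig. 1 example holds:
# pi(ij|1) = (4, 2) and pi(ij|2) = (2, 1).
level1_group = {"i": 4, "j": 2}   # product -> level-1 group
level2_group = {4: 2, 2: 1}       # level-1 group -> level-2 group

def pi(i, j, m):
    """Return the level-m ancestor pair of the product pair (i, j)."""
    gi, gj = level1_group[i], level1_group[j]
    if m == 1:
        return (gi, gj)
    if m == 2:
        return (level2_group[gi], level2_group[gj])
    raise ValueError("this sketch has two levels above the products")

print(pi("i", "j", 1), pi("i", "j", 2))
```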

### 3.2 Prior construction

At the bottom of the hierarchy, each product-level elasticity is given a conditionally normal prior, \(\beta _{ij}\sim \text {N}(\theta _{\pi (ij|1)},\tau _{\beta }^{2}{\Psi }_{ij})\), with global variance \(\tau _{\beta }^{2}\), local variance Ψ_{ij}, and prior mean 𝜃_{π(ij|1)}. Note that this notation holds for any (i,j) pair including the own elasticities where i = j. However, in order to account for differences in the expected signs of own and cross-price elasticities, we will ultimately build up two separate hierarchical structures: one for the β_{ii}’s and one for the β_{ij}’s. The former will shrink the product-level own elasticities towards higher-level own-price elasticities whereas the latter will shrink product-level cross elasticities towards higher-level cross-price elasticities. For ease of exposition, we will focus our prior construction on the case where i≠j and the β_{ij}’s are cross elasticities.

Higher-level elasticities follow analogous conditionally normal priors, \(\theta _{\pi (ij|\ell )}\sim \text {N}(\theta _{\pi (ij|\ell +1)},\tau _{\ell }^{2}{\Psi }_{\pi (ij|\ell )})\), where \(\tau _{\ell }^{2}\) is the level-ℓ global variance of 𝜃_{π(ij|ℓ)}, Ψ_{π(ij|ℓ)} is a local variance, and 𝜃_{π(ij|ℓ+1)} is the parent cross-group elasticity to 𝜃_{π(ij|ℓ)}. This hierarchical mean structure allows the direction of shrinkage to be governed by the classification tree. That is, in the absence of a strong signal in the data, elasticities will be shrunk towards higher-level elasticity parameters instead of zero.

The parameters \(\lambda _{ij}^{2}\) and \(\lambda _{\pi (ij|\ell )}^{2}\) are specific to β_{ij} and 𝜃_{π(ij|ℓ)}, respectively, and represent local variances absent any hierarchical structure connecting variances across levels. That is, without a hierarchy of variances we would simply have \({\Psi }_{ij}=\lambda _{ij}^{2}\) and \({\Psi }_{\pi (ij|\ell )}=\lambda _{\pi (ij|\ell )}^{2}\). With the product hierarchy of variances, \({\Psi }_{\pi (ij|\ell )}={\prod }_{s=\ell }^{L-1}\lambda _{\pi (ij|s)}^{2}\), so the induced local variances Ψ will be small whenever either \(\lambda _{\pi (ij|\ell )}^{2}\) is small or any \(\lambda _{\pi (ij|s)}^{2}\) is small for s > ℓ (i.e., higher levels in the tree), which allows shrinkage to propagate down through the tree. Taken together with the hierarchical mean structure, the hierarchical variance structure implies that, in the absence of a strong signal in the data, price elasticities will be strongly shrunk towards higher-level elasticities and these higher-level group elasticities will be strongly shrunk towards each other.

If the tree has a single level, the construction reduces to a standard global-local prior on β_{ij} that has a fixed mean \(\bar {\theta }\) and does not encode any information about product groups. If the tree has at least two levels, then β_{ij} will be shrunk towards its parent elasticity and shrinkage will be allowed to propagate. Note that the level at which shrinkage begins to propagate is also a choice of the researcher. In the examples below, shrinkage starts to propagate at the top level (ℓ = L − 1) of the tree.
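A minimal sketch of this construction for a single cross elasticity (three levels, hypothetical values) shows how a small local variance high in the tree collapses lower-level parameters onto their parents:

```python
import random

# Sketch of the hierarchical prior for one cross elasticity (hypothetical
# values): a top-level theta generates a level-1 theta, which generates beta.
# Psi at level l is the product of lambda^2 from level l up the tree.
random.seed(1)

def draw_beta(theta_top, lam2_by_level, tau2=1.0):
    """lam2_by_level = [lambda^2 at level 0 (product), level 1, level 2]."""
    psi1 = lam2_by_level[1] * lam2_by_level[2]   # Psi for the level-1 theta
    psi0 = lam2_by_level[0] * psi1               # Psi for the product beta
    theta1 = random.gauss(theta_top, (tau2 * psi1) ** 0.5)  # level-1 elasticity
    beta = random.gauss(theta1, (tau2 * psi0) ** 0.5)       # product elasticity
    return theta1, beta

# With lambda^2 = 0 at the top level, shrinkage propagates down the tree:
# both theta1 and beta collapse exactly onto the top-level elasticity.
theta1, beta = draw_beta(theta_top=0.5, lam2_by_level=[1.0, 1.0, 0.0])
print(theta1, beta)
```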

| Parameter | Level | Description |
|---|---|---|
| *Elasticities* | | |
| β _{ij} | ℓ = 0 | Price elasticity of demand for product i with respect to the price of product j |
| 𝜃 _{π(ij\|ℓ)} | ℓ ≥ 1 | Level-ℓ ancestor elasticity of the (i,j) product pair |
| *Local Variances* | | |
| \(\lambda _{ij}^{2}\) | ℓ = 0 | Local variance for β _{ij} |
| \(\lambda _{\pi (ij|\ell )}^{2}\) | ℓ ≥ 1 | Local variance for 𝜃 _{π(ij\|ℓ)} |
| Ψ _{ij} | ℓ = 0 | Product of all higher-level local variances including \(\lambda _{ij}^{2}\) |
| Ψ _{π(ij\|ℓ)} | ℓ ≥ 1 | Product of all higher-level local variances including \(\lambda _{\pi (ij|\ell )}^{2}\) |
| *Global Variances* | | |
| \(\tau _{\beta }^{2}\) | ℓ = 0 | Global variance across all product elasticities β _{ij} |
| \(\tau _{\ell }^{2}\) | ℓ ≥ 1 | Global variance across all level-ℓ elasticities 𝜃 _{π(ij\|ℓ)} |

### 3.3 Choice of mixing densities

The prior construction is completed by choosing the mixing distributions G_{ℓ}. As discussed in Section 2.2, the tails of the associated densities g_{ℓ}(⋅) will play a key role in shaping the shrinkage imposed by the prior. Although there are now many possible choices for g_{ℓ}(⋅), we confine our attention to three forms of shrinkage: (i) ridge, where the mixing density is a degenerate distribution and local variances are fixed to one; (ii) lasso, with an exponential mixing density; and (iii) horseshoe, with a half-Cauchy mixing density. These three types of priors remain common choices in empirical work and have very different shrinkage profiles (Bhadra et al., 2019).

A remaining question is how the choices of g_{ℓ}(⋅) at higher levels in the tree affect shrinkage behavior at the product level, which we address in the following section.

### 3.4 Some theory on shrinkage properties

We study the shrinkage properties of the hierarchical prior through the marginal prior for β_{ij}. Because global-local priors are scale mixtures of normals, the heaviness of the tails of this marginal prior will be determined by the tails of the mixing density (Barndorff-Nielsen et al., 1982). However, in our setting this analysis is complicated by the fact that the marginal prior for β_{ij} will depend on multiple mixing densities in the hierarchical global-local structure.

Our results are threefold. First, we show that the marginal prior for β_{ij} still retains a scale mixture of normals representation, and so the mixing densities will continue to play a key role in shaping the shrinkage profile for β_{ij}. Second, we show that if the same heavy-tailed mixing density is specified at each level of the tree, then its heaviness will be preserved under the hierarchical product structure that we impose on local variances. Finally, we show that even if a combination of heavy and light-tailed mixing densities is specified across different levels, the heavy-tailed mixing densities will ultimately dominate and shape the product-level shrinkage profile.

We first consider the shrinkage of β_{ij} to 𝜃_{π(ij|m)}, which is the shrinkage of the product-level elasticity to its level-m parent elasticity. This type of “vertical” shrinkage allows us to assess how quickly product-level elasticities can be pulled towards higher-level elasticities. Here we can write β_{ij} as a function of its parent mean and an error term:

\(\beta _{ij}=\theta _{\pi (ij|1)}+e_{ij},\qquad e_{ij}\sim \text {N}(0,\tau _{\beta }^{2}{\Psi }_{ij})\)

Recursively substituting for each parent mean, we can then write β_{ij} as a function of its level-m parent elasticity and a sum of errors across levels:

\(\beta _{ij}=\theta _{\pi (ij|m)}+e_{ij}+{\sum }_{\ell =1}^{m-1}e_{\pi (ij|\ell )},\qquad e_{\pi (ij|\ell )}\sim \text {N}(0,\tau _{\ell }^{2}{\Psi }_{\pi (ij|\ell )})\)

The variance of each error term depends on Ψ_{π(ij|ℓ)}, which is in turn a product of \(\lambda _{\pi (ij|\ell )}^{2}, \dots , \lambda _{\pi (ij|L-1)}^{2}\). Thus, the local variances will be a sum of products, and if there is a value of s (for s ≤ m) for which \(\lambda _{\pi (ij|s)}^{2}\) is “small” then β_{ij} will tend to be very close to 𝜃_{π(ij|m)}.

We next consider the shrinkage of β_{ij} to \(\beta _{i^{\prime }j^{\prime }}\) (for i≠j, \(i^{\prime }\ne j^{\prime }\), and \(i\ne i^{\prime }\)), which is the shrinkage between two product-level elasticities. This type of “horizontal” shrinkage allows us to assess the extent to which elasticities become more similar as they become closer in the tree. Formally define \(m^{\star }=\min \limits \{m:\pi (ij|m)=\pi (i^{\prime }j^{\prime }|m)\}\) to be the lowest level in the tree such that all four products (\(i,j,i^{\prime },j^{\prime }\)) share a common ancestor (i.e., belong to the same group at some level in the tree), where m^{⋆} = L if no common parent node exists within the tree. Then we can write

\(\beta _{ij}-\beta _{i^{\prime }j^{\prime }}=e_{ij}-e_{i^{\prime }j^{\prime }}+{\sum }_{\ell =1}^{m^{\star }-1}\left (e_{\pi (ij|\ell )}-e_{\pi (i^{\prime }j^{\prime }|\ell )}\right )\)

since the shared level-m^{⋆} parent elasticity cancels from the difference. The number of error terms above any level ℓ is m^{⋆} − 1 − ℓ, and so the variance of differences will tend to be larger if m^{⋆} is larger (i.e., the products are less similar). The form of the variances in Eq. 12 implies that if Ψ_{π(ij|m)} is “small” on level m then Ψ_{π(ij|s)} and \({\Psi }_{\pi (i^{\prime }j^{\prime }|s)}\) will tend to be small for s < m, and so the variance of the difference will tend to be smaller further down the tree. This allows shrinkage to propagate down the tree, with subsequent sub-categorizations of products tending to have similar cross-elasticities.

To summarize, both β_{ij} and the differences \((\beta _{ij} - \beta _{i^{\prime }j^{\prime }})\) can be expressed as normal scale mixtures, and so, like in sparse signal detection settings, the shape of the marginal prior will again be determined by the mixing density (Barndorff-Nielsen et al., 1982). However, while there is only one mixing density in traditional regression priors, the marginal priors for β_{ij} and \((\beta _{ij} - \beta _{i^{\prime }j^{\prime }})\) involve a “scaled sum of products” transformation over many mixing densities. It is therefore not clear whether the heaviness of the mixing density specified at level ℓ is: (i) preserved under the scaled sum of products transformation; or (ii) tarnished by mixing densities with lighter tails at higher levels in the tree. We clarify both points in the following two propositions.

For example, if each λ_{s} has a half-Cauchy prior, then each \({\lambda _{s}^{2}}\) is an inverted-beta random variable with density \(g(\lambda ^{2})\propto (\lambda ^{2})^{-1/2}(1+\lambda ^{2})^{-1}\) (Polson & Scott, 2012b), which is regularly varying with index −3/2, and so \({\lambda _{s}^{2}}\) is regularly varying with index 1/2 (Bingham et al., 1987). Then by Proposition 1, the sum of products would also have regularly varying tails, and the different forms of shrinkage in Eqs. 10 and 12 will all have tails of the same heaviness as a standard horseshoe prior.

Proposition 2: Let \(\text {RV}=\{\ell :G_{\ell }\text { is regularly varying with index }\alpha \}\) be non-empty. If there exists an 𝜖 > 0 such that G_{s} has a finite α + 𝜖 moment for all s∉RV, then ζ is regularly varying with index α.

Proof: A product of independent random variables, each with a finite α + 𝜖 moment, also has a finite α + 𝜖 moment, so any Ψ_{ℓ} built only from local variances with s∉RV has a finite α + 𝜖 moment. Since, by assumption, at least one of \({\lambda _{1}^{2}},\dots , {\lambda _{L}^{2}}\) is regularly varying, at least one of \({\Psi }_{1},\dotsc ,{\Psi }_{L}\) is regularly varying while the others are guaranteed to have a finite α + 𝜖 moment. Therefore, the closure properties of regularly varying random variables imply that ζ is also regularly varying with index α (Bingham et al., 1987). □

For example, the regularly varying tails of the half-Cauchy will dominate both a point mass (i.e., ridge shrinkage with λ^{2} = 1) and an exponential distribution (i.e., lasso shrinkage). Therefore, heavy tails at any level of the tree are all that is required to get sparsity-inducing shrinkage for the product-level elasticities. We explore different combinations of shrinkage in the simulations and empirical applications below.
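A quick simulation illustrates this dominance (illustrative threshold and draw count, not from the paper): the product of a heavy-tailed and a light-tailed local variance retains far more tail mass than the light-tailed one alone:

```python
import math
import random

# Sketch: the product of a heavy-tailed local variance (half-Cauchy squared)
# and a light-tailed one (exponential) keeps polynomial tails, while the
# exponential alone has essentially no mass beyond the same threshold.
random.seed(2)
n_draws, threshold = 50_000, 100.0

exp_tail, prod_tail = 0, 0
for _ in range(n_draws):
    light = random.expovariate(1.0)                           # lasso-type lambda^2
    heavy = math.tan(math.pi * (random.random() - 0.5)) ** 2  # horseshoe-type lambda^2
    exp_tail += light > threshold
    prod_tail += light * heavy > threshold

print(exp_tail, prod_tail)
```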

## 4 Posterior computation

The demand system for product i can be written as \(y_{i}=\text {X}\beta _{i}+C_{i}\phi _{i}+\varepsilon _{i}\), where y_{i} is the n × 1 vector of log sales for product i, X is the n × p matrix of log prices, β_{i} is the p × 1 vector of own and cross-price elasticities associated with product i, and C_{i} is an n × d matrix of control variables with coefficients ϕ_{i}. In vector form, we have

\(y=\tilde {\text {X}}\beta +\tilde {C}\phi +\varepsilon ,\)

where y, β, ϕ, and ε stack their product-specific counterparts, \(\tilde {\text {X}}=I_{p}\otimes \text {X}\), and \(\tilde {C}\) is block diagonal in the C_{i}.

We then define the p^{2} × p^{2} prior covariance matrix for β as \({\Lambda }_{*}=\tau _{\beta }^{2}\text {diag}(\text {vec}({\Lambda }))\), where Λ is a p × p matrix of local variances Ψ_{ij} as defined in Eq. 5. Note that for a standard global-local prior, the (i,j)th element of Λ would be \(\lambda _{ij}^{2}\). We place \(\text {N}(\bar {\phi },A_{\phi }^{-1})\) priors on the control variable coefficients, which are conditionally conjugate to the normal likelihood given Σ. Inverse Wishart priors are commonly used for covariance matrices in Bayesian SUR models; however, if p > n, then the estimate of Σ will be rank deficient. One approach would be to also regularize Σ (Li et al., 2019; Li et al., 2021). We instead impose a diagonal restriction \({\Sigma }=\text {diag}({\sigma _{1}^{2}},\dotsc ,{\sigma _{p}^{2}})\) and place independent IG(a,b) priors on each \({\sigma _{j}^{2}}\).

The local variances λ^{2} and global variances τ^{2} can also be sampled independently, but the form of their respective posteriors will depend on the choice of prior. Under ridge shrinkage, each λ^{2} = 1 so no posterior sampling is necessary. Under lasso shrinkage, each λ^{2} follows an independent exponential distribution, and so the full conditionals of 1/λ^{2} have independent inverse Gaussian distributions (Park & Casella, 2008). Under horseshoe shrinkage, each λ follows an independent half-Cauchy distribution. We follow Makalic and Schmidt (2015) and represent the half-Cauchy as a scale mixture of inverse gammas, which is conjugate to the normal density so the target full conditional can be sampled from directly. Details are provided in Appendix A.1.

Sampling from the full conditional of β requires inverting a p^{2} × p^{2} matrix, which is computationally expensive when p is large. We therefore present two strategies to facilitate scalability in the following subsections.

### 4.1 Diagonal restriction on Σ

Without further restrictions, \((\tilde {\text {X}}^{\prime }\tilde {\text {X}}+{\Lambda }_{*}^{-1})\) is a dense p^{2} × p^{2} matrix, and any sampler that directly inverts this matrix will be hopeless for large p. For example, even a sampler that calculates the inverse using Cholesky decompositions has complexity \(\mathcal {O}(p^{6})\). If instead Σ is assumed to be diagonal, then both \(\tilde {\text {X}}^{\prime }\tilde {\text {X}}\) and \((\tilde {\text {X}}^{\prime }\tilde {\text {X}}+{\Lambda }_{*}^{-1})\) will have block diagonal structures, with each of the p blocks containing a p × p matrix. Computing the inverse of \((\tilde {\text {X}}^{\prime }\tilde {\text {X}}+{\Lambda }_{*}^{-1})\) then amounts to inverting each p × p block, which has computational complexity \(\mathcal {O}(p^{4})\) using Cholesky decompositions. While this is better than inverting \((\tilde {\text {X}}^{\prime }\tilde {\text {X}}+{\Lambda }_{*}^{-1})\) directly, it can still be prohibitively expensive for large p.

### 4.2 Fast sampling normal scale mixtures

The block-diagonal approach still scales poorly in p. Our second strategy draws β using a fast sampling algorithm for normal scale mixtures, which generates a draw from the full conditional without inverting the p^{2} × p^{2} precision matrix \((\tilde {\text {X}}^{\prime }\tilde {\text {X}}+{\Lambda }_{*}^{-1})\). The complexity of this approach scales with n rather than p, so the computational gains are largest when p is much larger than n.
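A sketch of this idea in the style of fast normal scale-mixture samplers (a hypothetical NumPy translation with unit noise variance, not the paper's Rcpp implementation): a draw is produced by solving an n × n system, and a Woodbury identity check confirms the implied posterior mean:

```python
import numpy as np

# Sketch of a fast sampler for beta ~ N(A^{-1} X'y, A^{-1}) with
# A = X'X + Lambda^{-1} (unit noise variance assumed). Only an n x n
# system is solved, so the cost is driven by n rather than p.
rng = np.random.default_rng(4)
n, p = 10, 200
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
lam = rng.uniform(0.5, 2.0, size=p)  # diagonal of Lambda (prior variances)

def fast_draw(X, y, lam, rng):
    n, p = X.shape
    u = np.sqrt(lam) * rng.standard_normal(p)  # u ~ N(0, Lambda)
    v = X @ u + rng.standard_normal(n)         # v ~ N(Xu, I_n)
    M = X * lam @ X.T + np.eye(n)              # n x n system matrix
    w = np.linalg.solve(M, y - v)
    return u + lam * (X.T @ w)

draw = fast_draw(X, y, lam, rng)

# Sanity check: by the Woodbury identity, the implied posterior mean
# Lambda X'(X Lambda X' + I)^{-1} y equals (X'X + Lambda^{-1})^{-1} X'y.
mean_fast = lam * (X.T @ np.linalg.solve(X * lam @ X.T + np.eye(n), y))
A = X.T @ X + np.diag(1.0 / lam)
mean_direct = np.linalg.solve(A, X.T @ y)
print(np.allclose(mean_fast, mean_direct))
```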

### 4.3 Scalability

To quantify the gains, we benchmark the fast sampler against a traditional sampler that inverts the p^{2} × p^{2} precision matrix \((\tilde {\text {X}}^{\prime }\tilde {\text {X}}+{\Lambda }_{*}^{-1})\) via Cholesky decompositions (see, e.g., chapters 2.12 and 3.5 of Rossi et al., 2005). In both cases we assume Σ is diagonal. The samplers are coded in Rcpp (Eddelbuettel and François, 2011) and run on a MacBook Pro laptop with 32GB of RAM and an Apple M1 Max processor. Figure 2 plots the computation time in log seconds against the number of products p. We find that the fast sampler offers significant computational savings: it is roughly two times faster when p = 200, 10 times faster when p = 500, and 30 times faster when p = 1000.

## 5 Simulation experiments

In the first specification (I), each elasticity β_{ij} is generated from a three-level hierarchical prior. The top-level coefficients 𝜃_{π(ij|2)} are sampled from a uniform distribution over \([-3,-1]\cup [1,3]\), and we fix \(\lambda _{\pi (ij|\ell )}^{2}=1\) and \(\tau _{\ell }^{2}=1\) for all (i,j) and across all levels. The middle-level coefficients 𝜃_{π(ij|1)} and product elasticities β_{ij} are generated through the model outlined in Section 3.2. In this specification, all pairs of goods have a non-zero cross-price elasticity.

In the second specification (II), each β_{ij} is generated from a three-level hierarchical prior where 75% of the product-level local variances Ψ_{ij} are set to zero, so that the corresponding product-level elasticity β_{ij} is exactly equal to its prior mean. Thus, all pairs of product groups have a non-zero cross-price elasticity and many product-level elasticities are exactly equal to the group-level elasticity. This creates a structure where β appears dense, but is sparse after subtracting off the prior mean parameters 𝜃_{π(ij|1)}.

In the third specification (III), each β_{ij} is generated from a three-level hierarchical prior where 75% of the level-1 coefficients 𝜃_{π(ij|1)} and local variances Ψ_{π(ij|1)} are both set to zero. This allows the product-level elasticities to inherit their sparsity from higher levels in the tree, which in turn produces blocks of cross elasticities that are either all dense or exactly equal to zero.

In the fourth specification (IV), each β_{ij} is generated from a non-hierarchical prior with 95% of all elasticities set to zero.

Across all specifications, we set ϕ_{i} = 0, define \({\Sigma }=\text {diag}({\sigma _{1}^{2}},\dotsc ,{\sigma _{p}^{2}})\) with \({\sigma _{j}^{2}}=1\), and generate the elements of the data matrices X and C_{j} from a N(0,1) distribution. We generate 25 data sets from each specification. Examples of the resulting elasticity matrices are shown in Fig. 3.

| Data: n = 50, p = 100 | Estimation RMSE (I) | (II) | (III) | (IV) | Correct Signs (I) | (II) | (III) | (IV) |
|---|---|---|---|---|---|---|---|---|
| *Standard Shrinkage* | | | | | | | | |
| β-Ridge | 2.541 | 2.375 | 1.219 | 0.492 | 0.77 | 0.78 | 0.87 | 1.00 |
| β-Lasso | 2.622 | 2.473 | 1.079 | 0.314 | 0.76 | 0.77 | 0.89 | 1.00 |
| β-Horseshoe | 2.879 | 2.745 | 0.969 | 0.098 | 0.73 | 0.74 | 0.88 | 1.00 |
| *Hierarchical Shrinkage* | | | | | | | | |
| 𝜃-Ridge, β-Ridge | 1.033 | 0.530 | 0.515 | 0.490 | 0.94 | 0.97 | 0.94 | 1.00 |
| 𝜃-Ridge, β-Lasso | 1.047 | 0.461 | 0.451 | 0.313 | 0.93 | 0.98 | 0.95 | 1.00 |
| 𝜃-Ridge, β-Horseshoe | 1.126 | 0.385 | 0.407 | 0.102 | 0.92 | 0.98 | 0.95 | 1.00 |
| 𝜃-Horseshoe, β-Ridge | 1.044 | 0.526 | 0.190 | 0.472 | 0.93 | 0.98 | 0.98 | 1.00 |
| 𝜃-Horseshoe, β-Lasso | 1.053 | 0.461 | 0.192 | 0.310 | 0.93 | 0.98 | 0.98 | 1.00 |
| 𝜃-Horseshoe, β-Horseshoe | 1.130 | 0.386 | 0.270 | 0.100 | 0.92 | 0.98 | 0.97 | 1.00 |

| Data: n = 100, p = 300 | Estimation RMSE (I) | (II) | (III) | (IV) | Correct Signs (I) | (II) | (III) | (IV) |
|---|---|---|---|---|---|---|---|---|
| *Standard Shrinkage* | | | | | | | | |
| β-Ridge | 2.938 | 2.743 | 1.275 | 0.545 | 0.72 | 0.73 | 0.83 | 0.99 |
| β-Lasso | 2.972 | 2.790 | 1.232 | 0.427 | 0.72 | 0.73 | 0.84 | 1.00 |
| β-Horseshoe | 3.225 | 3.045 | 1.185 | 0.068 | 0.69 | 0.70 | 0.84 | 1.00 |
| *Hierarchical Shrinkage* | | | | | | | | |
| 𝜃-Ridge, β-Ridge | 1.162 | 0.585 | 0.530 | 0.544 | 0.94 | 0.98 | 0.93 | 0.99 |
| 𝜃-Ridge, β-Lasso | 1.170 | 0.547 | 0.488 | 0.427 | 0.93 | 0.98 | 0.94 | 1.00 |
| 𝜃-Ridge, β-Horseshoe | 1.253 | 0.503 | 0.471 | 0.069 | 0.93 | 0.99 | 0.94 | 1.00 |
| 𝜃-Horseshoe, β-Ridge | 1.169 | 0.586 | 0.170 | 0.543 | 0.93 | 0.98 | 0.98 | 0.99 |
| 𝜃-Horseshoe, β-Lasso | 1.172 | 0.548 | 0.174 | 0.426 | 0.93 | 0.98 | 0.98 | 1.00 |
| 𝜃-Horseshoe, β-Horseshoe | 1.255 | 0.508 | 0.233 | 0.068 | 0.93 | 0.99 | 0.97 | 1.00 |

## 6 Empirical application

### 6.1 Data

| Category | Subcategories | No. of Products | Share of Revenue |
|---|---|---|---|
| BEER/ALE | Domestic Beer/Ale | 62 | 16.1 |
| | Imported Beer/Ale | | |
| CARBONATED BEVERAGES | Low Calorie Soft Drinks | 30 | 19.1 |
| | Regular Soft Drinks | | |
| | Seltzer/Tonic/Club Soda | | |
| COFFEE | Ground Coffee | 27 | 8.5 |
| | Ground Decaffeinated Coffee | | |
| | Instant Coffee | | |
| | Single Cup Coffee | | |
| | Whole Coffee Beans | | |
| COLD CEREAL | Ready-to-Eat Cereal | 53 | 9.3 |
| FZ DINNERS/ENTREES | Fz Handheld Entrees | 36 | 7.8 |
| | Multi-Serve Fz Dinners | | |
| | Single-Serve Fz Dinners | | |
| FZ PIZZA | Fz Pizza | 7 | 5.0 |
| MILK | Rfg Almond Milk | 10 | 12.9 |
| | Rfg Flavored Milk/Eggnog/Buttermilk | | |
| | Rfg Skim/Lowfat Milk | | |
| | Rfg Soy Milk | | |
| | Rfg Whole Milk | | |
| SALTY SNACKS | Cheese Snacks | 35 | 13.4 |
| | Corn Snacks | | |
| | Other Salted Snacks | | |
| | Potato Chips | | |
| | Pretzels | | |
| | Ready to Eat Popcorn/Caramel Corn | | |
| | Tortilla/Tostada Chips | | |
| YOGURT | Rfg Yogurt | 15 | 7.9 |
| Total Count = 9 | 28 | 275 | 100% |

### 6.2 Models

### 6.3 Results

#### 6.3.1 Predictive fit

| | All Products: Mean | SD | Extrapolated Price Levels: Mean | SD | Limited Price Variation: Mean | SD |
|---|---|---|---|---|---|---|
| *Standard Shrinkage* | | | | | | |
| β-Ridge | 0.846 | (0.126) | 0.871 | (0.190) | 0.968 | (0.347) |
| β-Lasso | 0.847 | (0.117) | 0.865 | (0.181) | 0.960 | (0.331) |
| β-Horseshoe | 0.895 | (0.170) | 0.937 | (0.258) | 1.085 | (0.449) |
| *Hierarchical Shrinkage* | | | | | | |
| 𝜃-Ridge, β-Ridge | 0.808 | (0.104) | 0.816 | (0.148) | 0.895 | (0.276) |
| 𝜃-Ridge, β-Lasso | 0.814 | (0.113) | 0.824 | (0.170) | 0.909 | (0.314) |
| 𝜃-Ridge, β-Horseshoe | 0.899 | (0.131) | 0.889 | (0.194) | 1.004 | (0.347) |
| 𝜃-Horseshoe, β-Ridge | 0.842 | (0.119) | 0.825 | (0.168) | 0.919 | (0.307) |
| 𝜃-Horseshoe, β-Lasso | 0.823 | (0.117) | 0.821 | (0.169) | 0.908 | (0.308) |
| 𝜃-Horseshoe, β-Horseshoe | 0.993 | (0.159) | 0.845 | (0.162) | 0.902 | (0.305) |

#### 6.3.2 Product-level elasticities

We next summarize the distributions of estimated own and cross-price elasticities, β_{ii} and β_{ij}. We also report the share of own elasticities that are negative and the share of own elasticities that are significant at the 5% level. Complete distributions of price elasticity and promotion effect estimates are shown in Appendix C.3. Elasticity estimates are markedly different across prior specifications. Starting with own elasticities, we find that standard priors produce distributions of own elasticities where roughly 84% of estimates are negative, 21% of estimates are significant, and the average (median) own elasticity is around -1.2 (-0.9). In contrast, hierarchical priors produce distributions of own elasticity estimates where 93% are negative, 50% are significantly different from zero, and the average (median) own elasticity is around -1.5 (-1.4). We believe the ability to produce more economically reasonable and precise own-elasticity estimates, simply by shrinking towards higher-level elasticities rather than zero, is a strength of our approach.

| | Neg | Sig | Own β _{ii}: Mean | 10th | 50th | 90th | Cross β _{ij}: Mean | 10th | 50th | 90th |
|---|---|---|---|---|---|---|---|---|---|---|
| *Standard Shrinkage* | | | | | | | | | | |
| β-Ridge | 84.0 | 25.8 | -1.20 | -2.97 | -1.09 | 0.41 | -0.000 | -0.000 | -0.000 | 0.000 |
| β-Lasso | 83.6 | 21.5 | -1.13 | -2.95 | -0.86 | 0.28 | -0.000 | -0.000 | -0.000 | 0.000 |
| β-Horseshoe | 85.1 | 16.7 | -1.23 | -3.38 | -0.82 | 0.26 | -0.001 | -0.004 | -0.000 | 0.003 |
| *Hierarchical Shrinkage* | | | | | | | | | | |
| 𝜃-Ridge, β-Ridge | 93.5 | 41.5 | -1.58 | -3.01 | -1.53 | -0.28 | -0.004 | -0.023 | -0.004 | 0.012 |
| 𝜃-Ridge, β-Lasso | 94.2 | 42.9 | -1.55 | -2.87 | -1.47 | -0.27 | -0.004 | -0.024 | -0.004 | 0.016 |
| 𝜃-Ridge, β-Horseshoe | 89.5 | 49.5 | -1.42 | -2.64 | -1.40 | 0.09 | -0.001 | -0.121 | -0.002 | 0.114 |
| 𝜃-Horseshoe, β-Ridge | 97.5 | 54.9 | -1.63 | -2.88 | -1.48 | -0.59 | -0.004 | -0.020 | -0.001 | 0.013 |
| 𝜃-Horseshoe, β-Lasso | 97.1 | 61.1 | -1.57 | -2.77 | -1.46 | -0.64 | -0.004 | -0.019 | -0.003 | 0.010 |
| 𝜃-Horseshoe, β-Horseshoe | 87.6 | 49.5 | -1.30 | -2.56 | -1.14 | 0.05 | -0.008 | -0.086 | -0.004 | 0.071 |

Turning to the cross elasticities, shrinking each β_{ij} towards a potentially non-zero prior mean leads to a build-up of mass away from zero when shrinkage is strong. Differences in the shape of cross-elasticity estimates can also be seen in the histograms reported in Appendix C.3.

#### 6.3.3 Higher-level elasticities

| Level | No. | Group | Largest (Most Positive) | Elasticity | Smallest (Most Negative) | Elasticity |
|---|---|---|---|---|---|---|
| Category | 1 | BEER/ALE | BEER/ALE | 0.014 | SALTY SNACKS | -0.017 |
| | 2 | CARBONATED BEVERAGES | CARBONATED BEVERAGES | 0.015 | SALTY SNACKS | -0.020 |
| | 3 | COFFEE | COFFEE | 0.009 | BEER/ALE | -0.008 |
| | 4 | COLD CEREAL | CARBONATED BEVERAGES | 0.007 | MILK | -0.023 |
| | 5 | FZ DINNERS/ENTREES | FZ DINNERS/ENTREES | 0.004 | CARBONATED BEVERAGES | -0.019 |
| | 6 | FZ PIZZA | FZ PIZZA | 0.002 | COFFEE | -0.006 |
| | 7 | MILK | MILK | 0.005 | COLD CEREAL | -0.010 |
| | 8 | SALTY SNACKS | FZ DINNERS/ENTREES | 0.002 | MILK | -0.015 |
| | 9 | YOGURT | FZ PIZZA | 0.003 | SALTY SNACKS | -0.010 |
| Subcategory | 1 | Domestic Beer/Ale | Domestic Beer/Ale | 0.040 | Ground Decaffeinated Coffee | -0.046 |
| | 2 | Imported Beer/Ale | Domestic Beer/Ale | 0.050 | Tortilla/Tostada Chips | -0.041 |
| | 3 | Low Calorie Soft Drinks | Regular Soft Drinks | 0.034 | Tortilla/Tostada Chips | -0.033 |
| | 4 | Regular Soft Drinks | Regular Soft Drinks | 0.024 | Tortilla/Tostada Chips | -0.033 |
| | 5 | Seltzer/Tonic Water/Club Soda | Low Calorie Soft Drinks | 0.022 | Corn Snacks | -0.021 |
| | 6 | Ground Coffee | Ground Coffee | 0.030 | Domestic Beer/Ale | -0.016 |
| | 7 | Ground Decaffeinated Coffee | Single Serve Fz Dinners/Entrees | 0.013 | Domestic Beer/Ale | -0.020 |
| | 8 | Instant Coffee | Whole Coffee Beans | 0.008 | Rfg Yogurt | -0.019 |
| | 9 | Single Cup Coffee | Instant Coffee | 0.010 | Domestic Beer/Ale | -0.011 |
| | 10 | Whole Coffee Beans | Single Serve Fz Dinners/Entrees | 0.014 | Ready-to-Eat Cereal | -0.017 |
| | 11 | Ready-to-Eat Cereal | Regular Soft Drinks | 0.017 | Rfg Flavored Milk/Eggnog/Buttermilk | -0.049 |
| | 12 | Fz Handheld Entrees | Domestic Beer/Ale | 0.009 | Regular Soft Drinks | -0.031 |
| | 13 | Multi Serve Fz Dinners/Entrees | Single Serve Fz Dinners/Entrees | 0.023 | Low Calorie Soft Drinks | -0.028 |
| | 14 | Single Serve Fz Dinners/Entrees | Single Serve Fz Dinners/Entrees | 0.041 | Regular Soft Drinks | -0.030 |
| | 15 | Fz Pizza | Ready-to-Eat Popcorn/Caramel Corn | 0.006 | Single Serve Fz Dinners/Entrees | -0.018 |
| | 16 | Rfg Almond Milk | Rfg Yogurt | 0.015 | Ready-to-Eat Cereal | -0.010 |
| | 17 | Rfg Flavored Milk/Eggnog/Buttermilk | Rfg Flavored Milk/Eggnog/Buttermilk | 0.008 | Ready-to-Eat Cereal | -0.016 |
| | 18 | Rfg Skim/Lowfat Milk | Rfg Whole Milk | 0.008 | Ready-to-Eat Cereal | -0.019 |
| | 19 | Rfg Soy Milk | Rfg Flavored Milk/Eggnog/Buttermilk | 0.007 | Ready-to-Eat Cereal | -0.013 |
| | 20 | Rfg Whole Milk | Rfg Whole Milk | 0.007 | Domestic Beer/Ale | -0.014 |
| | 21 | Cheese Snacks | Domestic Beer/Ale | 0.002 | Rfg Almond Milk | -0.017 |
| | 22 | Corn Snacks | Single Serve Fz Dinners/Entrees | 0.007 | Rfg Yogurt | -0.015 |
| | 23 | Other Salted Snacks | Single Serve Fz Dinners/Entrees | 0.012 | Rfg Flavored Milk/Eggnog/Buttermilk | -0.019 |
| | 24 | Potato Chips | Potato Chips | 0.010 | Rfg Flavored Milk/Eggnog/Buttermilk | -0.019 |
| | 25 | Pretzels | Fz Handheld Entrees | 0.010 | Rfg Whole Milk | -0.015 |
| | 26 | Ready-to-Eat Popcorn/Caramel Corn | Single Serve Fz Dinners/Entrees | 0.014 | Rfg Soy Milk | -0.016 |
| | 27 | Tortilla/Tostada Chips | Single Serve Fz Dinners/Entrees | 0.008 | Rfg Whole Milk | -0.020 |
| | 28 | Rfg Yogurt | Potato Chips | 0.011 | Ready-to-Eat Cereal | -0.022 |

#### 6.3.4 Shrinkage factors

To understand these differences, we examine posterior shrinkage factors. The shrinkage factor κ_{ij} ∈ [0,1] measures the weight that the posterior mean of β_{ij} places on its prior mean relative to the least squares estimate \(\hat {\beta }_{ij}\): when the signal in the data dominates, κ_{ij} → 0 and the posterior mean of β_{ij} converges to \(\hat {\beta }_{ij}\); when the noise dominates, κ_{ij} → 1 and the posterior mean of β_{ij} converges to the prior mean.
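The two limits can be sketched with the conjugate normal shrinkage formula (illustrative numbers; here `s2` denotes the sampling variance of the least squares estimate and `v` the prior variance):

```python
# Sketch: posterior mean of a single elasticity under a conjugate normal
# model, written as a weighted average with shrinkage factor kappa in [0, 1].
# beta_hat is the least squares estimate with sampling variance s2, and the
# prior is N(prior_mean, v); all numbers are illustrative.
def posterior_mean(beta_hat, s2, prior_mean, v):
    kappa = s2 / (s2 + v)  # weight placed on the prior mean
    return kappa, (1.0 - kappa) * beta_hat + kappa * prior_mean

# Strong signal (tiny sampling variance): kappa -> 0, estimate is kept.
k_signal, m_signal = posterior_mean(beta_hat=-2.0, s2=1e-6, prior_mean=-1.0, v=1.0)
# Weak signal (huge sampling variance): kappa -> 1, pulled to the prior mean.
k_noise, m_noise = posterior_mean(beta_hat=-2.0, s2=1e6, prior_mean=-1.0, v=1.0)
print(round(k_signal, 4), round(k_noise, 4))
```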

We compute the posterior median of each κ_{ij} and then report summary statistics across the distribution of estimates. The first finding is that there is a sizable difference in the strength of shrinkage between own and cross-price elasticities. We find that estimates of \(\tau _{\beta \text {cross}}^{2}\) tend to be four to five orders of magnitude smaller than \(\tau _{\beta \text {own}}^{2}\). Consequently, estimates of κ_{ij} (for i≠j) tend to be bunched at one while estimates of κ_{ii} are more dispersed throughout the unit interval. One explanation for this difference in shrinkage is that retail scanner data tends to exhibit a stronger signal for estimating own elasticities than cross elasticities (Hitsch et al., 2021). Our estimation problem is also high-dimensional as we are estimating 75,350 cross elasticity parameters from 78 weeks of training data.

| | \(\tau _{\beta \text {own}}^{2}\) | κ _{ii}: Min | Mean | Max | \(\tau _{\beta \text {cross}}^{2}\) | κ _{ij}: Min | Mean | Max |
|---|---|---|---|---|---|---|---|---|
| *Standard Shrinkage* | | | | | | | | |
| β-Ridge | 5.11 | 0.01 | 0.15 | 1.00 | 2.51E-06 | 1.00 | 1.00 | 1.00 |
| β-Lasso | 5.07 | 0.01 | 0.19 | 1.00 | 1.13E-05 | 1.00 | 1.00 | 1.00 |
| β-Horseshoe | 7.33 | 0.00 | 0.19 | 1.00 | 6.84E-06 | 0.01 | 1.00 | 1.00 |
| *Hierarchical Shrinkage* | | | | | | | | |
| 𝜃-Ridge, β-Ridge | 1.77 | 0.02 | 0.29 | 1.00 | 9.63E-05 | 1.00 | 1.00 | 1.00 |
| 𝜃-Ridge, β-Lasso | 1.82 | 0.02 | 0.35 | 1.00 | 3.15E-05 | 1.00 | 1.00 | 1.00 |
| 𝜃-Ridge, β-Horseshoe | 0.11 | 0.03 | 0.76 | 1.00 | 1.12E-05 | 0.01 | 1.00 | 1.00 |
| 𝜃-Horseshoe, β-Ridge | 0.21 | 0.02 | 0.49 | 1.00 | 6.79E-06 | 0.03 | 0.99 | 1.00 |
| 𝜃-Horseshoe, β-Lasso | 0.42 | 0.02 | 0.58 | 1.00 | 3.81E-05 | 0.02 | 1.00 | 1.00 |
| 𝜃-Horseshoe, β-Horseshoe | 0.08 | 0.01 | 0.76 | 1.00 | 1.51E-06 | 0.00 | 1.00 | 1.00 |

The second finding is that hierarchical priors allow for stronger shrinkage of own elasticities: for example, the average κ_{ii} is 0.19 for the β-Horseshoe model but 0.76 for the (𝜃-Ridge, β-Horseshoe) model. If the shrinkage points are misspecified (as appears to be the case for standard priors with a mean fixed at zero), then the prior variances will need to get larger to accommodate deviations from zero. Since the hierarchical priors center the product-level elasticities around more reasonable values, the prior variance can get smaller and shrinkage will “kick in” for noisy estimates. For differences in shrinkage factors across product categories, see Appendix C.4 where we plot the empirical CDF of posterior medians of κ_{ii} across both categories and models. We find appreciable variation in the strength of category-level shrinkage. While there is variation across models, we find that shrinkage tends to be heaviest in categories like BEER/ALE and CARBONATED BEVERAGES, and lightest in FZ PIZZA and SALTY SNACKS.

### 6.4 Discussion

#### 6.4.1 Retail prices and prior regularization

^{4}In many retail markets, for example, prices are notoriously “sticky” (Bils & Klenow, 2004) and exhibit limited variation over time—a feature that need not dissipate as more data are collected. With limited variation in prices, price coefficients will in turn be subject to heavy regularization. Analysts interested in using observational data to estimate price elasticities will almost always face a problem of weakly informative data, calling for more judicious prior choices.

#### 6.4.2 Interpretable market structure

^{5}As we have shown, log-linear demand models with appropriate forms of regularization can produce estimates of large elasticity matrices. Do different priors lead to appreciably different inferences about competition and market structure?