Scanner data are electronic transaction data that specify turnover and the number of items sold by barcode, e.g., the Global Trade Item Number. These data are of particular value and interest to theorists and practitioners who wish to measure the Cost of Living Index or the Consumer Price Index, since their complete content makes it possible to compute any price index formula, including superlative indices or CES (Constant Elasticity of Substitution) indices. Since the CES index requires the estimation of the elasticity of substitution, this paper focuses on verifying various methods of estimating this parameter based on scanner data. The paper considers both algebraic methods and methods based on the panel regression approach. The main achievement of the paper is the identification of the main factors that affect the estimated value of the elasticity of substitution, i.e., the type of data filter used and the level of data aggregation. The paper also verifies how the elasticity of substitution estimates affect the differences between the values of the CES indices based on these estimates.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1 Introduction
The Consumer Price Index (CPI) is designed to provide an estimate of changes in the prices of goods and services that households acquire for consumption between two periods (International Labour Office 2004, 2020). Given that in the CPI methodology, the basket of goods is fixed, the CPI can be regarded as a Cost of Goods Index (COGI). However, over time, the US Congress embraced the view that the CPI should reflect cost changes to maintain a constant standard of living (see (Gordon 1999) or visit: https://www.gao.gov/products/GGD-00-50). Consequently, based on this alternative approach, the CPI evolved into a Cost of Living Index (COLI). Or, to put it another way, most economists and statisticians agree that COLI is a methodological foundation for constructing the CPI.
The COLI measures changes in the cost of living that result from changes in the price of goods and services, with the condition that other external influences that affect living standards remain unchanged. The COLI concept is consistent with an economic approach, i.e., it is assumed here that a consumer can maximise utility or minimise expenditure, subject to a budget constraint, to achieve a certain utility level.
Although the CPI is often treated as a proxy for the COLI, the latter contains the entire set of consumption goods and services consumed by the households from which they derive utility. Consequently, many differences separate the two. For instance, the COLI includes the goods and services received for free as in-kind transfers from governments or non-profit institutions, and it considers changes in environmental factors that affect consumers’ well-being. However, the most easily recognised technical distinction is that the CPI is constructed using a Laspeyres-type index (International Labour Office 2004) formula that does not reflect the potential for consumer substitution in response to relative price changes.
With “traditional” data collection, the only reasonable and pragmatic measure of inflation is the Laspeyres index (International Labour Office 2004). This is because price data mostly come from interviewers in the field, and consumption level data come from the Household Budget Survey, which is never up to date. However, for more than 30 years, alternative data sources, such as scanner data or web-scraped data, have come to the fore in measuring inflation (International Labour Office 2020). Scanner data are of particular value and interest to theorists and practitioners who wish to measure the COLI since their content generates virtually no restrictions when applying any price index formula. Scanner data are transaction data that specify turnover and the number of items sold by barcode, e.g., the Global Trade Item Number (GTIN), formerly known as the European Article Number (EAN), or the Stock Keeping Unit (SKU). The data are characterised by a huge volume of products and contain complete information about items sold by retail chains at all their locations (outlets), e.g., information about prices and quantities at the barcode level, and product attributes, like the size, weight, sale unit, colour, or package quantity (Australian Bureau of Statistics 2016; Eurostat 2018).
The use of scanner data brings many benefits to statistical offices, as automating its acquisition and processing leads to savings in time and money. Nevertheless, the use of this type of data also has its limitations and methodological challenges. First, an IT system upgrade or the construction of an entirely new IT environment to handle this type of data is most often required. Further, the automatic classification of products into appropriate categories and their matching over time require the use of advanced machine learning and text analysis methods (Białek and Berȩsewicz 2021). This, in turn, requires building a competent team of people to handle these processes. From the point of view of inflation measurement itself, there are also many challenges related to, for example, the high turnover of scanned products. The strong seasonality of products and new and disappearing goods with short life cycles make even the choice of an index formula problematic, not to mention quality adjustments. The need to impute missing prices and thus the dilemma of choosing an imputation method, choosing the width of the time window when implementing multilateral indexes, or finally including products that have never before been observed in the dataset are just some of the potential problems (Eurostat 2018). And then there is the issue of the representativeness of the scanner data. Not all products in the CPI basket can be bought in a supermarket, and not every consumer shops in such places. Not all retailers or markets may have comprehensive scanner data available, and even when they do, there might be gaps or inconsistencies in the data due to technical glitches, data entry errors, or other operational issues. Additionally, scanner data, being transactional in nature, might not capture other relevant factors influencing price dynamics, such as promotional activities, loyalty programs, or other marketing strategies. Data from large supermarket chains might not capture the pricing dynamics in local mom-and-pop stores or online retailers. As a consequence, the lack of representativeness can lead to potential biases in the price indices computed on the basis of scanner data, making them less reflective of actual market conditions. Nevertheless, despite the mentioned difficulties, limitations and challenges, scanner data are gaining popularity and setting a new trend in inflation measurement.
As mentioned above, scanner data allow the calculation of any price index formula, including superlative indices (Diewert 1976), which are considered the best approximation of the COLI. Superlative indices (e.g., the Fisher index (Fisher 1922)) are unattainable in traditional data collection because there is no knowledge of current period consumption. Nevertheless, an interesting option is the ability to determine the CES (Constant Elasticity of Substitution) index (International Labour Office 2004; Lloyd 1975; Moulton 1996), which uses the same consumption data as the Laspeyres index. However, it also approximates the superlative indices very well, provided that we accurately determine the elasticity of substitution (see Sect. 2.3). As suggested in Balk (2000), the value of the elasticity of substitution should be less than at a higher data aggregation level but greater than at a lower level. This paper evaluates the usefulness of scanner data in determining the elasticity of substitution for selected homogeneous scanner product groups. It also verifies the hypothesis that as the level of data aggregation increases, the elasticity of substitution decreases. The main purpose of the article is to evaluate selected methods of estimating the elasticity of substitution based on scanner data. It also detects the main factors that affect the quality of this estimation. There are few articles in the literature that examine the effectiveness of methods for estimating elasticities of substitution for scanner products. For example, Can (2022) examined elasticities of substitution for four groups of scanner products (soda, dairy, coffee, cheese) but considered only one estimation method and only one data aggregation level. Jeon et al. (2023) examine the impact of scanner data on demand elasticity estimates and propose methods to adjust these estimates for sales policy analyses. An interesting result from Jeon et al. (2023) is that scanner data generate statistically different elasticities (with more elastic demand) than other data types. Liesbeth et al. (2018), on the other hand, analyse income elasticities for food, calories and nutrients using a regression approach, which is similar to what we call the panel regression approach in this paper (see Sect. 3.2). Finally, paper [?] discusses improving the method of estimating the elasticity of substitution by approximating the Fisher index with the CES index. This is the approach that we have labelled the M-LM method in this paper (see Sect. 3.1) and included among the algebraic methods.
This article, compared to the works cited above, expands the spectrum of methods used to estimate elasticity of substitution and additionally analyzes the impact of both the level of data aggregation and the data filters used on the final results of the elasticity of substitution estimates.
The article is organised as follows: Sect. 2 presents the theoretical background for the Cost of Living Index. Section 3 describes two groups of methods for estimating the elasticity of substitution: an algebraic approach and a panel regression approach. Section 4 empirically compares all discussed methods of estimating the elasticity of substitution. Section 5 lists the most important conclusions of the research.
2 Concept of the cost of living index
2.1 Theoretical background
The theory of the Cost of Living Index (COLI) for a single consumer (or household) was first developed by Konüs (1924), the Russian economist. In his economic approach, the period t quantity vector \(q^{t}=[q_{1}^{t}, q_{2}^{t},\ldots ,q_{n}^{t}]\) is determined by the consumer’s preference function f and the period t price vector \(p^{t}=[p_{1}^{t}, p_{2}^{t},\ldots ,p_{n}^{t}]\) that the consumer faces when observing n commodities or items. The consumer’s preferences over the given consumption vector q are assumed to be represented by a continuous, non-decreasing and concave utility function f (International Labour Office 2004). The main assumption here is that the consumer minimises the cost of achieving the period t utility level \(u^{t} \equiv f(q^{t})\) for the base period \(t=0\) and the current period \(t=1\) and consequently the consumer’s cost function is defined as follows:
$$\begin{aligned} C(u^{t},p^{t}) \equiv \min _{q} \left\{ \sum _{i=1}^{n} p_{i}^{t}q_{i}: f(q) \ge u^{t} \right\} = \sum _{i=1}^{n} p_{i}^{t}q_{i}^{t}, \quad t=0,1. \end{aligned}$$
(1)
The Konüs family of true cost of living indices that compare the current period with the base one, is defined as the ratio of minimum costs of achieving the same utility level \(u \equiv f(q)\):
$$\begin{aligned} P_{K}(p^{0},p^{1},q) \equiv \frac{C(f(q),p^{1})}{C(f(q),p^{0})}, \end{aligned}$$
(2)
where \(q=(q_{1},q_{2},\ldots ,q_{n})\) is a positive reference quantity vector. Assuming additionally that function f is linearly homogeneous, i.e., it satisfies \(f(\lambda q)=\lambda f(q)\) for any positive \(\lambda \), which is known in the economic literature as homothetic preferences, we get \(C(u,p)=uc(p)\), where the unit cost function is defined as \(c(p) \equiv C(1,p)\). The homothetic preferences assumption simplifies the family of true cost of living indices as follows (International Labour Office 2004):
$$\begin{aligned} P_{K}(p^{0},p^{1},q)=\frac{c(p^{1})}{c(p^{0})}. \end{aligned}$$
(3)
Similarly, under the homothetic preferences assumption, the implicit quantity index that corresponds to the true cost of living index (3) is the utility ratio \(f(q^{1})/f(q^{0})\). Under all the above-mentioned assumptions, let us also assume that the consumer has the following quadratic mean of order r utility function:
$$\begin{aligned} f^{r}(q)=\root r \of {\sum _{i=1}^{n} \sum _{k=1}^{n} a_{ik}q_{i}^{r/2}q_{k}^{r/2}}. \end{aligned}$$
(4)
where the parameters \(a_{ik}\) satisfy \(a_{ik}=a_{ki}\) for any i and k, and \(r \ne 0\). Diewert (1976) demonstrated that the utility function \(f^{r}\) is a flexible functional form, i.e., it can approximate an arbitrary twice-differentiable, linearly homogeneous functional form to the second order.
Let us define the quadratic mean of order r quantity index as follows:
$$\begin{aligned} Q^{r}(p^{0},p^{1},q^{0},q^{1})=\frac{\root r \of {\sum _{i=1}^{n} s_{i}^{0} (\frac{q_{i}^{1}}{q_{i}^{0}}) ^ {r/2} }}{\root r \of {\sum _{i=1}^{n} s_{i}^{1} (\frac{q_{i}^{1}}{q_{i}^{0}}) ^ {-r/2}}}, \end{aligned}$$
(5)
where \(s_{i}^{t}=p_{i}^{t}q_{i}^{t}/\sum _{k=1}^{n} p_{k}^{t}q_{k}^{t}\) denotes the expenditure share of the i-th commodity in period t.
Since the quantity index \(Q^{r}\) is exact for the functional form \(f^{r}\), following Diewert's terminology (see (Diewert 1976) and also earlier work (Fisher 1922)), \(Q^{r}\) is a superlative quantity index for each \(r \ne 0\). For each quantity index (5), we can define the corresponding implicit quadratic mean of order r price index as follows:
$$\begin{aligned} P_{im}^{r}(p^{0},p^{1},q^{0},q^{1})=\frac{\sum _{i=1}^{n} p_{i}^{1}q_{i}^{1}}{\sum _{i=1}^{n} p_{i}^{0}q_{i}^{0} \; Q^{r}(p^{0},p^{1},q^{0},q^{1})}. \end{aligned}$$
It can be demonstrated that the \(P_{im}^{r}\) price index is also a superlative index formula for each \(r \ne 0\). To establish this, it is necessary to prove its exactness for a flexible functional form \(c_{im}^{r}\), where \(c_{im}^{r}\) is the unit cost function that corresponds to the aggregator function \(f^{r}\) (International Labour Office 2004). In analogy to the quadratic mean of order r quantity index (5), one can also define the quadratic mean of order r price index:
$$\begin{aligned} P^{r}(p^{0},p^{1},q^{0},q^{1})=\frac{\root r \of {\sum _{i=1}^{n} s_{i}^{0} (\frac{p_{i}^{1}}{p_{i}^{0}}) ^ {r/2} }}{\root r \of {\sum _{i=1}^{n} s_{i}^{1} (\frac{p_{i}^{1}}{p_{i}^{0}}) ^ {-r/2}}}. \end{aligned}$$
(9)
It is also a superlative price index formula (Diewert 1976; International Labour Office 2004).
2.2 Proxies for the cost of living
In the economic and statistical literature, superlative indices are considered to be the best approximation of the Cost of Living Index (White 1999). From a theoretical standpoint, superlative indices should also best measure the CPI, which is used in practice to approximate the “true” COLI (von der Lippe 2007).
It is known (International Labour Office 2004) that \(P_{im}^{1}=P_{W}\) and \(P_{im}^{2}=P_{F}\), where \(P_{W}\) and \(P_{F}\) denote the Walsh price index (Walsh 1901) and the Fisher price index (Fisher 1922) respectively. It can also be shown that \(P^{r \rightarrow 0}=P_{T}\) and \(P^{2}=P_{F}\), where \(P_{T}\) denotes the Törnqvist price index (Törnqvist 1936). Since all the indices above are superlative, they can be considered good proxies for the COLI. With the notions introduced in Sect. 2.1, the formulas of these superlative indices can be expressed as follows:
$$\begin{aligned} P_{W}=\frac{\sum _{i=1}^{n} p_{i}^{1}\sqrt{q_{i}^{0}q_{i}^{1}}}{\sum _{i=1}^{n} p_{i}^{0}\sqrt{q_{i}^{0}q_{i}^{1}}}, \end{aligned}$$
(10)
$$\begin{aligned} P_{T}=\prod _{i=1}^{n} \left( \frac{p_{i}^{1}}{p_{i}^{0}}\right) ^{\frac{s_{i}^{0}+s_{i}^{1}}{2}}, \end{aligned}$$
(11)
$$\begin{aligned} P_{F}=\sqrt{P_{La}P_{Pa}}, \end{aligned}$$
(12)
where \(P_{La}\) and \(P_{Pa}\) denote the Laspeyres (1871) and Paasche (1874) price index respectively.
Considering the class of indices presented in (9) it was proven (Hill 2006) that all the commonly used superlative indices, i.e. \(P_{W}, P_{T}\) and \(P_{F}\), fall into the interval \(0 \le r \le 2\). It was also demonstrated that these superlative indices approximate each other (Diewert 1976). Thus, the remainder of the paper uses each index to determine the elasticity of substitution. It turns out that the choice of the superlative index in the “index” approach to determining the elasticity of substitution can make a difference (see Sects. 3.1 and 4).
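The relationships between these formulas can be checked numerically. The following Python sketch is purely illustrative and is not taken from the paper: the function names and the three-product data set are made up. It computes the Laspeyres, Paasche, Fisher, Törnqvist and Walsh indices and the quadratic mean of order r price index from (9), and it can be used to confirm that \(P^{2}\) coincides with the Fisher index and that \(P^{r}\) approaches the Törnqvist index for r close to 0.

```python
import math

def shares(p, q):
    """Expenditure shares s_i = p_i*q_i / sum_k p_k*q_k."""
    total = sum(pi * qi for pi, qi in zip(p, q))
    return [pi * qi / total for pi, qi in zip(p, q)]

def laspeyres(p0, p1, q0, q1):
    return sum(a * b for a, b in zip(p1, q0)) / sum(a * b for a, b in zip(p0, q0))

def paasche(p0, p1, q0, q1):
    return sum(a * b for a, b in zip(p1, q1)) / sum(a * b for a, b in zip(p0, q1))

def fisher(p0, p1, q0, q1):
    return math.sqrt(laspeyres(p0, p1, q0, q1) * paasche(p0, p1, q0, q1))

def tornqvist(p0, p1, q0, q1):
    s0, s1 = shares(p0, q0), shares(p1, q1)
    return math.exp(sum(0.5 * (a + b) * math.log(y / x)
                        for a, b, x, y in zip(s0, s1, p0, p1)))

def walsh(p0, p1, q0, q1):
    w = [math.sqrt(a * b) for a, b in zip(q0, q1)]
    return sum(p * x for p, x in zip(p1, w)) / sum(p * x for p, x in zip(p0, w))

def quadratic_mean(p0, p1, q0, q1, r):
    """Quadratic mean of order r price index, Eq. (9); requires r != 0."""
    s0, s1 = shares(p0, q0), shares(p1, q1)
    num = sum(s * (y / x) ** (r / 2) for s, x, y in zip(s0, p0, p1)) ** (1 / r)
    den = sum(s * (y / x) ** (-r / 2) for s, x, y in zip(s1, p0, p1)) ** (1 / r)
    return num / den

# Hypothetical prices and quantities for three products in periods 0 and 1.
P0, P1 = [2.0, 3.5, 1.2], [2.2, 3.4, 1.5]
Q0, Q1 = [100.0, 40.0, 250.0], [90.0, 45.0, 200.0]

for name, fn in [("Laspeyres", laspeyres), ("Paasche", paasche),
                 ("Fisher", fisher), ("Tornqvist", tornqvist), ("Walsh", walsh)]:
    print(name, round(fn(P0, P1, Q0, Q1), 6))
print("P^2", round(quadratic_mean(P0, P1, Q0, Q1, 2.0), 6))
```

On data like these, the three superlative indices differ only in the later decimal places, which is the sense in which they approximate each other.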
2.3 CES model of consumer preferences and CES index
Let us consider the following unit cost function, which depends on prices and the additional parameter \(\sigma \):
$$\begin{aligned} c(p)=\alpha _{0}\left( \sum _{i=1}^{n} \alpha _{i}p_{i}^{1-\sigma }\right) ^{\frac{1}{1-\sigma }}, \quad \sigma \ne 1. \end{aligned}$$
(13)
The unit cost function defined in (13) corresponds to the constant elasticity of substitution (CES) aggregator function, which was introduced in Arrow et al. (1961).
Parameter \(\sigma \) of the CES cost function (13) corresponds to the elasticity of substitution in the following CES utility function:
The elasticity of a function of a single variable measures the percentage response of a dependent variable to a percentage change in the independent variable. By contrast, the elasticity of substitution between two inputs measures the percentage response of the relative marginal utility of the two goods to a percentage change in the ratio of their quantities (Saito 2012).
From Eq. (14), we conclude that the elasticity of substitution \(\sigma \) takes values from 0 to +\(\infty \). If \(\sigma \rightarrow 0^{+}\), the unit cost function (13) becomes linear and corresponds to the Leontief utility function, which exhibits zero substitutability between goods. If \(\sigma \rightarrow 1\), the corresponding utility function is a Cobb-Douglas function. As \(\sigma \) tends to \(+\infty \), the utility function (14) is characterised by perfect substitutability between each pair of inputs (Diewert and Fox 2022).
The CES cost function does not have a fully flexible functional form (if the number of commodities exceeds two), but it is frequently used to aggregate commodities in a group of goods which are thought to be highly substitutable with each other (International Labour Office 2004).
In the rest of the paper, we will refer to the price index based on the elasticity of substitution as the CES index. This index requires estimation because its formula contains an unknown parameter, which is in line with the literature (e.g. Lent and Dorfman (2009)). Let us first consider the Lloyd-Moulton index (Lloyd 1975; Moulton 1996), which has the following form for \(\sigma \ne 1\):
$$\begin{aligned} P_{LM}(\sigma )=\left\{ \sum _{i=1}^{n} s_{i}^{0}\left( \frac{p_{i}^{1}}{p_{i}^{0}}\right) ^{1-\sigma }\right\} ^{\frac{1}{1-\sigma }}. \end{aligned}$$
(15)
It can be proven (International Labour Office 2004) that the Lloyd-Moulton index defined in (15) is accurate for CES preferences under the assumption of consumer cost-minimising behaviour, i.e. it holds that
$$\begin{aligned} P_{LM}(\sigma )=\frac{c(p^{1})}{c(p^{0})}. \end{aligned}$$
(16)
Note that \(P_{LM}(0)=P_{La}\) and that the Lloyd-Moulton price index (i.e., the CES index) uses the same price and quantity information as the Laspeyres index (current-period weights are unnecessary), making it potentially of great practical value to statistical offices. With a correct estimate of the elasticity of substitution \(\sigma \), assuming that it does not change too rapidly over time, the CES index should be a good proxy for the COLI.
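The exactness property can be illustrated with a small numerical experiment. The sketch below is an illustration only, under the assumption that the CES unit cost function has the standard form \(c(p)=(\sum _{i} \alpha _{i}p_{i}^{1-\sigma })^{1/(1-\sigma )}\) (with the scale constant set to 1); the parameter values, prices and function names are hypothetical. Base-period expenditure shares are generated from cost-minimising CES demand, and the Lloyd-Moulton index evaluated at the true \(\sigma \) is compared with the ratio of unit costs.

```python
import math

def ces_unit_cost(p, alpha, sigma):
    """Assumed CES unit cost: (sum_i alpha_i * p_i^(1-sigma))^(1/(1-sigma)), sigma != 1."""
    return sum(a * x ** (1 - sigma) for a, x in zip(alpha, p)) ** (1 / (1 - sigma))

def ces_shares(p, alpha, sigma):
    """Expenditure shares implied by cost-minimising CES demand (Shephard's lemma)."""
    w = [a * x ** (1 - sigma) for a, x in zip(alpha, p)]
    total = sum(w)
    return [v / total for v in w]

def lloyd_moulton(p0, p1, s0, sigma):
    """Lloyd-Moulton index with base-period shares s0; geometric limit at sigma = 1."""
    rel = [y / x for x, y in zip(p0, p1)]
    if abs(sigma - 1.0) < 1e-8:
        return math.exp(sum(s * math.log(r) for s, r in zip(s0, rel)))
    return sum(s * r ** (1 - sigma) for s, r in zip(s0, rel)) ** (1 / (1 - sigma))

# Hypothetical preference parameters and prices.
ALPHA = [0.5, 0.3, 0.2]
SIGMA = 1.7
P0 = [2.0, 3.5, 1.2]
P1 = [2.2, 3.4, 1.5]

S0 = ces_shares(P0, ALPHA, SIGMA)
true_coli = ces_unit_cost(P1, ALPHA, SIGMA) / ces_unit_cost(P0, ALPHA, SIGMA)
laspeyres = sum(s * y / x for s, x, y in zip(S0, P0, P1))

print("true COLI      ", true_coli)
print("P_LM(sigma)    ", lloyd_moulton(P0, P1, S0, SIGMA))
print("P_LM(0) vs P_La", lloyd_moulton(P0, P1, S0, 0.0), laspeyres)
```

Only base-period shares enter the calculation, which is why the index is attractive when current-period consumption is unknown; the price paid is that \(\sigma \) must be supplied from elsewhere.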
Another index in the literature that approximates the COLI using weights only from the base period is the AG Mean index (Lent and Dorfman 2009). A rolling procedure is used to determine the parameter in the formula of this index, which must be constantly updated with new data. However, despite its computational simplicity and certain advantages, this index will not be considered in the paper because it is not derived directly from the economic approach.
3 Estimates of the elasticity of substitution
Although there are many methods for estimating the elasticity of substitution, this paper uses two methods, or rather groups of methods, that appear most frequently in the literature (Ivancic et al. 2010; Haan et al. 2010). The first group compares bilateral indices, at least one of which depends on the elasticity of substitution. The second group uses a panel regression model that explains changes in the level of consumption through price changes based on the CES assumption. The first methods, which, following (Ivancic et al. 2010) we will refer to as the algebraic approach, compare only the current period with the base period in determining the elasticity of substitution. By contrast the second methods (the panel regression approach) consider the entire time interval between the base period and the current period.
3.1 The algebraic approach
The first method presented here was described by Haan et al. (2010), among others. Determining the elasticity of substitution for periods 0 and 1 involves comparing the CES index (15) and its "current weight counterpart" \(P_{CW}(\sigma )\), where:
$$\begin{aligned} P_{CW}(\sigma )=\left\{ \sum _{i=1}^{n} s_{i}^{1}\left( \frac{p_{i}^{1}}{p_{i}^{0}}\right) ^{-(1-\sigma )}\right\} ^{\frac{-1}{1-\sigma }}. \end{aligned}$$
(17)
Since the price index \(P_{LM}(\sigma )\) monotonically decreases and \(P_{CW}(\sigma )\) monotonically increases as \(\sigma \) increases, there is a unique value \(\sigma _{0}\) for which it holds that \(P_{LM}(\sigma _{0})=P_{CW}(\sigma _{0})\). Consequently, we have
$$\begin{aligned} P_{LM}(\sigma _{0})=\left[ P_{LM}(\sigma _{0})P_{CW}(\sigma _{0})\right] ^{\frac{1}{2}}=\frac{\left[ \sum _{i=1}^{n} s_{i}^{0} (\frac{p_{i}^{1}}{p_{i}^{0}}) ^ {1-\sigma _{0}}\right] ^{\frac{1}{2(1-\sigma _{0})}}}{\left[ \sum _{i=1}^{n} s_{i}^{1} (\frac{p_{i}^{1}}{p_{i}^{0}}) ^ {-(1-\sigma _{0})}\right] ^{\frac{1}{2(1-\sigma _{0})}}}, \end{aligned}$$
(18)
where the right side of Eq. (18) is the quadratic mean of order r price index \(P^{r}(p^{0},p^{1},q^{0},q^{1})\) defined in (9), evaluated at \(r=2(1-\sigma _{0})\). Thus, we conclude that for the designated \(\sigma _{0}\) value, the CES index is superlative and can be treated as a good approximation of the Cost of Living Index. Note, however, that although we use the simplifying assumption that \(\sigma _{0}\) is constant over time (regardless of the temporal distance between periods 0 and 1), we can only determine this parameter numerically. In the remainder of this paper, we will denote the method of estimating the elasticity of substitution described above as M-LM.
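A minimal numerical sketch of the M-LM idea is given below. It is not the authors' implementation: the bisection solver, the search bracket and all names are assumptions made for illustration, and the test data are synthetic shares generated from CES preferences with a known elasticity so that the recovered value can be checked. Both indices are written as weighted power means of the price relatives, with order \(1-\sigma \) for \(P_{LM}\) and order \(-(1-\sigma )\) for its current weight counterpart.

```python
import math

def power_mean(rel, w, k):
    """Weighted power mean of order k of price relatives (geometric mean near k = 0)."""
    if abs(k) < 1e-8:
        return math.exp(sum(wi * math.log(r) for wi, r in zip(w, rel)))
    return sum(wi * r ** k for wi, r in zip(w, rel)) ** (1 / k)

def lloyd_moulton(rel, s0, sigma):
    return power_mean(rel, s0, 1 - sigma)

def current_weight(rel, s1, sigma):
    return power_mean(rel, s1, -(1 - sigma))

def sigma_m_lm(rel, s0, s1, lo=-10.0, hi=10.0, tol=1e-12):
    """Solve P_LM(sigma) = P_CW(sigma) by bisection.

    P_LM decreases and P_CW increases in sigma, so their difference is
    decreasing and has at most one root in [lo, hi].
    """
    def gap(s):
        return lloyd_moulton(rel, s0, s) - current_weight(rel, s1, s)
    if gap(lo) < 0 or gap(hi) > 0:
        raise ValueError("root is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def ces_shares(p, alpha, sigma):
    """Synthetic expenditure shares under CES preferences (illustration only)."""
    w = [a * x ** (1 - sigma) for a, x in zip(alpha, p)]
    total = sum(w)
    return [v / total for v in w]

# Hypothetical data: shares in both periods come from a CES consumer with sigma = 1.7.
ALPHA, SIGMA_TRUE = [0.5, 0.3, 0.2], 1.7
P0, P1 = [2.0, 3.5, 1.2], [2.2, 3.4, 1.5]
REL = [y / x for x, y in zip(P0, P1)]
S0, S1 = ces_shares(P0, ALPHA, SIGMA_TRUE), ces_shares(P1, ALPHA, SIGMA_TRUE)

sigma_hat = sigma_m_lm(REL, S0, S1)
print("estimated sigma:", sigma_hat)
```

With real scanner data the shares do not come from an exact CES model, so the root is only an estimate and, as noted above, may change with the chosen pair of periods.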
Another method of estimating the elasticity of substitution is to "simply" compare the Lloyd-Moulton index to one of the superlative indexes (see e.g. (Ivancic et al. 2010)). In particular, solving numerically (with respect to \(\sigma \)) the equations \(P_{LM}(\sigma )=P_{F}\), \(P_{LM}(\sigma )=P_{T}\), and \(P_{LM}(\sigma )=P_{W}\), we get the methods labelled M-F, M-T and M-W respectively.
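The following sketch shows how such equations could be solved numerically. It is a hedged illustration rather than the procedure used in the paper: the bisection routine, its bracket and the toy data are assumptions, and the helper names are hypothetical. Because \(P_{LM}(\sigma )\) is decreasing in \(\sigma \), each target index that lies between the values of \(P_{LM}\) at the ends of the bracket yields exactly one solution.

```python
import math

def shares(p, q):
    total = sum(a * b for a, b in zip(p, q))
    return [a * b / total for a, b in zip(p, q)]

def lloyd_moulton(rel, s0, sigma):
    """Lloyd-Moulton index from price relatives and base shares."""
    if abs(sigma - 1.0) < 1e-8:
        return math.exp(sum(s * math.log(r) for s, r in zip(s0, rel)))
    return sum(s * r ** (1 - sigma) for s, r in zip(s0, rel)) ** (1 / (1 - sigma))

def solve_sigma(rel, s0, target, lo=-10.0, hi=10.0, tol=1e-12):
    """Find sigma with P_LM(sigma) = target by bisection (P_LM is decreasing)."""
    if lloyd_moulton(rel, s0, lo) < target or lloyd_moulton(rel, s0, hi) > target:
        raise ValueError("target index is outside the bracket")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if lloyd_moulton(rel, s0, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical prices and quantities for three products.
P0, P1 = [2.0, 3.5, 1.2], [2.2, 3.4, 1.5]
Q0, Q1 = [100.0, 40.0, 250.0], [90.0, 45.0, 200.0]
REL = [y / x for x, y in zip(P0, P1)]
S0, S1 = shares(P0, Q0), shares(P1, Q1)

# Target superlative indices.
p_la = sum(s * r for s, r in zip(S0, REL))
p_pa = 1.0 / sum(s / r for s, r in zip(S1, REL))
w = [math.sqrt(a * b) for a, b in zip(Q0, Q1)]
targets = {
    "M-F": math.sqrt(p_la * p_pa),
    "M-T": math.exp(sum(0.5 * (a + b) * math.log(r) for a, b, r in zip(S0, S1, REL))),
    "M-W": sum(p * x for p, x in zip(P1, w)) / sum(p * x for p, x in zip(P0, w)),
}

sigma_hat = {m: solve_sigma(REL, S0, t) for m, t in targets.items()}
for method, value in sigma_hat.items():
    print(method, round(value, 4))
```

Since the three targets differ slightly, the three estimates differ as well, which mirrors the observation that the choice of superlative index can matter.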
Balk (2000) suggested that the Sato-Vartia index (although not superlative—see Sato (1976); Vartia (1976)) can also be used to estimate the elasticity of substitution. The method of estimating the elasticity of substitution based on the Sato-Vartia index will be denoted by M-SV.
3.2 Panel regression approach
The theoretical foundation implies the notion of representative consumers. In other words, it is as if a single consumer buys all goods consumed in this market, substituting one good for another according to a rational decision resulting from optimising a utility function. This representative consumer is a price taker on the market. Although these hypotheses are somewhat unrealistic, the operating conditions are not very different when considering micro markets. To adopt a more realistic framework, we must work on a local market where consumers choose between really substitutable products (e.g., yoghurts sold in specific chain stores). In this market, the assumptions relative to the representative consumer are acceptable since it is possible to substitute a good for another. Moreover, for the consumer who consumes a good in a specific store, it is easier to substitute that good for another in this store rather than buy it in another store. Thus, an aggregate model of behaviour on the market of yoghurts sold in this store, if not the result of the aggregation of individual preferences, is nevertheless a plausible picture of how the micro-market works (Sillard 2012; Leclair et al. 2019).
One can apply a panel regression approach to extract the elasticities of substitution between quantities of goods under consideration and their prices. The most straightforward hedonic regression shows the relation between price (or log price) on the one hand and product characteristics on the other. Diewert (2003) develops sufficient conditions to allow a hedonic regression to be interpreted as a function of consumer preferences.
In the panel regression approach, we consider the demand function based on cost and utility functions (in logs) as follows:
The equation does not consider the household income level, the prices of other goods or the consumption structure. The unknown coefficient \(\beta _{0}\) captures the \(\sigma \) and \(\alpha \) parameters from Eq. (13), and \(\epsilon _{it}\) is the error term. The use of panel data implies the heterogeneity of the units analysed. Heterogeneity can be accounted for by fixed effects (FE), captured by parameters on dummy variables relating to the units. In the FE approach, individual effects can be thought of as different intercepts in the time series for individual units: they have the character of time-constant, deterministic parameters that vary across individuals i. In contrast, unobserved heterogeneity in a random effects (RE) model can be captured differently. One assumption is that the intercept in the regression model varies randomly over the cross-section. Another assumption is that the random component of the regression equation is the sum of several components, including a random component that relates to a single unit. In demand analysis, the parameters of the distribution of this random component are interpreted as a random variation in preferences (Cameron and Trivedi 2010). Heterogeneity needs attention in contemporary microeconomic research: in general, there is more heterogeneity than studies assume, and, in addition, incorrect treatment of heterogeneity can lead to errors when estimating effects that are of interest to researchers (Browning and Carro 2007). The nature of these effects in a random effects model means that it is possible to include an intercept in a random effects model, which is not the case with a fixed effects model. Individual effects are treated as deterministic, estimable parameters in the fixed effects model. In the random effects model, they are treated as random. As such, they do not constitute additional, potentially estimable parameters but extend the stochastic part of the model (Berry and Haile 2021).
An issue that should be addressed before proceeding is the potential endogeneity between prices and quantities (or expenditure). In the economic approach to index numbers, a consumer (or household) solves a utility maximisation problem. The consumer chooses a bundle of goods to maximise utility subject to some budget constraint. In this framework, prices are assumed to be exogenous. In essence, households regard the observed price data as given, while the quantity data are regarded as solutions to the various optimisation problems. This assumption is reasonable for our analysis as we use data on items that are bought in a store, where prices are generally taken as given by consumers (Ivancic et al. 2010).
However, it can be argued that prices are not exogenous. The primary assumption in the methods above is that all prices \(p_i\) are exogenous, yet it is very likely that they are endogenous (Freyberger 2015). One can imagine that the seller sets the price according to the demand response he expects. This generates a simultaneity bias in ordinary least squares (OLS) and related estimators (FE, RE). We then need an instrument to resolve this problem. The structural equation we are interested in is a demand equation. To identify this equation, we look for an exogenous instrument for the price in a demand equation (Birchall and Verboven 2022; Rilstone 1994). Adding the pricing equation clarifies questions about the optimal choice of instruments for our problem (Chamberlain 1987). The optimal instruments for the price will depend on the characteristics of rival products, but semiparametric analysis of optimal instruments is difficult, if not impossible (see (Newey 1990; Berry et al. 2004)). The correct choice in our case seems to be to use variables related to production costs as instrumental variables and to estimate the parameters by two-stage least squares (2SLS) (Petrin and Train 2010; Beck and Lein 2020).
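The simultaneity problem and the role of a cost-based instrument can be illustrated with a small simulation. The sketch below is an assumption-laden toy example, not the model estimated in the paper: it uses a stylised log-linear demand equation \(\ln q = \beta _{0} - \sigma \ln p + \epsilon \), a single simulated cost shifter as the instrument, and hypothetical coefficient values. With one endogenous regressor and one instrument, the 2SLS slope reduces to the ratio of two covariances, which is what the code computes.

```python
import random

def cov(x, y):
    """Sample covariance (population form; the scaling cancels in the ratios below)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

random.seed(12345)
N = 20000
SIGMA_TRUE = 2.0  # hypothetical elasticity used to simulate the data

ln_p, ln_q, cost = [], [], []
for _ in range(N):
    z = random.gauss(0.0, 1.0)  # cost shifter: moves price, unrelated to demand shocks
    e = random.gauss(0.0, 1.0)  # demand shock, observed by the seller but not by us
    # The seller raises the price when demand is expected to be high (simultaneity).
    lp = 0.5 * z + 0.3 * e + random.gauss(0.0, 0.1)
    lq = 1.0 - SIGMA_TRUE * lp + e
    cost.append(z)
    ln_p.append(lp)
    ln_q.append(lq)

# OLS slope of ln q on ln p: biased because ln p is correlated with e.
sigma_ols = -cov(ln_p, ln_q) / cov(ln_p, ln_p)
# IV / 2SLS slope with the cost shifter as the instrument.
sigma_iv = -cov(cost, ln_q) / cov(cost, ln_p)

print("sigma (OLS):", round(sigma_ols, 3))
print("sigma (IV): ", round(sigma_iv, 3))
```

In this setup the OLS estimate understates the elasticity because high-demand periods coincide with high prices, while the instrumented estimate stays close to the value used in the simulation.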
A synthetic comparison of the algebraic methods and the method using panel regression is summarized in Table 1.
Table 1 Comparison of properties of algebraic methods and a method using panel regression to estimate elasticity of substitution

Property: Calculation time and simplicity of implementation
Algebraic approach: Algebraic methods require less data than the panel regression approach because they only use data from the current and base periods. Consequently, algebraic methods are faster to estimate and simpler to implement
Panel regression approach: The panel approach requires a time-series sample, so we need more data here to estimate the elasticity of substitution than with algebraic methods. Parameter estimation here is slower and more complex, as it involves, among other things, choosing the right panel model and estimation method

Property: Stability of results due to new data
Algebraic approach: When we expand the database to include data from the next month, algebraic methods can produce markedly different results from those previously determined (for the previous month). This phenomenon will be more pronounced the lower the level of data aggregation
Panel regression approach: Shifting the time window over which a panel regression model operates by one month usually does not lead to clear changes in the estimated elasticity of substitution (within the chosen and established model and estimation method)

Property: Stability of results due to the choice of estimation method
Algebraic approach: Algebraic methods generally lead to much less varied values of the estimated elasticity of substitution than methods based on a panel regression approach
Panel regression approach: The results of estimating elasticities of substitution depend very much on the panel regression model and estimation method adopted
4 Empirical study
4.1 Scanner data sets description
In the following empirical study, we used scanner data from one retail chain in Poland, i.e., monthly data on yoghurt (COICOP 5: 011441) and other articles for personal hygiene (COICOP 5: 121322) sold in over 500 outlets between December 2021 and December 2022 (i.e., 604,260 and 364,984 records, which means 36.4 MB and 19.7 MB of data in the CSV format, respectively). The COICOP 5 yoghurt group consisted of the following local COICOP 6 product subgroups: Actimel (19 IDs), chocolate and walnut yoghurt (6 IDs), natural yoghurt (75 IDs), fruit yoghurt (119 IDs), and drinking yoghurt (165 IDs). The COICOP 5 other articles for personal hygiene group consisted of the following local COICOP 6 product subgroups: tissues (60 IDs), wet wipes (88 IDs), toilet paper (117 IDs), baby diapers (193 IDs), sanitary pads (20 IDs), sanitary napkins (67 IDs), and tampons (22 IDs).
Before calculating the elasticities of substitution, the data sets were carefully prepared. Products were classified using the data_selecting() and data_classification() functions from the PriceIndices R package (Białek 2021). The first function required the manual preparation of dictionaries of keywords and phrases that identified individual product groups. The second function was used for problematic, previously unclassified products and required manual preparation of learning samples based on historical data. The classification itself was based on machine learning using random trees and the XGBoost algorithm (Tianqi and Carlo 2016). Next, product matching was carried out based on the available GTIN (Global Trade Item Number) bar codes, internal retail chain codes and product labels. To match products, we used the data_matching() function from the PriceIndices package. To be more precise, products with two identical codes, or with one identical code and an identical description, were automatically matched. Products were also matched if one of their codes was identical and the Jaro-Winkler distance (Jaro 1989) between their descriptions was smaller than the fixed precision value of 0.02.
4.2 Elasticity of substitution depending on the estimation method and data filter used
In the first phase of our empirical study, we set out to see how the measurement of elasticity of substitution is affected by the estimation method and the data filters used. The analysis included both the algebraic approach and panel regression approach (see Sect. 3). Three popular data filters were considered: the extreme price filter (F1), the dump price filter (F2) and the low sale filter (F3).
Fig. 1
Elasticity of substitution depending on data filters applied for yoghurt products
Fig. 2
Elasticity of substitution depending on data filters applied for other articles for personal hygiene
There is an ongoing discussion in the literature about whether or not to use data filters for scanner data, and if so, what kind of filters to apply. As a rule, scanner data indices are calculated using a dynamic approach, with most countries opting for the monthly chain Jevons index. This method is commonly referred to as the dynamic method (Eurostat 2018). The dynamic basket is determined using turnover figures of individual products in two adjacent months, i.e., the product is included in the sample if its turnover is above a fixed threshold determined by the number of products in a given product group. Following van Loon and Roels (2018), this rule can be written as the following condition, which indicates whether the i-th product is taken into consideration when comparing months \(t-1\) and t:
$$\begin{aligned} \frac{s_{i}^{t-1}+s_{i}^{t}}{2} > \frac{1}{n \cdot \lambda }, \end{aligned}$$
where \({s_i^{t}}\) denotes the expenditure share of the i-th product at time t, n is the number of considered products and \(\lambda \) is a fixed parameter (usually, and here, set to 1.25). This kind of data filter can be called a low sale filter (F3). Those who support using filters also believe that products that display extreme price changes from one month to another should be excluded from the sample (extreme price filter, F1). In this study, the F1 filter eliminated products whose prices more than tripled or more than halved from month to month. The list of possible data filters is longer. Statistics Belgium, for example, implements a filter for products with a clearly decreasing price and substantially decreasing sales (dump price filter, F2) (van Loon and Roels 2018). The F2 filter was also implemented in this study, with the cutoff for dropping sales being \(-\)30% from month to month.
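The three filters can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: the data are made up, the low sale rule is assumed to compare the average expenditure share in the two months with \(1/(n \cdot \lambda )\), "sales" in the dump price filter are taken to mean turnover, and the price cutoff of the dump price filter (price_drop) is a hypothetical parameter because the text only states the sales cutoff.

```python
# Each month maps a product id to (price, turnover); all values are made up.
prev = {"A": (1.00, 100.0), "B": (2.00, 100.0), "C": (1.00, 100.0), "D": (1.00, 2.0)}
cur = {"A": (1.05, 110.0), "B": (7.00, 90.0), "C": (0.70, 50.0), "D": (1.00, 2.0)}


def extreme_price_filter(prev, cur, upper=3.0, lower=0.5):
    """F1: keep products whose price neither more than tripled nor more than halved."""
    common = prev.keys() & cur.keys()
    return {i for i in common if lower <= cur[i][0] / prev[i][0] <= upper}


def dump_price_filter(prev, cur, price_drop, sales_drop=0.30):
    """F2: drop products whose price and turnover both fall beyond the cutoffs.

    price_drop is an assumed cutoff; only sales_drop (30%) comes from the text.
    """
    kept = set()
    for i in prev.keys() & cur.keys():
        price_fell = cur[i][0] / prev[i][0] < 1.0 - price_drop
        sales_fell = cur[i][1] / prev[i][1] < 1.0 - sales_drop
        if not (price_fell and sales_fell):
            kept.add(i)
    return kept


def low_sale_filter(prev, cur, lam=1.25):
    """F3: keep products whose average expenditure share exceeds 1 / (n * lam)."""
    common = prev.keys() & cur.keys()
    n = len(common)
    total_prev = sum(prev[i][1] for i in common)
    total_cur = sum(cur[i][1] for i in common)
    threshold = 1.0 / (n * lam)
    return {i for i in common
            if (prev[i][1] / total_prev + cur[i][1] / total_cur) / 2 > threshold}


print(sorted(extreme_price_filter(prev, cur)))
print(sorted(dump_price_filter(prev, cur, price_drop=0.10)))
print(sorted(low_sale_filter(prev, cur)))
```

In this toy data set, each filter removes a different product: B (price more than tripled) under F1, C (price and turnover both falling sharply) under F2, and D (negligible expenditure share) under F3.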
For the algebraic methods, the elasticities of substitution for successive months were determined for the two scanner datasets under consideration, with reference to a fixed base of December 2021. The results for these methods (Figs. 1 and 2) show that the M-F method (based on the Fisher index) leads to markedly different (and generally higher) elasticities of substitution than the other methods. This is especially evident for the yoghurt dataset, where the differences between the M-F estimates and those of the other methods, with no filtering applied, exceeded 0.5 in June 2022 (see Fig. 1 and Table 2). For other articles for personal hygiene, the differences in the elasticities of substitution were also close to 0.5, but this occurred in August 2022. This time, the largest values of the measured elasticity of substitution were generated by the M-LM method (see Table 2). Thus, while it can be concluded that, in general, different algebraic methods lead to noticeably different values of the elasticity of substitution, it is not possible to identify a method that always leads to the largest or smallest estimates.
Additionally, the use of filters affects not only the estimated values of the elasticity of substitution, but also the magnitude of the differences between the results of different algebraic methods. For example, the F1 and F2 filters for the data set on yoghurt products (Fig. 1) widen the gap between the estimates obtained by the algebraic methods even further.
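To make the idea of an algebraic method concrete, the sketch below solves for the elasticity that equates the Lloyd-Moulton (CES) index with the Fisher index, which is how we read the M-F method here. That reading is an assumption, since the definitions from Sect. 3.1 are not reproduced in this section. The helper ces_demand(), the bisection bracket and all numbers are hypothetical. The data are generated from CES preferences with a known elasticity of 2, so the recovered value should be close to 2.

```python
import math


def ces_demand(prices, sigma, budget=100.0):
    """Quantities bought by a CES consumer with equal taste weights (toy data)."""
    denom = sum(p ** (1.0 - sigma) for p in prices)
    return [budget * p ** (-sigma) / denom for p in prices]


def lloyd_moulton(p0, pt, q0, sigma):
    """Lloyd-Moulton index with base-period expenditure shares."""
    spend = sum(p * q for p, q in zip(p0, q0))
    shares = [p * q / spend for p, q in zip(p0, q0)]
    relatives = [b / a for a, b in zip(p0, pt)]
    if abs(1.0 - sigma) < 1e-12:  # limiting case: weighted geometric mean
        return math.exp(sum(s * math.log(r) for s, r in zip(shares, relatives)))
    power = 1.0 - sigma
    return sum(s * r ** power for s, r in zip(shares, relatives)) ** (1.0 / power)


def fisher(p0, pt, q0, qt):
    """Fisher index: geometric mean of the Laspeyres and Paasche indices."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    laspeyres = dot(pt, q0) / dot(p0, q0)
    paasche = dot(pt, qt) / dot(p0, qt)
    return math.sqrt(laspeyres * paasche)


def elasticity_mf(p0, pt, q0, qt, lo=0.0, hi=10.0, tol=1e-12):
    """Bisection for sigma such that Lloyd-Moulton(sigma) equals Fisher."""
    target = fisher(p0, pt, q0, qt)

    def gap(sigma):
        return lloyd_moulton(p0, pt, q0, sigma) - target

    if gap(lo) * gap(hi) > 0:
        raise ValueError("no sign change in the bracket [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(lo) * gap(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


p0 = [1.0, 2.0, 4.0]
pt = [1.2, 2.0, 3.6]
q0 = ces_demand(p0, sigma=2.0)
qt = ces_demand(pt, sigma=2.0)
sigma_hat = elasticity_mf(p0, pt, q0, qt)
print(sigma_hat)
```

Bisection is sufficient here because the Lloyd-Moulton index is a weighted power mean of the price relatives and therefore decreases monotonically in sigma. Replacing fisher() with another target index would give the analogous sketch for the other algebraic methods, again under the same assumption about how they are defined.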
Table 2
The biggest differences in CES values depending on the algebraic method and the data set (*)
| Method | Yoghurt (June, 2022) | Other articles for personal hygiene (August, 2022) |
|---|---|---|
| M-LM | 1.071625 | 1.2912369 |
| M-F | 1.579094 | 1.2478638 |
| M-W | 1.095161 | 0.7947540 |
| M-T | 1.077728 | 1.2885284 |
| M-SW | 1.005402 | 0.9116364 |
(*) the case with no filtering
Analogous results for the panel regression approach discussed in Sect. 3.2 are presented in Table 3. The table reports estimates of Eq. (13); hence, when interpreting the sigma parameter, the sign of the coefficient on the log(price) variable should be reversed. Additionally, Appendix A includes analogous estimates obtained with different panel regression methods. The 2SLS method seems to be the most appropriate for the analysed product groups because prices are endogenous; cost indices serve as instrumental variables. Comparing the 2SLS estimates with those of the other methods (OLS, FE or RE) reveals differences which confirm that assuming all prices to be exogenous generates a simultaneity bias. The tests (F-test and Chi-sq) developed by Sanderson and Windmeijer (2016) suggest that the instrumental variables are valid. Clustered standard errors take heteroskedasticity into account. We focus here only on the results for the most disaggregated case, i.e., results obtained at the GTIN level and signified in the figures by the description "before aggregation" (the impact of aggregation on the estimation results is discussed in Sect. 4.3). For yoghurt products, the estimated elasticity of substitution decreased from 1.8 to 1.3\(-\)1.4 after filtering, regardless of the filter used (see Table 3). In the case of other articles for personal hygiene, the estimated sigma decreased from 0.992 to approximately 0.81 after applying the dump price filter or the extreme price filter. With the low sale filter, the elasticity of substitution obtained by the 2SLS method (1.054) was approximately the same as with no filter.
Table 3
CES 2SLS estimates for yoghurts and other articles for personal hygiene under different data filters
Yoghurt
| Filter | Without | Dump price | Extreme price | Low sale |
|---|---|---|---|---|
| log(price) | \(-\)1.801*** | \(-\)1.298*** | \(-\)1.419*** | \(-\)1.396*** |
| | (0.369) | (0.270) | (0.286) | (0.288) |
| SW Chi-sq | 27.09 | 32.58 | 30.8 | 30.49 |
| p-val | 0.000 | 0.000 | 0.000 | 0.000 |
| SW F-stat | 27.02 | 32.48 | 30.71 | 30.4 |
| p-val | 0.000 | 0.000 | 0.000 | 0.000 |
| Product fixed effect | Yes | Yes | Yes | Yes |
| Number of products | 353 | 344 | 349 | 349 |
| Number of observations | 3521 | 3460 | 3497 | 3500 |
Other articles for personal hygiene
| Filter | Without | Dump price | Extreme price | Low sale |
|---|---|---|---|---|
| log(price) | \(-\)0.992*** | \(-\)0.815*** | \(-\)0.813*** | \(-\)1.054*** |
| | (0.198) | (0.192) | (0.193) | (0.245) |
| SW Chi-sq | 389.82 | 420.26 | 414.79 | 333.78 |
| p-val | 0.000 | 0.000 | 0.000 | 0.000 |
| SW F-stat | 389.09 | 419.45 | 413.98 | 333.02 |
| p-val | 0.000 | 0.000 | 0.000 | 0.000 |
| Product fixed effect | Yes | Yes | Yes | Yes |
| Number of products | 532 | 516 | 516 | 440 |
| Number of observations | 5311 | 5155 | 5152 | 3129 |
Clustered standard errors in parentheses; *** \(p<0.01\), ** \(p<0.05\), * \(p<0.1\); SW Chi-sq: Sanderson-Windmeijer first-stage chi-squared test of underidentification; SW F-stat: Sanderson-Windmeijer multivariate F test of excluded instruments
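The logic behind estimates of this kind can be illustrated with a toy example. The sketch assumes a specification in which log quantity is regressed on log price with product fixed effects, so that sigma is minus the price coefficient; this is our reading of the sign convention described in the text, as Eq. (13) is not reproduced in this section. With one endogenous regressor and one instrument, 2SLS reduces to the simple instrumental-variable ratio used below. All numbers are made up, and the demand shock is constructed to be exactly orthogonal to the instrument so that the arithmetic is transparent.

```python
def demean_within(values, ids):
    """Subtract each product's own mean (the fixed-effects 'within' transform)."""
    sums, counts = {}, {}
    for v, i in zip(values, ids):
        sums[i] = sums.get(i, 0.0) + v
        counts[i] = counts.get(i, 0) + 1
    return [v - sums[i] / counts[i] for v, i in zip(values, ids)]


def slope(y, x, z):
    """IV slope with one regressor x and one instrument z (z = x gives OLS)."""
    return sum(a * b for a, b in zip(z, y)) / sum(a * b for a, b in zip(z, x))


TRUE_SIGMA = 1.8
log_cost = [1.0, 2.0, 3.0, 4.0]   # instrument: a cost index common to all products
shock = [1.0, -1.0, -1.0, 1.0]    # demand shock, orthogonal to the demeaned instrument
# Hypothetical products: (price offset, cost pass-through, product effect).
panel = {"p1": (0.0, 1.0, 5.0), "p2": (2.0, 0.5, 3.0)}

ids, y, x, z = [], [], [], []
for pid, (offset, passthrough, alpha) in panel.items():
    for c, u in zip(log_cost, shock):
        log_price = offset + passthrough * c + u  # price reacts to the shock: endogenous
        ids.append(pid)
        x.append(log_price)
        z.append(c)
        y.append(alpha - TRUE_SIGMA * log_price + u)  # log quantity

yw, xw, zw = (demean_within(v, ids) for v in (y, x, z))
sigma_fe = -slope(yw, xw, xw)    # fixed effects, price treated as exogenous
sigma_2sls = -slope(yw, xw, zw)  # fixed effects with the cost index as instrument
print(sigma_fe, sigma_2sls)
```

In this construction the fixed-effects estimate is biased towards zero because the price is correlated with the demand shock, whereas the instrumented estimate recovers the elasticity used to generate the data.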
4.3 Impact of the data aggregation level on CES estimates
The next stage of the study assessed the impact of data aggregation on the elasticity of substitution as measured by various methods. Again, both the algebraic and panel regression-based methods were considered (see Appendix). By the term "before aggregation" we mean the most disaggregated level of data, i.e., the GTIN code level. By contrast, the term "after aggregation" means that we define a homogeneous product at the COICOP 6 group level.
Table 8 presents the elasticities of substitution calculated using algebraic methods before and after aggregation for the two data sets and for two variants, i.e., one where elasticities of substitution were calculated by comparing December 2022 to December 2021, and another where the elasticities were computed for each pair of subsequent months, and the means of these elasticities were compared. Regardless of the method of calculation, the estimated elasticities of substitution are noticeably smaller after data aggregation (as a rule, they are more than twice as small). This is particularly noticeable in the case of other articles for personal hygiene, where the average values of the elasticities of substitution dropped more than fourfold when moving from the GTIN code level to the COICOP 6 level (see Table 8).
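One way to picture the move from the GTIN level to a group level is to replace the individual item prices within a group by a single unit value (group turnover divided by group quantity). The sketch below assumes this unit-value aggregation and quantities expressed in comparable units; both are assumptions made for illustration, and the group names and numbers are hypothetical.

```python
from collections import defaultdict


def aggregate_to_groups(records):
    """records: iterable of (group, price, quantity) for one period.

    Returns a dict mapping group -> (unit value, total quantity).
    """
    turnover = defaultdict(float)
    quantity = defaultdict(float)
    for group, price, qty in records:
        turnover[group] += price * qty
        quantity[group] += qty
    return {g: (turnover[g] / quantity[g], quantity[g]) for g in turnover}


# Made-up GTIN-level observations for one month.
gtin_level = [
    ("plain yoghurt", 2.0, 10.0),
    ("plain yoghurt", 3.0, 10.0),
    ("fruit yoghurt", 4.0, 5.0),
]
group_level = aggregate_to_groups(gtin_level)
print(group_level)
```

After aggregation, substitution between items inside a group is no longer visible in the data, which is consistent with the lower elasticities reported after aggregation.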
With the panel regression approach, the parameters of Eq. (19) were estimated using several methods to determine the value of the sigma parameter and to check whether aggregation within individual ID codes or product groups influences its estimated magnitude. The estimation used standard methods for panel data, i.e., panel methods with fixed effects (FE) and random effects (RE); see Sect. 3.2 for a description. We assumed that price need not be an exogenous variable; hence, we also used the two-stage method (2SLS) with instruments correlated with the explanatory variable. For the yoghurt estimates, the instrument is the price index of the consumer goods milk, cheese, and eggs. For other articles for personal hygiene, the instrument is the price index of sold production in the manufacture of paper and paper products. Both instruments are monthly single-basis indices, which we used to approximate the development of production costs in the respective branches. Both instruments used in the regression equations are valid (Stock and Wright 2000; Stock and Yogo 2002; Cragg and Donald 1993). Additionally, to capture the heterogeneity of the products analysed within a given group, we use statistics robust to heteroskedasticity and to clustering on products and groups (2SLS). Furthermore, following other researchers, we supplement our analysis with estimates of the equation in first differences (FD). However, in this case, the estimates are inferior and are also based on a shorter time series. For the yoghurts group (Table 4), the sigma estimated by the traditional FE and RE methods is underestimated due to endogeneity. Using instrumental variables increases the estimated sigma parameter to 1.8.
Aggregation caused the sigma parameter to be estimated on a relatively small sample (13 periods and 5 groups), which is additionally quite heterogeneous; hence, the estimation of the joint sigma is statistically insignificant (Table 6). We obtained slightly lower (in terms of modulus) sigma parameter estimates for other articles for personal hygiene. Our estimated parameter with instruments in the 2SLS method is 0.992 (Table 3). The post-aggregation estimates are significantly lower (Table 7).
4.4 Impact of the estimation method on CES index values
Section 4.2 showed that different methods of estimating elasticities of substitution can generate noticeably different estimates. Sect. 4.3, in turn, showed empirically that aggregating data leads to smaller values of the estimated elasticities of substitution. This raises the natural question of whether differences between estimates of the elasticity of substitution lead to measurable differences among the CES indices that are based on them (see Sect. 3.1). To answer this question, we determined the CES indices for a full year of observations of the two data sets while taking into account: (a) monthly averaged elasticities of substitution determined for all the algebraic methods considered; (b) elasticities of substitution determined by the method based on panel regression (2SLS), which appears to be the most reliable (see Sect. 5).
Figure 3 presents the above-mentioned CES indices based on elasticities of substitution calculated for non-filtered yoghurt products and for two variants, before and after data aggregation. Figure 4 presents analogous results for other articles for personal hygiene. A surprising conclusion is that, despite the differences in the estimated elasticity of substitution, the differences between the CES indices (the Lloyd-Moulton indices) are generally not noticeable. However, substantial differences between CES indices may occur with disaggregated input data. At the GTIN code level, the estimated values of the CES index for other articles for personal hygiene differ by as much as 0.4 p.p. (June 2022, see Fig. 4).
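The limited sensitivity of the index to the elasticity can be checked with a small numerical sketch of the Lloyd-Moulton formula in its standard form, \(P_{LM} = [\sum _i s_i^0 (p_i^t/p_i^0)^{1-\sigma }]^{1/(1-\sigma )}\), where \(s_i^0\) are base-period expenditure shares. The shares, price relatives and sigma values below are made up for illustration and are not taken from the data sets analysed here.

```python
import math


def lloyd_moulton(base_shares, price_relatives, sigma):
    """Lloyd-Moulton (CES) index for given base shares and price relatives."""
    if abs(1.0 - sigma) < 1e-12:  # limiting case: weighted geometric mean
        return math.exp(sum(s * math.log(r)
                            for s, r in zip(base_shares, price_relatives)))
    power = 1.0 - sigma
    total = sum(s * r ** power for s, r in zip(base_shares, price_relatives))
    return total ** (1.0 / power)


shares = [0.5, 0.3, 0.2]         # hypothetical base-period expenditure shares
relatives = [1.10, 1.00, 0.95]   # hypothetical price relatives p_t / p_0

for sigma in (0.0, 1.0, 1.3, 1.8):
    index = lloyd_moulton(shares, relatives, sigma)
    print(f"sigma = {sigma:.1f}: index = {100 * index:.3f}")
```

With sigma equal to 0 the formula reduces to the Laspeyres index, and the index falls as sigma rises. For moderate price changes such as these, even a sizeable change in sigma moves the index by only a fraction of a percentage point, which is in line with the observation made above.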
Fig. 3
CES indices based on elasticities of substitution calculated for yoghurt products
Fig. 4
CES indices based on elasticities of substitution calculated for other articles for personal hygiene
5 Conclusions
Both the algebraic approach and the panel regression approach are useful and valuable in estimating the elasticity of substitution. However, these approaches differ not only in their computational complexity or the set of data considered, but also in their sensitivity to the choice of data filter or data aggregation level. Although the authors are aware that the results should be treated as preliminary, some observations regarding the methods of estimating the elasticity of substitution are presented as general conclusions below.
Firstly, algebraic methods generally lead to much less varied values of the estimated elasticity of substitution than methods based on a regression approach. As a rule, the maximum difference between the elasticities obtained by algebraic methods does not exceed 0.2\(-\)0.3 (Figs. 1 and 2). At the same time, methods based on panel regression lead to results that differ by a factor of several, regardless of the data filters used (Appendix A). Thus, one should be much more careful when choosing a regression method than when choosing an algebraic method to estimate the elasticity of substitution based on scanner data. The algebraic M-LM method is an appropriate choice because it has the strongest theoretical foundation (see Sect. 3.1). Additionally, it generates stable estimates of the elasticity of substitution regardless of the data filter used. Among the regression methods, the 2SLS method is the most useful because it allows us to account for endogeneity, product or group fixed effects, and the heterogeneity of products and groups. We recognise that our estimates of CES elasticities using panel methods are not weighted. Due to the lack of relevant statistical data, we did not consider the disposable income of our consumer group, the prices of substitute and complementary goods, or volumes realised in other stores.
Secondly, for both algebraic and panel regression-based methods, data aggregation had a clear effect on the results of the elasticity of substitution. Moving from a disaggregated level (GTIN level) to a higher level of aggregation (COICOP level 6) each time led to a several-fold decrease in the value of the elasticity of substitution (Appendix B).
Thirdly, only for disaggregated data is there a concern that different estimates of the elasticity of substitution will lead to significant differences in the CES indices that are based on them. For data aggregated at the COICOP 6 level, the estimated substitution elasticities are lower, and the differences in these estimates obtained by various methods are also smaller. Therefore, the differences in the corresponding CES indices are also less important. The lack of product homogeneity within specific panels is also relevant here. However, results based on disaggregated data are closer to reality than those obtained with aggregated data or broad product definitions.
Finally, it is worth adding that the research results can be important not only for the economy, but also for investment or pension fund managers and directors of banks or financial institutions. Better estimation of inflation benefits the state (indexation of pensions is based on the CPI) and the banking sector (indexation of bank contracts is generally based on the inflation rate). Since the CES index does not require knowledge of the current level of consumption, proper estimation of the elasticity of substitution can give the heads of statistical offices more freedom in the organization of Household Budget Surveys (they can, for example, reduce their frequency and thus reduce the cost of this survey). Knowledge of the magnitude of the substitution effect in particular product segments can also be valuable to managers of retail chains, who can use it in determining the sales mix or in pricing policies (the elasticity of substitution also shows how relative spending on goods changes when their relative prices change). In summary, the results obtained in the paper may find many potential beneficiaries.
However, it is worth noting that the study had certain limitations due to the availability of this type of data and also due to certain methodological assumptions we made. First, the study is limited to two sets of data from one retail chain. For this reason, the results obtained should be considered preliminary, and it is too early to generalize conclusions. Second, due to the volume of data and hardware limitations, the study did not take into account the aggregation of results across the chain’s outlets, i.e., the results from all outlets formed a single, consistent dataset. Third, the study focused on the post-pandemic period; in the pre-pandemic period, perhaps, and in the pandemic period, almost certainly, estimates of the elasticity of substitution would have been quite different. Therefore, the authors intend to continue the study and expand it in the future by including more retail chains and new product groups.
Declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Table 8
Elasticities of substitution calculated before and after data aggregation for two considered data sets and two variants (yearly values vs mean of monthly values, non-filtered data)
Yoghurt (Dec, 2021–Dec, 2022)
| Method | Before aggregation | After aggregation |
|---|---|---|
| M-LM | 1.3160 | 0.7763 |
| M-F | 1.3513 | 0.8593 |
| M-W | 1.3418 | 0.7714 |
| M-T | 1.3157 | 0.7617 |
| M-SV | 1.3320 | 0.7617 |
Other articles for personal hygiene (Dec, 2021–Dec, 2022)
| Method | Before aggregation | After aggregation |
|---|---|---|
| M-LM | 0.5482 | 0.1770 |
| M-F | 0.6080 | 0.1857 |
| M-W | 0.4061 | 0.2059 |
| M-T | 0.4908 | 0.1681 |
| M-SV | 0.4208 | 0.1870 |
Yoghurt (mean of monthly elasticities of substitution)
| Method | Before aggregation | After aggregation |
|---|---|---|
| M-LM | 1.8498 | 1.1744 |
| M-F | 2.4813 | 1.1355 |
| M-W | 1.7611 | 1.0321 |
| M-T | 1.8152 | 1.1392 |
| M-SV | 1.7787 | 1.0557 |
Other articles for personal hygiene (mean of monthly elasticities of substitution)