Open Access 17.05.2025 | Original Article
Variable Selection and Variable Integration for Categorical Dummy Variables in Regression Analysis
Published in: Annals of Data Science
Abstract
1 Introduction
When not only the selection but also the integration of multiple categories as explanatory variables in regression analysis is considered, several cumbersome problems instantly arise. The first issue is the way of coding. It is well known that several coding methods for multiple categorical data exist. Most of the scientific community adopts multiple categorical dummy variables and sometimes excludes one of them to avoid perfect multicollinearity with the constant term in regression analysis. In other works, the application of efficient coding methods such as those by Walter et al. [1], Alkharusi [2], and Venkataramana et al. [3] has been explored. Additionally, when the integration of some categories is considered, the method of coding and the reparameterization of the coefficients for the integrated categories are closely related. In some cases, such as ordered categorical dummy variables or when adjacent categories are defined, the integration of categorical dummy variables becomes systematic, and estimation methods incorporating this type of variable selection have recently been proposed. More details can be found in the works of Anderson [4], Gertheiss and Tutz [5], Huang et al. [6], Ohishi et al. [7] and Fukushige [8]. However, as for unordered categorical variables, most researchers identify variables with coefficients of similar size, integrate them, and select variables by minimizing information criteria or other model selection criteria. Most textbooks and monographs, such as Andersen [9, 10], Hardy [11] and Yan and Su [12], treat regression models with categorical dummy variables, but they focus on the coding and do not consider the selection and integration of multiple categories.
Related works on multiple categories have recently been reported in the literature. In particular, Yuan and Lin [13] pointed out the problems in applying least absolute shrinkage and selection operator (lasso) for a group of explanatory variables such as a set of categorical dummy variables. As for this problem, Detmer et al. [14] and Wang and Leng [15] discussed the group lasso and adaptive group lasso estimation. In our work, this kind of problem was not considered because such problems are not related to a systematic way of efficiently integrating some categories.
From the traditional viewpoint of model selection with an information criterion, the idea is very simple: pick up all the possible combinations of the integrated categorical dummy variables, conduct the regression analysis for each, and find the best combination of explanatory variables. Even if we limit ourselves to the case where multiple categorical dummy variables are adopted, two problems remain in the practical estimation procedure. The first is listing all kinds of integrated categorical dummy variables. The other is efficiently picking up all the possible combinations of the integrated categorical dummy variables. In this work, we study this problem thoroughly and propose a relatively efficient estimation method. We also check the possibility of skipping the picking-up process by using the lasso and examine the possibility of utilizing variable selection criteria for ordered categorical data based on the estimated ordering of the coefficients. The reason for considering these two additional methods is to save computational time. We then conducted a simulation study to check the so-called “consistency” of the model selection procedures discussed by Burnham and Anderson [16].
The organization of this work is as follows: Sect. 2 gives details of how to code the categorical data in regression analysis and of the problems that arise when considering the integration of some categories. Section 3 has three subsections. Section 3.1 gives a proposed method to list all kinds of integration for categorical dummy variables. Section 3.2 introduces a method to pick up all the possible combinations of the integrated categorical dummy variables. Section 3.3 shows a proposed method utilizing the adaptive lasso to skip the picking-up process. Section 4 develops another variable selection method that utilizes a variable selection method for ordered categorical data. Section 5 presents the results of the simulation studies. Finally, Sect. 6 discusses the remaining problems.
2 A Regression Model with Categorical Dummy Variables
We introduce a regression model with multiple categorical dummy variables as an example. Consider the following model:
$${y_{i}} = \sum\limits_{k = 1}^{{\text{K}}} {\beta_{k} Dum_{ki} + \varepsilon_{i} ,\,\,i = 1,2,3, \ldots ,N}$$
(1)
where \({Dum}_{ki}\) is the dummy variable for the kth of the K categories and the ith observation. In this case, the categorical dummy variables can be set as follows:
$$\begin{aligned} Dum_{ki} &= 1 {\text{ if}}\,{\text{the}}\,i{\text{th }}\,{\text{observation}}\,{\text{belongs}}\,{\text{to}}\,{\text{the}}\,k{\text{th }}\,{\text{category}} \hfill \\ &= 0\,\,\,{\text{otherwise}} \hfill \\ \end{aligned}$$
(2)
No constant term is included because dummy variables for all K categories are used; including a constant would cause perfect multicollinearity. When a constant term is included in the regression equation, one of the dummy variables is usually removed. For example, if the hth dummy variable is removed, the model can be reparameterized as follows:
$$y_{i} = \beta_{h} + \sum\limits_{k \ne h} {\left( {\beta_{k} - \beta_{h} } \right)Dum_{ki} + \varepsilon_{i} ,\,\,\,i = 1,2,3, \ldots ,N.}$$
(3)
Which dummy variable to remove is left to the researcher. In the model selection procedure, the dummy variable should be chosen so that the coefficients of the remaining dummy variables are statistically significant, which leads to the most parsimonious model.
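As a quick check of the equivalence between parameterizations (1) and (3), the following Python sketch (our own illustration; the variable names are not from the paper) fits both codings by ordinary least squares and confirms that the fitted values coincide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate N observations falling into K = 4 categories.
N, K = 200, 4
cat = rng.integers(0, K, size=N)
D = np.eye(K)[cat]                      # N x K matrix of 0-1 dummy variables
beta = np.array([1.0, 2.0, 2.0, 3.0])
y = D @ beta + rng.standard_normal(N)

# Coding (1): all K dummies, no constant term.
b_full, *_ = np.linalg.lstsq(D, y, rcond=None)

# Coding (3): constant term plus K-1 dummies (the h-th dummy, here h = 1, is dropped).
X = np.column_stack([np.ones(N), D[:, 1:]])
b_drop, *_ = np.linalg.lstsq(X, y, rcond=None)

# Both codings give identical fitted values; the coefficients are related by
# b_drop[0] = b_full[0] and b_drop[j] = b_full[j] - b_full[0] for j >= 1.
print(np.allclose(D @ b_full, X @ b_drop))   # True
```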
Of course, this type of coding for multiple categorical data is not a unique solution. For example, Alkharusi [2] and Venkataramana et al. [3] discussed efficient coding and proposed coding without 0–1-type dummy variables. Nonetheless, we did not consider this type of coding because it leads to complex restrictions among the reparameterization of categorical variables.
Additionally, when we consider the integration of categorical dummy variables, the coding with categorical dummy variables is not a unique parameterization. For example, consider the following case:
$$y_{i} = \sum\limits_{k = 1}^{{\text{K}}} {\beta_{k} Dum_{ki} + \varepsilon_{i} ,\,\,\,i = 1,2,3, \ldots ,N,}$$
(4)
where
$$\begin{array}{*{20}c} {\beta_{k} = \delta } & {{\text{when}}\,{\text{k}}\;{\text{is}}\;{\text{odd}}} \\ {\beta_{k} = \theta } & {{\text{when}}\;{\text{k}}\;{\text{is}}\;{\text{even}}.} \\ \end{array}$$
Then, one of the most parsimonious parameterizations of the model becomes
$$y_{i} = \delta + \left( {\theta - \delta } \right)\sum\limits_{for k is even} {Dum_{ki} + \varepsilon_{i} .}$$
(5)
However, another parameterization of the model,
$$y_{i} = \theta + \left( {\delta - \theta } \right)\sum\limits_{for k is odd} {Dum_{ki} + \varepsilon_{i} ,}$$
also produces a most parsimonious parameterization. Both models are described with two coefficients. If we list \(\sum\nolimits_{for k is even} {Dum_{ki} }\) and \(\sum\nolimits_{for k is odd} {Dum_{ki} }\) as candidates among the possible integrated dummy variables, regression analysis can be conducted to calculate information criteria, but these two models give exactly the same value of the information criterion. In such a case, we remove from the list of candidates the integrated dummy variables that integrate more than half of the categorical dummy variables. If the total number of categories K is odd, \(\sum\nolimits_{for k is odd} {Dum_{ki} }\) is removed from the list; if the total number of categories is even, this saving method is not applied. Another example is more serious. Consider the four-category case in which the first dummy variable is removed:
$$y_{i} = \beta_{1} + \beta_{2} Dum_{2i} + \beta_{3} Dum_{3i} + \beta_{4} Dum_{4i} + \varepsilon_{i} ,\,\,\,i = 1,2,3, \ldots ,N,$$
(6)
We might list the integrated dummy variables \(\left({Dum}_{2i}+{Dum}_{3i}\right)\) and \(\left({Dum}_{3i}+{Dum}_{4i}\right)\) as candidates for the explanatory variables. If the combination of these two integrated dummy variables is picked up as the explanatory variables, \({Dum}_{3i}\) is adopted as an explanatory variable twice. Such a case cannot be interpreted as a variation of integrating some categorical dummy variables. Of course, with a specific restriction between the coefficients such as \({\beta }_{3}={\beta }_{2}+{\beta }_{4}\), the model can be parameterized as follows:
$$y_{i} = \beta_{1} + \beta_{2} \left( {Dum_{2i} + Dum_{3i} } \right) + \beta_{4} \left( {Dum_{3i} + Dum_{4i} } \right) + \varepsilon_{i} .$$
(7)
However, it is rare that this type of restriction is known in advance.
3 A Proposed Procedure to Reach the Most Parsimonious Model
In this section, considering the problems above, we propose a practical estimation method to reach the most parsimonious model.
3.1 Listing all the Candidates for Integrating Vectors for Dummy Variables
For the creation of an integrated dummy variable, an integrating vector is multiplied by the matrix of the categorical dummy variables. For example, consider the case of N observations and four categorical dummy variables. An \(N\times 4\) matrix of the categorical dummy variables can be formed:
$$\left[\begin{array}{cc}\begin{array}{cc}{D}_{1}& {D}_{2}\end{array}& \begin{array}{cc}{D}_{3}& {D}_{4}\end{array}\end{array}\right]$$
(8)
where \(D^{\prime}_{j} = \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {Dum_{j1} } & {Dum_{j2} } & \cdots \\ \end{array} } & {Dum_{jN} } \\ \end{array} } \right]\). To construct an integrated dummy variable, this matrix is post-multiplied by an integrating (column) vector. For example:
$$\begin{gathered} \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {D_{1} } & {D_{2} } \\ \end{array} } & {\begin{array}{*{20}c} {D_{3} } & {D_{4} } \\ \end{array} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} 1 & 1 \\ \end{array} } & {\begin{array}{*{20}c} 0 & 0 \\ \end{array} } \\ \end{array} } \right]^{\prime } = \left[ {D_{1} + D_{2} } \right] \hfill \\ \,\,\,\,\,\,\,\,\,\,\,\,\, = \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {Dum_{11} + Dum_{21} } & {Dum_{12} + Dum_{22} } & \cdots \\ \end{array} } & {Dum_{1N} + Dum_{2N} } \\ \end{array} } \right]^{\prime } . \hfill \\ \end{gathered}$$
(9)
In this example, an integrating (column) vector:
$$\left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} 1 & 1 \\ \end{array} } & {\begin{array}{*{20}c} 0 & 0 \\ \end{array} } \\ \end{array} } \right]^{\prime }$$
(10)
is one example of integrating the first and second categorical dummy variables. Then, all the possible integrating vectors can be listed. During this process, attention should be paid to some special kinds of vectors. The first special kind of (column) vector is one that picks up just one variable, for example:
$$\left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} 0 & 1 \\ \end{array} } & {\begin{array}{*{20}c} 0 & 0 \\ \end{array} } \\ \end{array} } \right]^{\prime } .$$
(11)
Strictly speaking, this kind of vector does not integrate dummy variables but simply picks up one of them. However, the product of the categorical dummy variable matrix and such a vector is still a candidate explanatory variable for the regression model. Two other kinds of integrating vectors should be removed from the list of integrating vectors:
$$\left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} 0 & 0 \\ \end{array} } & {\begin{array}{*{20}c} 0 & 0 \\ \end{array} } \\ \end{array} } \right]^{\prime } {\text{and}}\, \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} 1 & 1 \\ \end{array} } & {\begin{array}{*{20}c} 1 & 1 \\ \end{array} } \\ \end{array} } \right]^{\prime } .$$
(12)
The former produces a null vector, so it cannot be a candidate explanatory variable. The latter produces a vector whose elements are all one; when a constant term is included, this vector leads to perfect collinearity. Thus, it cannot be included as a candidate explanatory variable either.
Excluding these two special types of vectors, we can list all the candidate integrating vectors. In general, when K categorical dummy variables are considered, the integrating vectors are K-tuple vectors whose elements are zero or one, with two exceptions: the vector whose elements are all zero and the vector whose elements are all one. To form the K-tuple vectors whose elements are zero or one, we first transform the decimal numbers from 1 to \(2^{K}-1\) into binary numbers. By creating a vector from the digits of each binary number and adding zeros to the beginning so that the total number of digits is K, a K-tuple vector is formed. In this way we can list \(2^{K}-2\) integrating vectors (the all-ones vector, corresponding to \(2^{K}-1\), is excluded). These vectors exhaust all candidates for the integrating vector. However, as discussed in Sect. 2, when a constant term is included in the regression analysis, each candidate explanatory variable has a corresponding candidate that expresses the same model. To remove the corresponding duplicates, we remove the integrating vectors whose elements sum to more than K/2 from the candidates. To summarize the above steps:
-
Step 1: Transform the decimal numbers from 1 to \(2^{K}-1\) into binary numbers.
-
Step 2: Form the K-tuple vector by adding zeros to the beginning so that the total number of digits is K.
-
Step 3: Remove the vectors with the sum of elements greater than K/2 from the list of candidates.
Through these steps, about \(2^{K-1}\) candidates remain. Instead of adding zeros to the beginning of each vector, we implemented Step 2 by reversing the digit order and forming the K-tuple vectors by adding zeros to the end; the resulting set of candidates is the same. A sketch of the listing procedure in Python is given below.
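A minimal Python sketch of Steps 1–3, using the binary-representation idea (the function name is ours):

```python
def list_integrating_vectors(K):
    """List candidate integrating vectors for K categorical dummy variables.

    Steps 1-3 of Sect. 3.1: enumerate the binary representations of
    1 .. 2**K - 2 as K-tuples of 0/1 (all-zeros and all-ones are excluded)
    and drop vectors whose element sum exceeds K / 2.
    """
    candidates = []
    for n in range(1, 2 ** K - 1):                    # excludes 0 (all zeros) and 2**K - 1 (all ones)
        bits = [(n >> j) & 1 for j in range(K)]       # binary digits of n, padded to length K
        if sum(bits) <= K / 2:
            candidates.append(tuple(bits))
    return candidates

vecs = list_integrating_vectors(5)
print(len(vecs))   # 15 candidates for K = 5, i.e. 2**(K-1) - 1 when K is odd
```

Multiplying the dummy variable matrix by each kept vector, as in Eq. (9), yields one candidate explanatory variable.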
3.2 Picking up all the Combinations of the Candidates for Explanatory Variables
After listing the (about \(2^{K-1}-1\)) candidate integrating vectors, the same number of variables are available as candidate explanatory variables in the regressions. When a constant term is included, adopting more than K − 1 variables from the list of candidates either leads to perfect collinearity or cannot give the most parsimonious model. To select the most parsimonious model by some criterion, we therefore calculate the criterion by estimating the regression model for combinations of 1 through K − 1 of the candidate explanatory variables. At this step, adopting the same dummy variable twice or more within the integrated dummy variables should be avoided. To detect such cases when two or more integrating vectors are adopted, the row sums of the matrix formed by the adopted integrating (column) vectors are checked. If one of the row sums is greater than one, that combination of integrated dummy variables is not adopted as a set of explanatory variables, because some categorical dummy variable would be integrated twice. By conducting the regression estimation for the remaining combinations of integrated dummy variables as explanatory variables, the most parsimonious model from the viewpoint of a variable selection criterion is found. The above steps can be summarized as follows:
-
Step 1: Set the number M of vectors to be picked from the candidate integrating vectors. Start with M = 1.
-
Step 2: List all the combinations to pick up M vectors from the candidates of integrating vectors.
-
Step 3: Form a matrix by stacking the picked integrating (column) vectors and calculate each row sum. If one of the row sums is larger than 1, remove that combination from the list of combinations of integrating vectors.
-
Step 4: Conduct the regression estimation and calculate the variable selection criterion, with the explanatory variables formed by multiplying the categorical dummy variable matrix by the picked combination of integrating vectors.
Repeating Steps 1 through 4 while changing the number of picked vectors M from 1 to K − 1, we find the most parsimonious model from the viewpoint of a variable selection criterion. Of course, even at this step, there remain cases in which different combinations of the dummy variables represent the same data generating process; such combinations give exactly the same value of the variable selection criterion, which means that this process is not completely efficient. In the actual variable selection process, when exactly the same value of the variable selection criterion is obtained, it is not a problem to adopt whichever of the tied combinations is found first. A sketch of this search is given below.
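A sketch of the search in Steps 1–4, assuming BIC (Sect. 5.1) as the selection criterion; the function and variable names are ours:

```python
import numpy as np
from itertools import combinations

def best_model(y, D, vectors):
    """Steps 1-4 of Sect. 3.2: search over combinations of integrating vectors."""
    n, K = D.shape
    best_bic, best_combo = np.inf, None
    for M in range(1, K):                              # number of picked integrating vectors
        for combo in combinations(vectors, M):
            V = np.array(combo).T                      # K x M matrix of stacked integrating vectors
            if np.any(V.sum(axis=1) > 1):              # some category would be integrated twice: skip
                continue
            X = np.column_stack([np.ones(n), D @ V])   # constant term + integrated dummy variables
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            rss = np.sum((y - X @ beta) ** 2)
            bic = n * np.log(rss / n) + X.shape[1] * np.log(n)   # BIC, see Sect. 5.1
            if bic < best_bic:
                best_bic, best_combo = bic, combo
    return best_bic, best_combo
```

Combined with `list_integrating_vectors` above, `best_model(y, D, list_integrating_vectors(K))` returns the minimum-BIC combination; replacing the BIC line with the AIC formula gives the AIC-based search.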
3.3 Skipping the Step of Picking up the Combinations of Explanatory Variables by Lasso
Lasso is a shrinkage method that, like ridge regression, imposes a penalty related to the size of each coefficient, and it can be used as a variable selection method. Standard lasso estimation solves
$${\text{minimize }}\,{\text{RSS}} + \lambda \mathop \sum \limits_{i = 1}^{p} \left| {\beta_{i} } \right|$$
(13)
where RSS is the residual sum of squares, p is the number of estimated parameters, and \(\beta_{i}\) are the estimated coefficients of the regression. However, the lasso estimator may be inefficient and inconsistent. To overcome these problems, Zou [17] proposed the adaptive lasso:
$${\text{minimize}}\,{\text{RSS}} + \lambda \mathop \sum \limits_{i = 1}^{p} w_{i} \left| {\beta_{i} } \right|$$
(14)
By minimizing this objective function, the true model can be reached as the sample size tends to infinity; the oracle properties discussed by several researchers, e.g., Zou [17], include this consistency. In practice, using the coefficients \(\beta_{i}^{*}\) estimated by ridge regression, the weights can be set as \(w_{i} = 1/\left| {\beta_{i}^{*} } \right|\), and cross validation is applied to select the optimal tuning parameter \(\lambda\) in the adaptive lasso.
Instead of the sample size and the number of parameters used by traditional information criteria, the lasso and adaptive lasso penalties use the absolute magnitudes of the estimated parameters. Additionally, Hebiri and Lederer [18] showed that the correlations among explanatory variables affect the optimal tuning parameter in lasso estimation and that cross validation provides a suitable choice in many applications. However, while the original categorical dummy variables are mutually orthogonal, the integrated dummy variables discussed above exhibit correlations. This kind of change might affect the results of the variable selection.
In the simulation study in Sect. 5, we apply adaptive lasso estimation to the regression model with all the candidate explanatory variables. One of its most advantageous properties is that lasso estimation can be applied even when the number of explanatory variables is larger than the number of observations and when the set of explanatory variables is linearly dependent; the latter case means that the variance–covariance matrix of the explanatory variables is degenerate. This property is advantageous here because the picking-up process considered in Sect. 3.2 can be skipped: adaptive lasso estimation is simply applied once, with all the listed integrated categorical dummy variables as explanatory variables. Additionally, Huang et al. [6] and Ohishi et al. [7] proposed lasso estimation for categorical data using adjacency properties such as ordered categories; however, those methods cannot be applied to unordered categorical variables.
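A minimal sketch of the adaptive lasso step using scikit-learn, assuming the common column-rescaling implementation of the weighted penalty (14); this is our illustration, not necessarily the code used for the reported simulations:

```python
import numpy as np
from sklearn.linear_model import Ridge, LassoCV

def adaptive_lasso(X, y, ridge_alpha=1.0, cv=5):
    """Adaptive lasso via column rescaling (one common implementation).

    Weights are w_i = 1 / |beta_i*| with beta_i* from a ridge regression,
    and the tuning parameter is chosen by cross validation, as in Sect. 3.3.
    """
    beta_ridge = Ridge(alpha=ridge_alpha).fit(X, y).coef_
    w = 1.0 / np.maximum(np.abs(beta_ridge), 1e-8)   # guard against division by zero
    X_scaled = X / w                                  # lasso on X / w is equivalent to the weighted penalty
    lasso = LassoCV(cv=cv).fit(X_scaled, y)
    beta = lasso.coef_ / w                            # rescale back to the original variables
    selected = np.flatnonzero(np.abs(beta) > 0)
    return beta, selected
```

Here `X` would be the matrix of all listed integrated dummy variables (15 columns for K = 5), so the combination search of Sect. 3.2 is skipped.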
4 Another Method that Utilizes the Method for Ordered Categorical Variables
If the categories are ordered, the regression model with K ordered categorical dummy variables can first be defined as follows:
$${y}_{i}=\sum_{k=1}^{\text{K}}{\beta }_{k}{Dum}_{ki}+{\varepsilon }_{i}$$
(15)
where the constant term is removed to avoid perfect multicollinearity. Then, because the categories are ordered, new dummy variables that represent the accumulated effects of the jth category and above are constructed as follows:
$$ADum_{ji} = \sum\limits_{k = j}^{K} {Dum_{ki} , j = 1,2, \ldots ,K}$$
(16)
where \({ADum}_{1i}\) becomes the variable for the constant term. Using these dummy variables and defining new parameters
$$\delta_{k} = \beta_{k} - \beta_{k - 1} , k = 2,3, \ldots ,K$$
(17)
we can rewrite Eq. (15) with the accumulated dummy variables as follows:
$$y_{i} = \beta_{1} *ADum_{1i} + \sum\limits_{k = 2}^{K} {\left( {\beta_{k} - \beta_{k - 1} } \right)} *ADum_{ki} = \sum\limits_{k = 1}^{K} {\delta_{k} *ADum_{ki} }$$
(18)
where we set \({\delta }_{1}={\beta }_{1}\). Each \({\beta }_{j}\) can then be recovered as follows:
$${\beta }_{j}=\sum_{k=1}^{j}{\delta }_{k}.$$
(19)
By applying an estimation method to the following equation:
$${y}_{i}=\sum_{k=1}^{\text{K}}{\delta }_{k}{ADum}_{ki}+{\varepsilon }_{i}\begin{array}{cc},& i=\text{1,2},3,\dots ,N\end{array}$$
(20)
the problems of model selection and of integrating some categories (dummy variables) can be solved by minimizing Akaike's Information Criterion (AIC) by Akaike [19, 20], Schwarz's Bayesian Information Criterion (BIC) by Schwarz [21], or other selection methods. This type of transformation was proposed by Tian et al. [22] and can easily be applied to ordered categorical variables. For unordered categorical variables, however, some practical scheme is needed. First, we estimate the regression model (15) by OLS, with or without other explanatory variables. Then, using the estimated coefficients of the categorical dummy variables \(\widehat{{\beta }_{j}}, j=\text{1,2},\dots ,K\), we order the categorical dummy variables by their estimated coefficients in ascending or descending order. This allows us to rewrite the fitted version of Eq. (15) in the following form:
$${\widehat{y}}_{i}=\sum_{k=1}^{\text{K}}\widehat{{\beta }_{\left(k\right)}}{Dum}_{\left(k\right)i}$$
(21)
where \(\widehat{{\beta }_{\left(1\right)}}\le \widehat{{\beta }_{\left(2\right)}}\le \cdots \le \widehat{{\beta }_{\left(K\right)}}\) if ascending order is used. After this process, we treat the ordered categorical dummy variables \({Dum}_{\left(k\right)i}, k=\text{1,2},\dots ,K\) as dummy variables for ordered categorical data and construct new accumulated dummy variables:
$${\widehat{ADum}}_{ji}=\sum_{k=j}^{K}{Dum}_{\left(k\right)i}, j=\text{1,2},\dots ,K.$$
(22)
Using these dummy variables, we can apply information criteria or other variable selection procedures for the following model:
$$y_{i} = \sum\limits_{k = 1}^{{\text{K}}} {\delta_{k} \widehat{ADum}_{ki} + \varepsilon_{i} ,\,\,i = 1,2,3, \ldots ,N}$$
(23)
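A sketch of the two-stage construction of Sect. 4 (ordering by estimated OLS coefficients and accumulating the reordered dummies as in Eq. (22)); the function name is ours:

```python
import numpy as np

def ordered_accumulated_dummies(y, D):
    """Construct the accumulated dummies of Eq. (22) for unordered categories.

    First estimate Eq. (15) by OLS, order the categories by their estimated
    coefficients (ascending), and then accumulate the reordered dummies.
    """
    beta_hat, *_ = np.linalg.lstsq(D, y, rcond=None)   # OLS without a constant term
    order = np.argsort(beta_hat)                       # ascending order of estimated coefficients
    D_ordered = D[:, order]                            # Dum_(k)i
    # ADum_ji = sum over k >= j of Dum_(k)i: reverse columns, cumulate, reverse back.
    ADum = np.flip(np.cumsum(np.flip(D_ordered, axis=1), axis=1), axis=1)
    return ADum, order
```

Model (23) is then estimated with the columns of `ADum`, and AIC, BIC, or the adaptive lasso decides which increments \(\delta_k\) are set to zero, i.e., which adjacent reordered categories are integrated.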
5 Simulation Study
5.1 Information Criteria
Before describing our simulation study, we explain the variable selection criteria and the other methods used. As Burnham and Anderson [16] surveyed, there are several criteria. One of the most popular information criteria is AIC:
$${\text{AIC}} = {\text{n}}*\ln \left( {\frac{{{\text{RSS}}}}{{\text{n}}}} \right) + 2{\text{p}}$$
(24)
where RSS is the residual sum of squares and n and p are the numbers of observations and estimated parameters when maximum likelihood estimation of the Gaussian linear regression model is conducted. The best model is selected by minimizing AIC. However, AIC does not select the true model that generates the data even when the sample size becomes infinite; more details can be found in the work of Burnham and Anderson [16]. This property is called inconsistency. Another popular but consistent information criterion is BIC:
$${\text{BIC}} = {\text{n}}*\ln \left({\frac{{{\text{RSS}}}}{{\text{n}}}} \right) + {\text{p}}*\ln \left(n \right)$$
(25)
where RSS, n, and p are defined as above. The best model is selected by minimizing BIC. AIC and BIC are the most popular information criteria and are implemented in many software packages for regression analysis, e.g., R and STATA.
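For reference, a direct computation of the two criteria from an OLS fit (a small utility sketch of ours, following Eqs. (24) and (25)):

```python
import numpy as np

def aic_bic(y, X):
    """AIC (24) and BIC (25) for a Gaussian linear regression of y on X."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    aic = n * np.log(rss / n) + 2 * p
    bic = n * np.log(rss / n) + p * np.log(n)
    return aic, bic

# Example use with a set of integrated dummies V (K x M) and dummy matrix D:
# aic, bic = aic_bic(y, np.column_stack([np.ones(len(y)), D @ V]))
```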
5.2 Simulation Set-Ups for the Case of Five Categorical Dummy Variables
In this section, we describe the simulation set-ups when the categorical variable has five categories. We use five categorical dummy variables as explanatory variables and set the distribution of the observations over the categories. In this paper, we adopt two cases:
$$\begin{gathered} Case 1: \# \left\{ {i \in \left( {k{\text{th category}}} \right)} \right\} = \frac{{\text{N}}}{{\text{K}}} \hfill \\ Case 2: \# \left\{ {i \in \left( {k{\text{th category}}} \right)} \right\} = \frac{{\text{N}}}{{3{\text{K}}}}{\text{ if k is odd number}} \hfill \\ \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\# \left\{ {i \in \left( {k{\text{th category}}} \right)} \right\} = \frac{{2{\text{N}}}}{{3{\text{K}}}}{\text{ if k is even number}} \hfill \\ \end{gathered}$$
(26)
where we assume N is chosen so that these counts are integers. To conduct the simulation study, we set K = 5 and N = 300 or 3000. The two cases are designed to examine the effect of differences in the variances of the explanatory variables on variable selection: in Case 1, all dummy variables have the same variance, whereas in Case 2 the variances of the odd-numbered and even-numbered dummy variables differ.
Using the above settings, we first generate the dependent variable as follows:
$${y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}$$
(27)
where \({\varepsilon }_{i}\) is generated as a standard normal random number. This case assumes that all the coefficients are mutually different, with the same gap between adjacent coefficients. If the selection criterion chooses the true number of explanatory variables, the selected number is five, including the constant term. Another data generating process is assumed as follows:
$${y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}.$$
(28)
In this case, if the selection criterion chooses the true number of explanatory variables, the selected number is three, including the constant term. The third data generating process is assumed as follows:
$${y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}$$
(29)
In this case, if the selection criterion chooses the true number of explanatory variables, the selected number is two, including the constant term. As for the settings of \({\alpha }_{1}\), we use \({\alpha }_{1}=0.5\) in Setting 1 and \({\alpha }_{1}=5.0\) in Setting 2. The two values of \({\alpha }_{1}\) are chosen to examine the impact of the relative size of the coefficients to the standard deviation of the error term: with \({\alpha }_{1}=0.5\), the coefficient is smaller than the standard deviation of the error, and with \({\alpha }_{1}=5.0\), the coefficient is more than twice the standard deviation of the error term.
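The simulation set-ups of Eqs. (26)–(29) can be reproduced with the following sketch (our own code; the per-category counts follow Eq. (26) literally, so the realized sample size equals the sum of these counts):

```python
import numpy as np

def simulate(dgp, case, alpha1, N=300, K=5, seed=0):
    """Generate (y, D) for the set-ups of Sect. 5.2.

    dgp = 1, 2, 3 corresponds to Eqs. (27), (28), (29); case = 1 or 2 sets the
    number of observations per category as in Eq. (26).
    """
    rng = np.random.default_rng(seed)
    if case == 1:
        counts = [N // K] * K
    else:   # Case 2: N/(3K) observations in odd categories, 2N/(3K) in even categories
        counts = [N // (3 * K) if (k + 1) % 2 == 1 else 2 * N // (3 * K) for k in range(K)]
    labels = np.repeat(np.arange(K), counts)      # realized sample size = sum(counts)
    D = np.eye(K)[labels]                         # dummy variable matrix
    if dgp == 1:
        beta = alpha1 * np.arange(1, K + 1)               # Eq. (27): all coefficients differ
    elif dgp == 2:
        beta = alpha1 * np.array([1, 2, 2, 3, 3])         # Eq. (28): three distinct values (K = 5)
    else:
        beta = alpha1 * np.array([1, 1, 2, 2, 2])         # Eq. (29): two distinct values (K = 5)
    y = D @ beta + rng.standard_normal(len(labels))       # standard normal error term
    return y, D
```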
5.3 Simulation Results
We first show the results of the proposed method of Sect. 3 for the selected number of explanatory variables, including the constant term, when N = 300 (a relatively small sample) for the six data generating set-ups with Case 1 dummy variables in Table 1. These results were obtained with 5000 replications. Table 2 presents the selected number of explanatory variables for the same set-ups with Case 2 dummy variables. Tables 3 and 4 report the same settings when N = 3000 (a relatively large sample). These four tables report the results of variable selection by AIC and BIC and of adaptive lasso applied to all the candidate explanatory variables; in this case, the number of candidates is 15 (\(2^{4}-1\)), as in Sect. 3. The numbers in bold indicate the cases in which the correct number of explanatory variables was selected. In these tables, we report only the selected number of explanatory variables in order to investigate the dimension consistency of the variable selection procedures, as Burnham and Anderson [16] pointed out.
Table 1
Simulation results of Case 1 when N = 300 with 5000 replications
Selected number of the explanatory variables including the constant term

Method | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | ||||||||||||||
AIC | 0 | 0 | 120 | 1578 | 3302 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 4 | 1778 | 2802 | 416 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 0 | 4 | 141 | 4855 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | ||||||||||||||
AIC | 0 | 0 | 0 | 0 | 5000 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 0 | 0 | 0 | 5000 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 0 | 0 | 0 | 5000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
DGP:\({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i} \text{and }{\alpha }_{1}=0.5\) | ||||||||||||||
AIC | 0 | 160 | 3867 | 949 | 24 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 1538 | 3448 | 14 | 0 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 59 | 642 | 1845 | 2454 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
DGP:\({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i} \text{and }{\alpha }_{1}=5.0\) | ||||||||||||||
AIC | 0 | 0 | 3556 | 1324 | 120 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 0 | 4824 | 174 | 2 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 0 | 818 | 1999 | 2183 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) \(\text{and }{\alpha }_{1}=0.5\) | ||||||||||||||
AIC | 0 | 2999 | 1916 | 85 | 0 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 4851 | 149 | 0 | 0 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 21 | 1096 | 1728 | 1262 | 889 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) \(\text{and }{\alpha }_{1}=5.0\) | ||||||||||||||
AIC | 0 | 2762 | 1991 | 245 | 2 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 4697 | 298 | 5 | 0 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 2011 | 1465 | 813 | 711 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Table 2
Simulation results of Case 2 when N = 300 with 5000 replications
Selected number of the explanatory variables including the constant term

Method | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | ||||||||||||||
AIC | 0 | 28 | 1568 | 3140 | 264 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 581 | 3265 | 1150 | 4 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 7 | 59 | 426 | 1729 | 2082 | 616 | 71 | 10 | 0 | 0 | 0 | 0 | 0 |
DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | ||||||||||||||
AIC | 0 | 0 | 0 | 0 | 5000 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 0 | 0 | 0 | 5000 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 0 | 0 | 179 | 1952 | 2869 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
DGP:\({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i} \text{and }{\alpha }_{1}=0.5\) | ||||||||||||||
AIC | 0 | 1217 | 3043 | 726 | 14 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 3598 | 1381 | 21 | 0 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 34 | 210 | 752 | 1569 | 1509 | 707 | 101 | 47 | 22 | 30 | 17 | 2 | 0 | 0 |
DGP:\({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i} \text{and }{\alpha }_{1}=5.0\) | ||||||||||||||
AIC | 0 | 0 | 2910 | 1809 | 281 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 0 | 4738 | 257 | 5 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 0 | 384 | 1828 | 1731 | 643 | 360 | 41 | 13 | 0 | 0 | 0 | 0 | 0 |
DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) \(\text{and }{\alpha }_{1}=0.5\) | ||||||||||||||
AIC | 0 | 2718 | 2008 | 269 | 5 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 4690 | 309 | 1 | 0 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 45 | 436 | 1219 | 1589 | 1172 | 474 | 35 | 6 | 10 | 8 | 5 | 1 | 0 | 0 |
DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) \(\text{and }{\alpha }_{1}=5.0\) | ||||||||||||||
AIC | 0 | 2282 | 2172 | 508 | 38 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 4565 | 426 | 9 | 0 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 832 | 1855 | 1191 | 625 | 417 | 74 | 6 | 0 | 0 | 0 | 0 | 0 | 0 |
Table 3
Simulation results of Case 1 when N = 3000 with 5000 replications
Selected number of the explanatory variables including the constant term

Method | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | ||||||||||||||
AIC | 0 | 0 | 0 | 0 | 5000 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 0 | 0 | 0 | 5000 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 0 | 0 | 0 | 5000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | ||||||||||||||
AIC | 0 | 0 | 0 | 0 | 5000 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 0 | 0 | 0 | 5000 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 0 | 0 | 0 | 5000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
DGP:\({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i} \text{and }{\alpha }_{1}=0.5\) | ||||||||||||||
AIC | 0 | 0 | 3584 | 1303 | 113 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 0 | 4957 | 43 | 0 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 0 | 799 | 1960 | 2241 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
DGP:\({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i} \text{and }{\alpha }_{1}=5.0\) | ||||||||||||||
AIC | 0 | 0 | 3584 | 1303 | 113 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 0 | 4957 | 43 | 0 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 0 | 1481 | 2545 | 974 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) \(\text{and }{\alpha }_{1}=0.5\) | ||||||||||||||
AIC | 0 | 2779 | 1980 | 240 | 1 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 4921 | 79 | 0 | 0 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 1713 | 1566 | 951 | 770 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) \(\text{and }{\alpha }_{1}=5.0\) | ||||||||||||||
AIC | 0 | 2779 | 1980 | 240 | 1 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 4921 | 79 | 0 | 0 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 2271 | 1501 | 821 | 407 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Table 4
Simulation results of Case 2 when N = 3000 with 5000 replications
Selected number of the explanatory variables including the constant term

Method | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | ||||||||||||||
AIC | 0 | 0 | 1 | 243 | 4756 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 0 | 354 | 2917 | 1729 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 0 | 0 | 150 | 1688 | 2938 | 146 | 78 | 0 | 0 | 0 | 0 | 0 | 0 |
DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | ||||||||||||||
AIC | 0 | 0 | 0 | 0 | 5000 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 0 | 0 | 0 | 5000 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 0 | 0 | 695 | 3375 | 930 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
DGP:\({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i} \text{and }{\alpha }_{1}=0.5\) | ||||||||||||||
AIC | 0 | 0 | 3119 | 1735 | 146 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 121 | 4830 | 49 | 0 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 0 | 321 | 1665 | 1702 | 778 | 180 | 97 | 149 | 76 | 27 | 5 | 0 | 0 |
DGP:\({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i} \text{and }{\alpha }_{1}=5.0\) | ||||||||||||||
AIC | 0 | 0 | 2913 | 1809 | 278 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 0 | 4922 | 78 | 0 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 0 | 452 | 2193 | 1816 | 434 | 93 | 12 | 0 | 0 | 0 | 0 | 0 | 0 |
DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) \(\text{and }{\alpha }_{1}=0.5\) | ||||||||||||||
AIC | 0 | 2283 | 2252 | 449 | 16 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 4875 | 125 | 0 | 0 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 584 | 1441 | 1453 | 979 | 314 | 72 | 66 | 63 | 24 | 2 | 2 | 0 | 0 |
DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) \(\text{and }{\alpha }_{1}=5.0\) | ||||||||||||||
AIC | 0 | 2266 | 2171 | 526 | 37 | – | – | – | – | – | – | – | – | – |
BIC | 0 | 4875 | 125 | 0 | 0 | – | – | – | – | – | – | – | – | – |
Adaptive lasso | 0 | 1004 | 2043 | 1076 | 573 | 257 | 44 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
As regards the selected number of variables, the results from Tables 1, 2, 3, 4 can be summarized as follows.
BIC and AIC tend to impose stricter penalties compared with adaptive lasso when the magnitudes of all coefficients differ and the differences are relatively small, leading to the selection of models smaller than the true dimension. This tendency persists even when the sample size is 3000.
In cases where category integration is necessary, the results of variable selection by adaptive lasso tend to favor models larger than the true dimension, with a lower probability of selecting the true dimension compared with results based on AIC or BIC.
In Case 2, where the variances of explanatory variables differ, the results of variable selection by adaptive lasso tend to choose models larger than the true dimension, exhibiting a tendency to select more variables than the original number of categories, which is 5.
In cases where category integration is necessary, if the sample size is small and the differences in coefficients between variables are small, AIC tends to show a higher probability of selecting the true dimension than BIC. However, when the sample size is large or the differences in coefficients between variables are large, BIC has a higher probability of selecting the true dimension.
In the case of a sample size of 3000, regardless of the case or model, BIC has a higher probability of selecting the true dimension compared with other criteria.
As for the method utilizing the nature of ordered categorical dummy variables proposed in Sect. 4, Table 5 (N = 300) and Table 6 (N = 3000) report the simulation results for Case 1 and Case 2 under the same set-ups as Tables 1, 2, 3, 4. The results of Tables 5 and 6, compared with Tables 1, 2, 3, 4, can be summarized as follows.
Table 5
Simulation results of a selected number of aggregated dummy variables including constant term of Case 1 and Case 2 when N = 300 with 5000 replications
Method (Case 1) | 1 | 2 | 3 | 4 | 5 | Method (Case 2) | 1 | 2 | 3 | 4 | 5
---|---|---|---|---|---|---|---|---|---|---|---
DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | |||||||||||
AIC | 0 | 0 | 120 | 1578 | 3302 | AIC | 0 | 328 | 2962 | 1705 | 5 |
BIC | 0 | 4 | 1772 | 2808 | 416 | BIC | 6 | 547 | 4065 | 382 | 0 |
Adaptive lasso | 0 | 1 | 71 | 728 | 4200 | Adaptive lasso | 1 | 102 | 1423 | 2455 | 1019 |
DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | |||||||||||
AIC | 0 | 0 | 0 | 0 | 5000 | AIC | 0 | 0 | 0 | 0 | 5000 |
BIC | 0 | 0 | 0 | 0 | 5000 | BIC | 0 | 0 | 0 | 0 | 5000 |
Adaptive lasso | 0 | 0 | 0 | 0 | 5000 | Adaptive lasso | 0 | 0 | 0 | 0 | 5000 |
DGP:\({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i} \text{and }{\alpha }_{1}=0.5\) | |||||||||||
AIC | 0 | 159 | 3844 | 973 | 24 | AIC | 30 | 1120 | 3330 | 501 | 19 |
BIC | 0 | 1476 | 3500 | 24 | 0 | BIC | 596 | 2196 | 2174 | 33 | 1 |
Adaptive lasso | 0 | 170 | 1820 | 2011 | 999 | Adaptive lasso | 56 | 836 | 1659 | 1839 | 610 |
DGP:\({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i} \text{and }{\alpha }_{1}=5.0\) | |||||||||||
AIC | 0 | 0 | 3556 | 1324 | 120 | AIC | 0 | 0 | 4762 | 234 | 4 |
BIC | 0 | 0 | 4824 | 174 | 2 | BIC | 0 | 0 | 4999 | 1 | 0 |
Adaptive lasso | 0 | 0 | 4813 | 145 | 42 | Adaptive lasso | 0 | 0 | 3400 | 1227 | 373 |
DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) \(\text{and }{\alpha }_{1}=0.5\) | |||||||||||
AIC | 7 | 2997 | 1762 | 225 | 9 | AIC | 52 | 1619 | 2763 | 530 | 36 |
BIC | 122 | 4680 | 189 | 9 | 0 | BIC | 634 | 2481 | 1836 | 48 | 1 |
Adaptive lasso | 24 | 1918 | 1796 | 926 | 336 | Adaptive lasso | 72 | 998 | 1761 | 1496 | 673 |
DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) \(\text{and }{\alpha }_{1}=5.0\) | |||||||||||
AIC | 0 | 2839 | 1825 | 324 | 12 | AIC | 0 | 1256 | 3294 | 435 | 15 |
BIC | 0 | 4721 | 265 | 14 | 0 | BIC | 0 | 2513 | 2457 | 30 | 0 |
Adaptive lasso | 0 | 4680 | 216 | 73 | 31 | Adaptive lasso | 0 | 1527 | 1991 | 698 | 784 |
Table 6
Simulation results of a selected number of aggregated dummy variables including constant term of Case 1 and Case 2 when N = 3000 with 5000 replications
Method (Case 1) | 1 | 2 | 3 | 4 | 5 | Method (Case 2) | 1 | 2 | 3 | 4 | 5
---|---|---|---|---|---|---|---|---|---|---|---
DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | |||||||||||
AIC | 0 | 0 | 0 | 0 | 5000 | AIC | 0 | 0 | 3 | 331 | 4666 |
BIC | 0 | 0 | 0 | 0 | 5000 | BIC | 0 | 0 | 767 | 2819 | 1414 |
Adaptive lasso | 0 | 0 | 0 | 0 | 5000 | Adaptive lasso | 0 | 0 | 3 | 178 | 4819 |
DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | |||||||||||
AIC | 0 | 0 | 0 | 0 | 5000 | AIC | 0 | 0 | 0 | 0 | 5000 |
BIC | 0 | 0 | 0 | 0 | 5000 | BIC | 0 | 0 | 0 | 0 | 5000 |
Adaptive lasso | 0 | 0 | 0 | 0 | 5000 | Adaptive lasso | 0 | 0 | 0 | 0 | 5000 |
DGP:\({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i} \text{and }{\alpha }_{1}=0.5\) | |||||||||||
AIC | 0 | 0 | 3584 | 1303 | 113 | AIC | 0 | 2 | 3729 | 1210 | 59 |
BIC | 0 | 0 | 4957 | 43 | 0 | BIC | 0 | 5 | 4965 | 30 | 0 |
Adaptive lasso | 0 | 0 | 3579 | 1091 | 330 | Adaptive lasso | 0 | 2 | 2625 | 1818 | 555 |
DGP:\({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i} \text{and }{\alpha }_{1}=5.0\) | |||||||||||
AIC | 0 | 0 | 3584 | 1303 | 113 | AIC | 0 | 0 | 4773 | 226 | 1 |
BIC | 0 | 0 | 4957 | 43 | 0 | BIC | 0 | 0 | 5000 | 0 | 0 |
Adaptive lasso | 0 | 0 | 5000 | 0 | 0 | Adaptive lasso | 0 | 0 | 4792 | 192 | 16 |
DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) \(\text{and }{\alpha }_{1}=0.5\) | |||||||||||
AIC | 0 | 2872 | 1810 | 295 | 23 | AIC | 0 | 1897 | 2388 | 676 | 39 |
BIC | 0 | 4925 | 73 | 2 | 0 | BIC | 0 | 2400 | 2584 | 16 | 0 |
Adaptive lasso | 0 | 3245 | 1141 | 468 | 146 | Adaptive lasso | 0 | 1063 | 2278 | 1302 | 357 |
DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) \(\text{and }{\alpha }_{1}=5.0\) | |||||||||||
AIC | 0 | 2872 | 1810 | 295 | 23 | AIC | 0 | 0 | 4492 | 491 | 17 |
BIC | 0 | 4925 | 73 | 2 | 0 | BIC | 0 | 0 | 4997 | 3 | 0 |
Adaptive lasso | 0 | 5000 | 0 | 0 | 0 | Adaptive lasso | 0 | 2382 | 2480 | 86 | 52 |
When utilizing the nature of ordered categorical dummy variables, BIC and AIC tend to impose stricter penalties compared with adaptive lasso when the magnitudes of all coefficients differ, and the differences are relatively small. This tendency persists even when the sample size is 3,000. However, the results are not significantly different from the results of Tables 1 and 2.
From these results, we suggest that, except for cases where the magnitudes of all coefficients differ and their differences are relatively small, BIC performs better in selecting the true dimension of the model compared with the results of AIC and adaptive lasso.
When utilizing the nature of ordered categorical dummy variables, in Case 1, the probabilities of selecting the true dimension were not significantly different when comparing the results of Table 5 with those of Table 1, or Table 6 with those of Table 3. These results imply that BIC again gives the highest probability of selecting the true dimension, as in Tables 1, 2, 3, 4, and suggest the practicality of utilizing the property of the ordered categorical dummy variables.
When utilizing the nature of ordered categorical dummy variables, in Case 2, the probabilities of selecting the true dimension in Tables 2 and 4 were higher than the corresponding probabilities in Tables 5 and 6. This suggests that when the variances of the variables differ, the ordering of the dummy variables by least squares in the first stage may not work well, making it difficult for AIC or BIC to select the true dimension.
The findings suggest that utilizing the properties of ordered categorical dummy variables for variable selection and integration is practical in cases where the number of observations per category is roughly the same. However, this method is not practical when the number of observations per category varies significantly.
6 Discussion
In this work, we propose a new variable selection procedure including the integration of some variables for categorical dummy variables in regression analysis.
Picking up all the candidate explanatory variables and their possible combinations, we searched for the optimal model by AIC and BIC. In this procedure, BIC performed relatively well, but it required substantial computation time in practice. Another procedure, which skips choosing the possible combinations of explanatory variables and searches for an optimal model among all the candidate explanatory variables by lasso, did not perform well. In some cases, the procedure utilizing the property of the ordered categorical dummy variables showed performance similar to that of the procedure that searches for the optimal combinations; this result depends on the first step, which orders the categorical dummy variables by their estimated coefficients. Additionally, in this work, we checked the performance of adaptive lasso estimation; in most cases, from the dimension consistency point of view, adaptive lasso estimation did not perform better than the minimum-BIC procedure.
Finally, from a practical point of view, the proposed method of searching for the minimum AIC or minimum BIC model requires much computation time. When there are more than 10 categories, or several categorical variables are included as explanatory variables in the regression analysis, it takes too much time to find the optimal model. Future work should either pick up the combinations of candidate variables more efficiently or adopt a model selection method that imposes a heavier penalty within an adaptive-lasso-like framework. Additionally, in this paper we focus only on the dimension consistency of the variable selection procedures; there remains the possibility that the selected dummy variables can be converted into a set of dummy variables that is easier to interpret, but this problem depends closely on the kind of categories involved and must be treated case by case.
Acknowledgements
This work is partially supported by the Japan Society for the Promotion of Science (JSPS), Grant-in-Aid for Scientific Research (B) 23H00806.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.