
Open Access 17.05.2025 | Original Article

Variable Selection and Variable Integration for Categorical Dummy Variables in Regression Analysis

Author: Mototsugu Fukushige

Published in: Annals of Data Science


Abstract

The integration and selection of categorical dummy variables in regression analysis poses unique challenges, particularly when multiple categories are considered as explanatory variables. This article examines various coding methods in detail, from traditional dummy variable approaches to more efficient techniques proposed by researchers such as Walter et al. and Alkharusi. It addresses the subtleties of coding and reparameterization, especially when categories are integrated systematically, as with ordered categorical data. The article proposes a practical estimation method for reaching the most parsimonious model, describing steps to list all possible integrated dummy variables and to select the optimal combinations. Simulation studies are conducted to evaluate the performance of different variable selection criteria, including AIC, BIC, and the adaptive lasso. The results highlight the strengths and limitations of each method and offer insights into their applicability in various scenarios. The article also examines the use of the adaptive lasso and of ordered categorical variables, providing a nuanced understanding of their effectiveness for variable selection and integration. Readers will gain a deeper understanding of the complexities of handling categorical dummy variables and of innovative approaches for addressing these challenges.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

When not only the selection but also the integration of multiple categories as explanatory variables in regression analysis is considered, several cumbersome problems arise immediately. The first issue is the way of coding. It is well known that several coding methods for multiple categorical data exist. Most researchers adopt multiple categorical dummy variables and sometimes exclude one of them to avoid perfect multicollinearity with the constant term in regression analysis. In other works, the application of efficient coding methods such as those by Walter et al. [1], Alkharusi [2], and Venkataramana et al. [3] has been explored. Additionally, when the integration of some categories is considered, the method of coding and the reparameterization of the coefficients for integrated categories are closely related. In some cases, such as ordered categorical dummy variables or when adjacent categories are defined, the integration of categorical dummy variables becomes systematic, and a type of estimation and variable selection method has recently been proposed; more details can be found in the works of Anderson [4], Gertheiss and Tutz [5], Huang et al. [6], Ohishi et al. [7] and Fukushige [8]. For unordered categorical variables, however, most researchers look for variables with coefficients of similar size, integrate them, and select variables by minimizing information criteria or other model selection criteria. Most textbooks or monographs, such as Andersen [9, 10], Hardy [11] or Yan and Su [12], treat regression models with categorical dummy variables, but they focus on the coding and do not consider the selection and integration of multiple categories.
Related works on multiple categories have recently been reported in the literature. In particular, Yuan and Lin [13] pointed out the problems of applying the least absolute shrinkage and selection operator (lasso) to a group of explanatory variables such as a set of categorical dummy variables. For this problem, Detmer et al. [14] and Wang and Leng [15] discussed group lasso and adaptive group lasso estimation. We do not consider this kind of problem in our work because it is not related to a systematic way of efficiently integrating some categories.
From a traditional viewpoint of model selection with information criteria, the idea is very simple: pick up all the possible combinations of the integrated categorical dummy variables, conduct the regression analysis for each, and find the best combination of explanatory variables. Even if we limit ourselves to the case where multiple categorical dummy variables are adopted, two problems remain in the practical estimation procedure. The first is listing all kinds of integrated categorical dummy variables. The other is efficiently enumerating all the possible combinations of the integrated categorical dummy variables. In this work, we study this problem thoroughly and propose a relatively efficient estimation method. We also examine the possibility of skipping the enumeration step by using the lasso, and the possibility of utilizing variable selection criteria for ordered categorical data based on the estimated ordering of the coefficients. The reason for considering these two additional methods is to save computational time. We conducted a simulation study to check the so-called “consistency” of the model selection procedures discussed by Burnham and Anderson [16].
The organization of this work is as follows. Section 2 details how to code categorical data in regression analysis and the problems that arise when the integration of some categories is considered. Section 3 has three subsections: Sect. 3.1 proposes a method to list all kinds of integration for categorical dummy variables, Sect. 3.2 introduces a method to pick up all the possible combinations of the integrated categorical dummy variables, and Sect. 3.3 shows a method utilizing the adaptive lasso to skip the picking-up process. Section 4 develops another variable selection method that utilizes the approach for ordered categorical data. Section 5 presents the results of the simulation studies. Finally, Sect. 6 discusses remaining problems.

2 A Regression Model with Categorical Dummy Variables

We introduce a regression model with multiple categorical dummy variables as an example. Consider the following model:
$${y_{i}} = \sum\limits_{k = 1}^{{\text{K}}} {\beta_{k} Dum_{ki} + \varepsilon_{i} ,\,\,i = 1,2,3, \ldots ,N}$$
(1)
where \({Dum}_{ki}\) are dummy variables for the K categories and the ith observation. In this case, the categorical dummy variables are defined as follows:
$$\begin{aligned} Dum_{ki} &= 1 {\text{ if}}\,{\text{the}}\,i{\text{th }}\,{\text{observation}}\,{\text{belongs}}\,{\text{to}}\,{\text{the}}\,k{\text{th }}\,{\text{category}} \hfill \\ &= 0\,\,\,{\text{otherwise}} \hfill \\ \end{aligned}$$
(2)
No constant term is included because dummy variables for all K categories are used, and including one would cause perfect multicollinearity. When a constant term is included in the regression equation, one of the dummy variables is usually removed. For example, if the hth dummy variable is removed, the model can be reparameterized as follows:
$$y_{i} = \beta_{h} + \sum\limits_{k \ne h} {\left( {\beta_{k} - \beta_{h} } \right)Dum_{ki} + \varepsilon_{i} ,\,\,\,i = 1,2,3, \ldots ,N.}$$
(3)
The choice of the hth dummy variable is up to the researcher. In the model selection procedure, a dummy variable whose coefficient is statistically significant should be chosen in order to reach the most parsimonious model.
Of course, this type of coding for multiple categorical data is not the only possibility. For example, Alkharusi [2] and Venkataramana et al. [3] discussed efficient coding and proposed coding schemes that do not use 0–1-type dummy variables. Nonetheless, we did not consider this type of coding because it leads to complex restrictions in the reparameterization of the categorical variables.
Additionally, when the integration of categorical dummy variables is considered, coding with categorical dummy variables does not yield a unique parameterization. For example, consider the following case:
$$y_{i} = \sum\limits_{k = 1}^{{\text{K}}} {\beta_{k} Dum_{ki} + \varepsilon_{i} ,\,\,\,i = 1,2,3, \ldots ,N.}$$
(4)
where,
$$\begin{array}{*{20}c} {\beta_{k} = \delta } & {{\text{when}}\,{\text{k}}\;{\text{is}}\;{\text{odd}}} \\ {\beta_{k} = \theta } & {{\text{when}}\;{\text{k}}\;{\text{is}}\;{\text{even}}.} \\ \end{array}$$
Then, one of the most parsimonious parameterizations of the model becomes
$$y_{i} = \delta + \left( {\theta - \delta } \right)\sum\limits_{k\;{\text{even}}} {Dum_{ki} } + \varepsilon_{i} .$$
(5)
However, another parameterized model:
$$y_{i} = \theta + \left( {\delta - \theta } \right)\sum\limits_{k\;{\text{odd}}} {Dum_{ki} } + \varepsilon_{i} .$$
also gives a most parsimonious parameterization. Both models are described with two coefficients. If we list \(\sum\nolimits_{k\;{\text{even}}} {Dum_{ki} }\) and \(\sum\nolimits_{k\;{\text{odd}}} {Dum_{ki} }\) as candidate integrated dummy variables, regression analysis can be conducted and information criteria calculated, but the two models give exactly the same value of the information criterion. In such a case, we remove from the list of candidates the integrated dummy variables that integrate more than half of the categorical dummy variables. If the total number of categories K is odd, \(\sum\nolimits_{k\;{\text{odd}}} {Dum_{ki} }\) is removed from the list; if the total number of categories is even, this saving method cannot be applied. Another example is more serious. Consider the four-category case in which the first category is removed:
$$y_{i} = \beta_{1} + \beta_{2} Dum_{2i} + \beta_{3} Dum_{3i} + \beta_{4} Dum_{4i} + \varepsilon_{i} ,\,\,\,i = 1,2,3, \ldots ,N,$$
(6)
We might list the integrated dummy variables \(\left({Dum}_{2i}+{Dum}_{3i}\right)\) and \(\left({Dum}_{3i}+{Dum}_{4i}\right)\) as candidate explanatory variables. If the combination of these two integrated dummy variables is picked up as explanatory variables, \({Dum}_{3i}\) is adopted twice as an explanatory variable. This case cannot be interpreted as a variation of integrating some categorical dummy variables. Of course, under a specific restriction between the coefficients such as \({\beta }_{3}={\beta }_{2}+{\beta }_{4}\), the model can be parameterized as follows:
$$y_{i} = \beta_{1} + \beta_{2} \left( {Dum_{2i} + Dum_{3i} } \right) + \beta_{4} \left( {Dum_{3i} + Dum_{4i} } \right) + \varepsilon_{i} .$$
(7)
However, it is rare that this type of restriction is known in advance.

3 A Proposed Procedure to Reach the Most Parsimonious Model

In this section, considering the problems above, we propose a practical estimation method to reach the most parsimonious model.

3.1 Listing all the Candidates for Integrating Vectors for Dummy Variables

For the creation of an integrated dummy variable, the matrix of categorical dummy variables is multiplied by an integrating vector. For example, consider the case of N observations and four categorical dummy variables. An \(N\times 4\) matrix of the categorical dummy variables can be formed:
$$\left[\begin{array}{cc}\begin{array}{cc}{D}_{1}& {D}_{2}\end{array}& \begin{array}{cc}{D}_{3}& {D}_{4}\end{array}\end{array}\right]$$
(8)
where \(D^{\prime}_{j} = \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {Dum_{j1} } & {Dum_{j2} } & \cdots \\ \end{array} } & {Dum_{jN} } \\ \end{array} } \right]\). When an integrated dummy variable is constructed, the categorical dummy variable matrix is post-multiplied by an integrating (column) vector. For example:
$$\begin{gathered} \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {D_{1} } & {D_{2} } \\ \end{array} } & {\begin{array}{*{20}c} {D_{3} } & {D_{4} } \\ \end{array} } \\ \end{array} } \right]\left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} 1 & 1 \\ \end{array} } & {\begin{array}{*{20}c} 0 & 0 \\ \end{array} } \\ \end{array} } \right]^{\prime } = \left[ {D_{1} + D_{2} } \right] \hfill \\ \,\,\,\,\,\,\,\,\,\,\,\,\, = \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} {Dum_{11} + Dum_{21} } & {Dum_{12} + Dum_{22} } & \cdots \\ \end{array} } & {Dum_{1N} + Dum_{2N} } \\ \end{array} } \right]^{\prime } . \hfill \\ \end{gathered}$$
(9)
In this example, an integrating (column) vector:
$$\left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} 1 & 1 \\ \end{array} } & {\begin{array}{*{20}c} 0 & 0 \\ \end{array} } \\ \end{array} } \right]^{\prime }$$
(10)
is one example, integrating the first and second categorical dummy variables. Then, all the possible integrating vectors can be listed. During this process, attention should be paid to some special kinds of vectors. The first special kind of (column) vector is one that selects just a single variable, for example:
$$\left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} 0 & 1 \\ \end{array} } & {\begin{array}{*{20}c} 0 & 0 \\ \end{array} } \\ \end{array} } \right]^{\prime } .$$
(11)
Specifically, this kind of vector does not integrate dummy variables but simply picks out one of them. Nevertheless, post-multiplying the categorical dummy variable matrix by such a vector still yields a candidate explanatory variable for the regression model. The other two kinds of integrating vectors, which should be removed from the list of integrating vectors, are
$$\left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} 0 & 0 \\ \end{array} } & {\begin{array}{*{20}c} 0 & 0 \\ \end{array} } \\ \end{array} } \right]^{\prime } {\text{and}}\, \left[ {\begin{array}{*{20}c} {\begin{array}{*{20}c} 1 & 1 \\ \end{array} } & {\begin{array}{*{20}c} 1 & 1 \\ \end{array} } \\ \end{array} } \right]^{\prime } .$$
(12)
The former is just a null vector, so it cannot be a candidate explanatory variable. The latter yields a vector whose elements are all one; when a constant term is included, this vector leads to perfect collinearity, so it cannot be included as a candidate explanatory variable either.
Excluding these two special types of vectors, we list all the candidate integrating vectors. In general, when K categorical dummy variables are considered, the integrating vectors are K-tuple vectors whose elements are one or zero, with two exceptions: the vector of all zeros and the vector of all ones. To form the K-tuple vectors whose elements are one or zero, we first transform the decimal numbers from 1 to \(2^{K}-1\) into binary numbers. By using each digit of the generated binary number as an element and adding zeros at the beginning so that the total number of digits is K, a K-tuple vector can be formed. In this way, we can list the \(2^{K}-2\) integrating vectors, which exhaust all candidates. However, as discussed in Sect. 2, when a constant term is included in the regression analysis, each candidate explanatory variable has a corresponding candidate that expresses the same model. To remove these corresponding candidates, we remove the integrating vectors whose elements sum to more than K/2. To summarize the above steps:
  • Step 1: Transform the decimal numbers from 1 to \(2^{K}-1\) into binary numbers.
  • Step 2: Form the K-tuple vector by adding zeros to the beginning so that the total number of digits is K.
  • Step 3: Remove the vectors with the sum of elements greater than K/2 from the list of candidates.
Through these steps, about \(2^{K-1}-1\) candidates remain. In practice, instead of adding zeros to the beginning of each vector, Step 2 can be replaced by reversing the digit order and forming the K-tuple vectors by adding zeros to the end.
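As an illustration of Steps 1 through 3, the following is a minimal sketch in Python; the language choice and the function name integrating_vectors are ours, not the paper's.

```python
def integrating_vectors(K):
    """List candidate integrating vectors for K categorical dummy variables.

    Steps 1-3 of Sect. 3.1: take the binary representations of 1 .. 2**K - 1,
    pad them to K digits, and drop vectors whose elements sum to more than K/2
    (such vectors duplicate a complementary vector once a constant is included).
    """
    candidates = []
    for m in range(1, 2 ** K):                        # Step 1: decimals 1 .. 2^K - 1
        bits = [int(b) for b in format(m, f"0{K}b")]  # Step 2: K-digit binary vector
        if sum(bits) > K / 2:                         # Step 3: remove complements
            continue
        candidates.append(bits)
    return candidates


# For K = 4 this keeps, e.g., [0, 0, 0, 1], [0, 0, 1, 1], [0, 1, 1, 0], ...
print(integrating_vectors(4))
```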

3.2 Picking up all the Combinations of the Candidates for Explanatory Variables

After listing about \(2^{K-1}-1\) candidate integrating vectors, about \(2^{K-1}-1\) variables are available as candidate explanatory variables in the regressions. When a constant term is included, adopting more than K − 1 variables from the list of candidates leads to perfect collinearity, and such a model cannot be the most parsimonious one. To select the most parsimonious model by some criterion, we calculate the criterion by estimating the regression model with 1 through K − 1 of the candidate explanatory variables. At this step, adopting the same dummy variable twice or more within the integrated dummy variables should be avoided. To detect such cases when two or more integrating vectors are adopted, the row sums of the matrix formed by the adopted integrating (column) vectors are checked. If one of the row sums is greater than one, that combination of integrated dummy variables is not adopted as a set of explanatory variables, because some categorical dummy variable would be integrated twice. By conducting the regression estimation for the remaining combinations of integrated dummy variables as explanatory variables, the most parsimonious model from the viewpoint of a variable selection criterion is found. The above steps can be summarized as follows:
  • Step 1: Set a number for picking up vectors from the candidates of integrating vectors (M). We start with M equals 1.
  • Step 2: List all the combinations to pick up M vectors from the candidates of integrating vectors.
  • Step 3: Form a matrix by stacking the picked-up integrating (column) vectors and calculate each row sum. If one of the row sums is larger than one, remove this combination from the list of combinations of integrating vectors.
  • Step 4: Conduct regression estimation and calculate the variable selection criterion with explanatory variables formed by multiplying the matrix of the categorical dummy variables and combinations of integrating vectors.
Repeat Steps 2 through 4 while changing the number of picked-up vectors M from 1 to K − 1, and find the most parsimonious model from the viewpoint of a variable selection criterion. Of course, even at this step, some cases remain that represent the same data generating process with different combinations of dummy variables. In such cases, exactly the same value of the variable selection criterion is obtained, which means that the process is not completely efficient. In the actual variable selection process, when exactly the same criterion value occurs, it is not a problem to adopt the first combination of variables that attains the best value.
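The search over combinations can be sketched as follows, again in Python with NumPy; best_model_by_bic is a hypothetical name, and BIC is used as the criterion purely for illustration (any of the criteria in Sect. 5.1 could be plugged in).

```python
import itertools
import numpy as np

def best_model_by_bic(y, D, vectors):
    """Steps 1-4 of Sect. 3.2: search over combinations of integrating vectors.

    y: (N,) response; D: (N, K) matrix of categorical dummies;
    vectors: list of K-tuple integrating vectors from Sect. 3.1.
    """
    n, K = D.shape
    best_bic, best_combo = np.inf, None
    for M in range(1, K):                                    # Step 1: number of vectors picked up
        for combo in itertools.combinations(vectors, M):     # Step 2: all combinations of M vectors
            V = np.array(combo).T                            # K x M matrix of stacked column vectors
            if V.sum(axis=1).max() > 1:                      # Step 3: a dummy integrated twice, skip
                continue
            X = np.column_stack([np.ones(n), D @ V])         # Step 4: constant + integrated dummies
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            rss = np.sum((y - X @ beta) ** 2)
            bic = n * np.log(rss / n) + X.shape[1] * np.log(n)
            if bic < best_bic:
                best_bic, best_combo = bic, combo
    return best_bic, best_combo
```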

3.3 Skipping the Step of Picking up the Combinations of Explanatory Variables by Lasso

Lasso is a shrinkage method that, like ridge regression, imposes a penalty on the size of each coefficient, and it can also be used as a variable selection method. Standard lasso estimation minimizes:
$${\text{minimize }}\,{\text{RSS}} + \lambda \mathop \sum \limits_{i = 1}^{p} \left| {\beta_{i} } \right|$$
(13)
where RSS is the residual sum of squares, p is the number of estimated parameters, and \(\left|{\beta }_{i}\right|\) is the absolute value of the ith estimated regression coefficient. However, the lasso estimator may be inefficient and its variable selection inconsistent. To overcome these problems, Zou [17] proposed the adaptive lasso:
$${\text{minimize}}\,{\text{RSS}} + \lambda \mathop \sum \limits_{i = 1}^{p} w_{i} \left| {\beta_{i} } \right|$$
(14)
Minimizing this objective function, the true model is reached as the sample size tends to infinity; the oracle properties discussed by several researchers, e.g., Zou [17], include this consistency. In practice, using the coefficients \(\left|{\beta }_{i}^{*}\right|\) estimated by ridge regression, the weights are set to \(w_{i} = 1/\left| {\beta_{i}^{*} } \right|\), and cross validation is applied to select the optimal tuning parameter \(\uplambda\) in the adaptive lasso.
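One common way to implement this weighting is to rescale the columns of the design matrix by the ridge-based weights and run an ordinary cross-validated lasso. The sketch below assumes scikit-learn in Python, which the paper does not prescribe, and guards against exactly zero ridge coefficients.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

def adaptive_lasso(X, y):
    """Adaptive lasso via column rescaling.

    Dividing column i of X by w_i = 1/|beta_i*| and running a lasso on the
    rescaled matrix is equivalent to minimizing RSS + lambda * sum_i w_i |beta_i|.
    """
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(X, y)
    w = 1.0 / np.maximum(np.abs(ridge.coef_), 1e-8)   # adaptive weights from ridge estimates
    X_scaled = X / w                                  # rescale each column by 1/w_i
    lasso = LassoCV(cv=5).fit(X_scaled, y)            # lambda chosen by cross validation
    beta = lasso.coef_ / w                            # back-transform to the original scale
    return beta, np.flatnonzero(beta != 0)            # coefficients and selected columns
```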
Instead of the sample size and the number of parameters used by traditional information criteria, lasso and adaptive lasso penalize the absolute magnitudes of the estimated parameters. Additionally, Hebiri and Lederer [18] showed that correlations among explanatory variables affect the optimal tuning parameter in lasso estimation and that cross validation provides a suitable choice in many applications. However, while the original categorical dummy variables are mutually uncorrelated, the integrated dummy variables discussed above exhibit correlations. This change might affect the results of the variable selection.
In the simulation study in Sect. 5, we apply adaptive lasso estimation to the regression model with all the candidate explanatory variables. One of its most advantageous properties is that lasso estimation can be applied even when the number of explanatory variables is larger than the number of observations and when the set of explanatory variables is linearly dependent; the latter case implies that the variance–covariance matrix of the explanatory variables is degenerate. This property is advantageous because the picking-up process considered in Sect. 3.2 can be skipped: adaptive lasso estimation can simply be applied once with all the listed integrated categorical dummy variables as explanatory variables. Additionally, Huang et al. [6] and Ohishi et al. [7] proposed lasso estimation for categorical data using adjacency properties such as ordered categories; however, that method cannot be applied to unordered categorical variables.

4 Another Method that Utilizes the Approach for Ordered Categorical Variables

If the categories are ordered, the regression model with K ordered categorical dummy variables can first be written as follows:
$${y}_{i}=\sum_{k=1}^{\text{K}}{\beta }_{k}{Dum}_{ki}$$
(15)
where the constant term is removed to avoid perfect multicollinearity. Then, because the categories are ordered, new dummy variables are constructed that represent the accumulated effect of the jth and higher categories as follows:
$$ADum_{ji} = \sum\limits_{k = j}^{K} {Dum_{ki} , j = 1,2, \ldots ,K}$$
(16)
where \({ADum}_{1,i}\) becomes the variable for the constant term. Using these dummy variables and defining new parameters:
$$\delta_{k} = \beta_{k} - \beta_{k - 1} , k = 2,3, \ldots ,K$$
(17)
we can rewrite Eq. (15) with the accumulated dummy variables as follows:
$$y_{i} = \beta_{1} *ADum_{1i} + \sum\limits_{k = 2}^{K} {\left( {\beta_{k} - \beta_{k - 1} } \right)} *ADum_{ki} = \sum\limits_{k = 1}^{K} {\delta_{k} *ADum_{ki} }$$
(18)
where we set \({\delta }_{1}={\beta }_{1}\). Each \({\beta }_{j}\) can be estimated as follows:
$${\beta }_{j}=\sum_{k=1}^{j}{\delta }_{k}.$$
(19)
By applying the estimation method to the following equation:
$${y}_{i}=\sum_{k=1}^{\text{K}}{\delta }_{k}{ADum}_{ki}+{\varepsilon }_{i}\begin{array}{cc},& i=\text{1,2},3,\dots ,N\end{array}$$
(20)
the problems of model selection and of integrating some categories (dummy variables) can be solved by minimizing Akaike's Information Criterion (AIC) by Akaike [19, 20], Schwarz's Bayesian Information Criterion (BIC) by Schwarz [21], or other selection methods. This type of transformation was proposed by Tian et al. [22]. It can easily be applied to ordered categorical variables; for unordered categorical variables, however, some practical scheme is needed. First, we estimate the regression model (15) by OLS, with or without other explanatory variables. Then, using the estimated coefficients for the categorical dummy variables \(\widehat{{\beta }_{j}}, j=\text{1,2},\dots ,K\), we order the categorical dummy variables by their estimated coefficients in ascending or descending order. This allows us to rewrite Eq. (15) in the following form:
$${\widehat{y}}_{i}=\sum_{k=1}^{\text{K}}\widehat{{\beta }_{\left(k\right)}}{Dum}_{\left(k\right)i}$$
(21)
where \(\widehat{{\beta }_{\left(1\right)}}\le \widehat{{\beta }_{\left(2\right)}}\le \cdots \le \widehat{{\beta }_{\left(K\right)}}\) if we use ascending order. After this process, we considered these ordered categorical dummy variables: \({Dum}_{\left(k\right)i}, k=\text{1,2},\dots ,K\) as dummy variables for ordered categorical data and constructed new dummy variables:
$${\widehat{ADum}}_{ji}=\sum_{k=j}^{K}{Dum}_{\left(k\right)i}, j=\text{1,2},\dots ,K.$$
(22)
Using these dummy variables, we can apply information criteria or other variable selection procedures for the following model:
$$y_{i} = \sum\limits_{k = 1}^{{\text{K}}} {\delta_{k} \widehat{ADum}_{ki} + \varepsilon_{i} ,\,\,i = 1,2,3, \ldots ,N}$$
(23)
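A minimal sketch of this two-stage construction, in Python with NumPy and a hypothetical function name accumulated_dummies, is given below.

```python
import numpy as np

def accumulated_dummies(D, y):
    """Order the dummies by first-stage OLS coefficients and accumulate them.

    D: (N, K) matrix of categorical dummies. The accumulated dummy ADum_j is
    the sum of the dummies for the j-th ranked category and above (Eq. 22),
    so the first column is the constant-term vector.
    """
    beta_hat, *_ = np.linalg.lstsq(D, y, rcond=None)      # first-stage OLS without a constant
    order = np.argsort(beta_hat)                          # ascending order of estimated coefficients
    D_sorted = D[:, order]                                # Dum_(1), ..., Dum_(K)
    ADum = np.cumsum(D_sorted[:, ::-1], axis=1)[:, ::-1]  # ADum_j = sum_{k >= j} Dum_(k)
    return ADum, order
```

Variable selection criteria can then be applied to the regression of y on the columns of ADum, as in Eq. (23).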

5 Simulation Study

5.1 Information Criteria

Before describing our simulation study, we explain the variable selection criteria and other methods. As Burnham and Anderson [16] surveyed, there are several criteria. One of the most popular information criteria is AIC:
$${\text{AIC}} = {\text{n}}*\ln \left( {\frac{{{\text{RSS}}}}{{\text{n}}}} \right) + 2{\text{p}}$$
(24)
where RSS is the residual sum of squares, and n and p are the numbers of observations and estimated parameters, respectively, under maximum likelihood estimation of the Gaussian linear regression model. The model that minimizes AIC is selected. However, AIC does not recover the true data generating model even when the sample size becomes infinite; more details can be found in the work of Burnham and Anderson [16]. This property is called inconsistency. Another popular but consistent information criterion is BIC:
$${\text{BIC}} = {\text{n}}*\ln \left({\frac{{{\text{RSS}}}}{{\text{n}}}} \right) + {\text{p}}*\ln \left(n \right)$$
(25)
where RSS is the residual sum of squares and n and p are the number of observations and estimated parameters, respectively. The best model is selected by minimizing BIC. AIC and BIC are among the most popular information criteria and are implemented in several commands of software packages for regression analysis, e.g., R, STATA, and others.
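For concreteness, the two criteria in Eqs. (24) and (25) can be computed from an OLS fit as in the following sketch (Python with NumPy, our own illustration; constants of the Gaussian likelihood are omitted, as in the formulas above).

```python
import numpy as np

def aic_bic(y, X):
    """Compute AIC and BIC of Eqs. (24)-(25) for a Gaussian linear regression."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)          # residual sum of squares
    aic = n * np.log(rss / n) + 2 * p
    bic = n * np.log(rss / n) + p * np.log(n)
    return aic, bic
```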
Additionally, as mentioned in Sect. 3.3, we applied the adaptive lasso for selecting explanatory variables in this work. The comparative advantage of the adaptive lasso is explained in Sect. 3.3; implementations are available in R and other software packages.

5.2 Simulation Set-Ups for the Case of Five Categorical Dummy Variables

In this section, we describe the simulation set-ups when the categorical variable has five categories. We used five categorical dummy variables as explanatory variables and set the distribution of observations across the categories. We adopted two cases:
$$\begin{gathered} Case 1: \# \left\{ {i \in \left( {k{\text{th category}}} \right)} \right\} = \frac{{\text{N}}}{{\text{K}}} \hfill \\ Case 2: \# \left\{ {i \in \left( {k{\text{th category}}} \right)} \right\} = \frac{{\text{N}}}{{3{\text{K}}}}{\text{ if k is odd number}} \hfill \\ \,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\# \left\{ {i \in \left( {k{\text{th category}}} \right)} \right\} = \frac{{2{\text{N}}}}{{3{\text{K}}}}{\text{ if k is even number}} \hfill \\ \end{gathered}$$
(26)
where we assume N is chosen so that the numbers of observations per category are integers. To conduct the simulation study, we set K = 5 and N = 300 or 3000. The two cases were designed to examine how differences in the variance of the explanatory variables affect variable selection: in Case 1 the variances of the dummy variables are the same, whereas in Case 2 the variances differ between odd- and even-numbered categories.
Using the above settings, we first generated the dependent variable as follows:
$${y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}$$
(27)
where \({\varepsilon }_{i}\) is generated as a standard normal random variable. This case assumes that all the coefficients are mutually different, with equal spacing between adjacent coefficients. If the selection criterion chooses the true number of explanatory variables, the selected number becomes five including the constant term. Another data generating process was assumed as follows:
$${y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}.$$
(28)
In this case, if the selection criterion chooses the true number of explanatory variables, the selected number becomes three including the constant term. The other data generating process was assumed as follows:
$${y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}$$
(29)
In this case, if the selection criterion chooses the true number of explanatory variables, the selected number becomes two including the constant term. As for the settings for \({\alpha }_{1}\), we use \({\alpha }_{1}=0.5\) in Setting 1 and \({\alpha }_{1}=5.0\) in Setting 2. The two values were chosen to examine the impact of the relative size of the coefficients compared with the standard deviation of the error term: with \({\alpha }_{1}=0.5\) the coefficient is smaller than the standard deviation of the error, whereas with \({\alpha }_{1}=5.0\) the coefficient is more than twice the standard deviation of the error term.
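To make the design concrete, the following sketch generates one replication of these data generating processes (Python with NumPy). The exact split of observations in Case 2 is our own assumption, since Eq. (26) is stated for even K but the study uses K = 5.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(N=300, alpha1=0.5, case=1, dgp="eq27"):
    """Generate one replication of the designs in Sect. 5.2 (illustrative)."""
    K = 5
    if case == 1:
        sizes = np.full(K, N // K)                            # equal category sizes
    else:
        weights = np.array([1, 2, 1, 2, 1])                   # odd : even = 1 : 2 (assumption)
        sizes = (N * weights / weights.sum()).astype(int)
        sizes[-1] += N - sizes.sum()                          # make the sizes add up to N
    labels = np.repeat(np.arange(K), sizes)
    D = np.eye(K)[labels]                                     # N x 5 matrix of categorical dummies
    if dgp == "eq27":
        coef = alpha1 * np.arange(1, K + 1)                   # Eq. (27): beta_k = alpha1 * k
    elif dgp == "eq28":
        coef = alpha1 * np.array([1, 2, 2, 3, 3])             # Eq. (28): three distinct values
    else:
        coef = alpha1 * np.array([1, 1, 2, 2, 2])             # Eq. (29): two distinct values
    y = D @ coef + rng.standard_normal(D.shape[0])            # standard normal error term
    return y, D
```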

5.3 Simulation Results

We first show the results of the proposed method of Sect. 3 for the selected number of explanatory variables, including the constant term, when we set N = 300 (relatively small sample cases) for the six data generating set-ups with Case 1 dummy variables in Table 1. These results were obtained with 5000 replications. Table 2 presents the selected number of explanatory variables for the same set-ups with Case 2 dummy variables. Tables 3 and 4 report the same settings when we set N = 3000 (relatively large sample cases). These four tables report the results of variable selection by AIC and BIC and of the adaptive lasso applied with all the candidate explanatory variables; in this case, the number of candidates is 15 (\(2^{K-1}-1\) with K = 5), as in Sect. 3. The counts in the column corresponding to the true dimension indicate the cases in which the correct number of explanatory variables was selected. In these tables, we report only the selected number of explanatory variables in order to investigate the dimension consistency of the variable selection procedures, as Burnham and Anderson [16] pointed out.
Table 1
Simulation results of Case 1 when N = 300 with 5000 replications. Each cell gives the number of replications in which the indicated number of explanatory variables (including the constant term) was selected; the AIC and BIC procedures consider at most K − 1 = 4 integrated dummy variables plus the constant term, so only columns 1–5 apply to them.

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | | | | | | | | | | | | | | |
| AIC | 0 | 0 | 120 | 1578 | 3302 | | | | | | | | | |
| BIC | 0 | 4 | 1778 | 2802 | 416 | | | | | | | | | |
| Adaptive lasso | 0 | 0 | 4 | 141 | 4855 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | | | | | | | | | | | | | | |
| AIC | 0 | 0 | 0 | 0 | 5000 | | | | | | | | | |
| BIC | 0 | 0 | 0 | 0 | 5000 | | | | | | | | | |
| Adaptive lasso | 0 | 0 | 0 | 0 | 5000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| DGP: \({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | | | | | | | | | | | | | | |
| AIC | 0 | 160 | 3867 | 949 | 24 | | | | | | | | | |
| BIC | 0 | 1538 | 3448 | 14 | 0 | | | | | | | | | |
| Adaptive lasso | 0 | 59 | 642 | 1845 | 2454 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| DGP: \({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | | | | | | | | | | | | | | |
| AIC | 0 | 0 | 3556 | 1324 | 120 | | | | | | | | | |
| BIC | 0 | 0 | 4824 | 174 | 2 | | | | | | | | | |
| Adaptive lasso | 0 | 0 | 818 | 1999 | 2183 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | | | | | | | | | | | | | | |
| AIC | 0 | 2999 | 1916 | 85 | 0 | | | | | | | | | |
| BIC | 0 | 4851 | 149 | 0 | 0 | | | | | | | | | |
| Adaptive lasso | 21 | 1096 | 1728 | 1262 | 889 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | | | | | | | | | | | | | | |
| AIC | 0 | 2762 | 1991 | 245 | 2 | | | | | | | | | |
| BIC | 0 | 4697 | 298 | 5 | 0 | | | | | | | | | |
| Adaptive lasso | 0 | 2011 | 1465 | 813 | 711 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Table 2
Simulation results of Case 2 when N = 300 with 5000 replications. Each cell gives the number of replications in which the indicated number of explanatory variables (including the constant term) was selected; only columns 1–5 apply to AIC and BIC.

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | | | | | | | | | | | | | | |
| AIC | 0 | 28 | 1568 | 3140 | 264 | | | | | | | | | |
| BIC | 0 | 581 | 3265 | 1150 | 4 | | | | | | | | | |
| Adaptive lasso | 0 | 7 | 59 | 426 | 1729 | 2082 | 616 | 71 | 10 | 0 | 0 | 0 | 0 | 0 |
| DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | | | | | | | | | | | | | | |
| AIC | 0 | 0 | 0 | 0 | 5000 | | | | | | | | | |
| BIC | 0 | 0 | 0 | 0 | 5000 | | | | | | | | | |
| Adaptive lasso | 0 | 0 | 0 | 179 | 1952 | 2869 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| DGP: \({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | | | | | | | | | | | | | | |
| AIC | 0 | 1217 | 3043 | 726 | 14 | | | | | | | | | |
| BIC | 0 | 3598 | 1381 | 21 | 0 | | | | | | | | | |
| Adaptive lasso | 34 | 210 | 752 | 1569 | 1509 | 707 | 101 | 47 | 22 | 30 | 17 | 2 | 0 | 0 |
| DGP: \({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | | | | | | | | | | | | | | |
| AIC | 0 | 0 | 2910 | 1809 | 281 | | | | | | | | | |
| BIC | 0 | 0 | 4738 | 257 | 5 | | | | | | | | | |
| Adaptive lasso | 0 | 0 | 384 | 1828 | 1731 | 643 | 360 | 41 | 13 | 0 | 0 | 0 | 0 | 0 |
| DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | | | | | | | | | | | | | | |
| AIC | 0 | 2718 | 2008 | 269 | 5 | | | | | | | | | |
| BIC | 0 | 4690 | 309 | 1 | 0 | | | | | | | | | |
| Adaptive lasso | 45 | 436 | 1219 | 1589 | 1172 | 474 | 35 | 6 | 10 | 8 | 5 | 1 | 0 | 0 |
| DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | | | | | | | | | | | | | | |
| AIC | 0 | 2282 | 2172 | 508 | 38 | | | | | | | | | |
| BIC | 0 | 4565 | 426 | 9 | 0 | | | | | | | | | |
| Adaptive lasso | 0 | 832 | 1855 | 1191 | 625 | 417 | 74 | 6 | 0 | 0 | 0 | 0 | 0 | 0 |
Table 3
Simulation results of Case 1 when N = 3000 with 5000 replications. Each cell gives the number of replications in which the indicated number of explanatory variables (including the constant term) was selected; only columns 1–5 apply to AIC and BIC.

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | | | | | | | | | | | | | | |
| AIC | 0 | 0 | 0 | 0 | 5000 | | | | | | | | | |
| BIC | 0 | 0 | 0 | 0 | 5000 | | | | | | | | | |
| Adaptive lasso | 0 | 0 | 0 | 0 | 5000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | | | | | | | | | | | | | | |
| AIC | 0 | 0 | 0 | 0 | 5000 | | | | | | | | | |
| BIC | 0 | 0 | 0 | 0 | 5000 | | | | | | | | | |
| Adaptive lasso | 0 | 0 | 0 | 0 | 5000 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| DGP: \({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | | | | | | | | | | | | | | |
| AIC | 0 | 0 | 3584 | 1303 | 113 | | | | | | | | | |
| BIC | 0 | 0 | 4957 | 43 | 0 | | | | | | | | | |
| Adaptive lasso | 0 | 0 | 799 | 1960 | 2241 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| DGP: \({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | | | | | | | | | | | | | | |
| AIC | 0 | 0 | 3584 | 1303 | 113 | | | | | | | | | |
| BIC | 0 | 0 | 4957 | 43 | 0 | | | | | | | | | |
| Adaptive lasso | 0 | 0 | 1481 | 2545 | 974 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | | | | | | | | | | | | | | |
| AIC | 0 | 2779 | 1980 | 240 | 1 | | | | | | | | | |
| BIC | 0 | 4921 | 79 | 0 | 0 | | | | | | | | | |
| Adaptive lasso | 0 | 1713 | 1566 | 951 | 770 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | | | | | | | | | | | | | | |
| AIC | 0 | 2779 | 1980 | 240 | 1 | | | | | | | | | |
| BIC | 0 | 4921 | 79 | 0 | 0 | | | | | | | | | |
| Adaptive lasso | 0 | 2271 | 1501 | 821 | 407 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Table 4
Simulation results of Case 2 when N = 3000 with 5000 replications. Each cell gives the number of replications in which the indicated number of explanatory variables (including the constant term) was selected; only columns 1–5 apply to AIC and BIC.

| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | | | | | | | | | | | | | | |
| AIC | 0 | 0 | 1 | 243 | 4756 | | | | | | | | | |
| BIC | 0 | 0 | 354 | 2917 | 1729 | | | | | | | | | |
| Adaptive lasso | 0 | 0 | 0 | 150 | 1688 | 2938 | 146 | 78 | 0 | 0 | 0 | 0 | 0 | 0 |
| DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | | | | | | | | | | | | | | |
| AIC | 0 | 0 | 0 | 0 | 5000 | | | | | | | | | |
| BIC | 0 | 0 | 0 | 0 | 5000 | | | | | | | | | |
| Adaptive lasso | 0 | 0 | 0 | 695 | 3375 | 930 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| DGP: \({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | | | | | | | | | | | | | | |
| AIC | 0 | 0 | 3119 | 1735 | 146 | | | | | | | | | |
| BIC | 0 | 121 | 4830 | 49 | 0 | | | | | | | | | |
| Adaptive lasso | 0 | 0 | 321 | 1665 | 1702 | 778 | 180 | 97 | 149 | 76 | 27 | 5 | 0 | 0 |
| DGP: \({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | | | | | | | | | | | | | | |
| AIC | 0 | 0 | 2913 | 1809 | 278 | | | | | | | | | |
| BIC | 0 | 0 | 4922 | 78 | 0 | | | | | | | | | |
| Adaptive lasso | 0 | 0 | 452 | 2193 | 1816 | 434 | 93 | 12 | 0 | 0 | 0 | 0 | 0 | 0 |
| DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | | | | | | | | | | | | | | |
| AIC | 0 | 2283 | 2252 | 449 | 16 | | | | | | | | | |
| BIC | 0 | 4875 | 125 | 0 | 0 | | | | | | | | | |
| Adaptive lasso | 0 | 584 | 1441 | 1453 | 979 | 314 | 72 | 66 | 63 | 24 | 2 | 2 | 0 | 0 |
| DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | | | | | | | | | | | | | | |
| AIC | 0 | 2266 | 2171 | 526 | 37 | | | | | | | | | |
| BIC | 0 | 4875 | 125 | 0 | 0 | | | | | | | | | |
| Adaptive lasso | 0 | 1004 | 2043 | 1076 | 573 | 257 | 44 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
As regards the selected number of variables, the results from Tables 1, 2, 3, 4 can be summarized as follows.
BIC and AIC tend to impose stricter penalties compared with adaptive lasso when the magnitudes of all coefficients differ and the differences are relatively small, leading to the selection of models smaller than the true dimension. This tendency persists even when the sample size is 3000.
In cases where category integration is necessary, the results of variable selection by adaptive lasso tend to favor models larger than the true dimension, with a lower probability of selecting the true dimension compared with results based on AIC or BIC.
In Case 2, where the variances of explanatory variables differ, the results of variable selection by adaptive lasso tend to choose models larger than the true dimension, exhibiting a tendency to select more variables than the original number of categories, which is 5.
In cases where category integration is necessary, if the sample size is small and the differences in coefficients between variables are small, AIC has a higher probability of selecting the true dimension than BIC. However, when the sample size is large or the differences in coefficients between variables are large, BIC has a higher probability of selecting the true dimension.
In the case of a sample size of 3000, regardless of the case or model, BIC has a higher probability of selecting the true dimension compared with other criteria.
As for the method utilizing the nature of ordered categorical dummy variables proposed in Sect. 4, Table 5 (N = 300) and Table 6 (N = 3000) report the simulation results for the same set-ups as Tables 1, 2, 3, 4. The results of Tables 5 and 6, compared with Tables 1, 2, 3, 4, can be summarized as follows.
Table 5
Simulation results for the selected number of aggregated dummy variables (including the constant term) for Case 1 and Case 2 when N = 300 with 5000 replications.

| | Case 1: 1 | 2 | 3 | 4 | 5 | Case 2: 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | | | | | | | | | | |
| AIC | 0 | 0 | 120 | 1578 | 3302 | 0 | 328 | 2962 | 1705 | 5 |
| BIC | 0 | 4 | 1772 | 2808 | 416 | 6 | 547 | 4065 | 382 | 0 |
| Adaptive lasso | 0 | 1 | 71 | 728 | 4200 | 1 | 102 | 1423 | 2455 | 1019 |
| DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | | | | | | | | | | |
| AIC | 0 | 0 | 0 | 0 | 5000 | 0 | 0 | 0 | 0 | 5000 |
| BIC | 0 | 0 | 0 | 0 | 5000 | 0 | 0 | 0 | 0 | 5000 |
| Adaptive lasso | 0 | 0 | 0 | 0 | 5000 | 0 | 0 | 0 | 0 | 5000 |
| DGP: \({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | | | | | | | | | | |
| AIC | 0 | 159 | 3844 | 973 | 24 | 30 | 1120 | 3330 | 501 | 19 |
| BIC | 0 | 1476 | 3500 | 24 | 0 | 596 | 2196 | 2174 | 33 | 1 |
| Adaptive lasso | 0 | 170 | 1820 | 2011 | 999 | 56 | 836 | 1659 | 1839 | 610 |
| DGP: \({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | | | | | | | | | | |
| AIC | 0 | 0 | 3556 | 1324 | 120 | 0 | 0 | 4762 | 234 | 4 |
| BIC | 0 | 0 | 4824 | 174 | 2 | 0 | 0 | 4999 | 1 | 0 |
| Adaptive lasso | 0 | 0 | 4813 | 145 | 42 | 0 | 0 | 3400 | 1227 | 373 |
| DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | | | | | | | | | | |
| AIC | 7 | 2997 | 1762 | 225 | 9 | 52 | 1619 | 2763 | 530 | 36 |
| BIC | 122 | 4680 | 189 | 9 | 0 | 634 | 2481 | 1836 | 48 | 1 |
| Adaptive lasso | 24 | 1918 | 1796 | 926 | 336 | 72 | 998 | 1761 | 1496 | 673 |
| DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | | | | | | | | | | |
| AIC | 0 | 2839 | 1825 | 324 | 12 | 0 | 1256 | 3294 | 435 | 15 |
| BIC | 0 | 4721 | 265 | 14 | 0 | 0 | 2513 | 2457 | 30 | 0 |
| Adaptive lasso | 0 | 4680 | 216 | 73 | 31 | 0 | 1527 | 1991 | 698 | 784 |
Table 6
Simulation results for the selected number of aggregated dummy variables (including the constant term) for Case 1 and Case 2 when N = 3000 with 5000 replications.

| | Case 1: 1 | 2 | 3 | 4 | 5 | Case 2: 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | | | | | | | | | | |
| AIC | 0 | 0 | 0 | 0 | 5000 | 0 | 0 | 3 | 331 | 4666 |
| BIC | 0 | 0 | 0 | 0 | 5000 | 0 | 0 | 767 | 2819 | 1414 |
| Adaptive lasso | 0 | 0 | 0 | 0 | 5000 | 0 | 0 | 3 | 178 | 4819 |
| DGP: \({y}_{i}=\sum_{k=1}^{5}\left({\alpha }_{1}k\right)*{Dum}_{ki}+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | | | | | | | | | | |
| AIC | 0 | 0 | 0 | 0 | 5000 | 0 | 0 | 0 | 0 | 5000 |
| BIC | 0 | 0 | 0 | 0 | 5000 | 0 | 0 | 0 | 0 | 5000 |
| Adaptive lasso | 0 | 0 | 0 | 0 | 5000 | 0 | 0 | 0 | 0 | 5000 |
| DGP: \({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | | | | | | | | | | |
| AIC | 0 | 0 | 3584 | 1303 | 113 | 0 | 2 | 3729 | 1210 | 59 |
| BIC | 0 | 0 | 4957 | 43 | 0 | 0 | 5 | 4965 | 30 | 0 |
| Adaptive lasso | 0 | 0 | 3579 | 1091 | 330 | 0 | 2 | 2625 | 1818 | 555 |
| DGP: \({y}_{i}={\alpha }_{1}*{Dum}_{1i}+2{\alpha }_{1}*\left({Dum}_{2i}+{Dum}_{3i}\right)+3{\alpha }_{1}*\left({Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | | | | | | | | | | |
| AIC | 0 | 0 | 3584 | 1303 | 113 | 0 | 0 | 4773 | 226 | 1 |
| BIC | 0 | 0 | 4957 | 43 | 0 | 0 | 0 | 5000 | 0 | 0 |
| Adaptive lasso | 0 | 0 | 5000 | 0 | 0 | 0 | 0 | 4792 | 192 | 16 |
| DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=0.5\) | | | | | | | | | | |
| AIC | 0 | 2872 | 1810 | 295 | 23 | 0 | 1897 | 2388 | 676 | 39 |
| BIC | 0 | 4925 | 73 | 2 | 0 | 0 | 2400 | 2584 | 16 | 0 |
| Adaptive lasso | 0 | 3245 | 1141 | 468 | 146 | 0 | 1063 | 2278 | 1302 | 357 |
| DGP: \({y}_{i}={\alpha }_{1}*\left({Dum}_{1i}+{Dum}_{2i}\right)+2{\alpha }_{1}*\left({Dum}_{3i}+{Dum}_{4i}+{Dum}_{5i}\right)+{\varepsilon }_{i}\) and \({\alpha }_{1}=5.0\) | | | | | | | | | | |
| AIC | 0 | 2872 | 1810 | 295 | 23 | 0 | 0 | 4492 | 491 | 17 |
| BIC | 0 | 4925 | 73 | 2 | 0 | 0 | 0 | 4997 | 3 | 0 |
| Adaptive lasso | 0 | 5000 | 0 | 0 | 0 | 0 | 2382 | 2480 | 86 | 52 |
When utilizing the nature of ordered categorical dummy variables, BIC and AIC tend to impose stricter penalties compared with adaptive lasso when the magnitudes of all coefficients differ, and the differences are relatively small. This tendency persists even when the sample size is 3,000. However, the results are not significantly different from the results of Tables 1 and 2.
From these results, we suggest that, except for cases where the magnitudes of all coefficients differ and their differences are relatively small, BIC performs better in selecting the true dimension of the model compared with the results of AIC and adaptive lasso.
When utilizing the nature of ordered categorical dummy variables, in Case 1, the probabilities of selecting the true dimension were not significantly different when comparing the results of Table 5 with those of Table 1, or the results of Table 6 with those of Table 3. These results imply that BIC again attains the highest probability of selecting the true dimension, as in Tables 1, 2, 3, 4, and suggest the practicality of utilizing the property of the ordered categorical dummy variables.
When utilizing the nature of ordered categorical dummy variables, in Case 2, the probabilities of selecting the true dimension in Tables 5 and 6 were lower than the corresponding results in Tables 2 and 4. This suggests that when the variances of the variables differ, the ordering of the dummy variables by least squares in the first stage may not work well, which makes selecting the true dimension by AIC or BIC difficult.
The findings suggest that utilizing the properties of ordered categorical dummy variables for variable selection and integration is practical in cases where the number of observations per category is roughly the same. However, this method is not practical when the number of observations per category varies significantly.

6 Discussion

In this work, we proposed a new variable selection procedure for categorical dummy variables in regression analysis that includes the integration of some categories.
Picking up all the candidate explanatory variables and their possible combinations, we searched for the optimal model by AIC and BIC. In this procedure, BIC performed relatively well, but the search takes too much time in practice. Another procedure, which skips choosing possible combinations of the explanatory variables and searches for an optimal model among all the candidate explanatory variables by lasso, did not perform well. In some cases, the procedure that utilizes the property of the ordered categorical dummy variables showed performance similar to the procedure that searches over the optimal combinations; this result depends on the first step, which orders the categorical dummy variables by their estimated coefficients. Additionally, we checked the performance of adaptive lasso estimation. In most cases, from the dimension-consistency point of view, adaptive lasso estimation did not perform better than the BIC-minimizing procedure.
Finally, from a practical point of view, the proposed method of searching for the minimum-AIC or minimum-BIC model required considerable time. When there are more than about 10 categories, or several types of categorical variables enter the regression as explanatory variables, finding the optimal model takes too long. Future work should enumerate the combinations of candidate variables more efficiently or adopt a model selection method that imposes a stronger penalty within an adaptive-lasso-like framework. Additionally, this paper focuses only on the dimension consistency of the variable selection procedures. There remains the possibility of converting the selected dummy variables into a set of dummy variables that is easier to interpret, but this depends closely on the kind of category to which the method is applied and is considered a case-by-case problem.

Acknowledgements

This work is partially supported by the Japan Society for the Promotion of Science (JSPS), Grant-in-Aid for Scientific Research (B) 23H00806.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

References
1. Walter SD, Feinstein AR, Wells CK (1987) Coding ordinal independent variables in multiple regression analyses. Am J Epidemiol 125(2):319–323
2. Alkharusi H (2012) Categorical variables in regression analysis: a comparison of dummy and effect coding. Int J Edu 4(2):202
3. Venkataramana M, Subbarayudu M, Rajani M, Sreenivasulu KN (2016) Regression analysis with categorical variables. Int J Stat Sys 11:135–143
4. Anderson JA (1984) Regression and ordered categorical variables. J R Stat Soc B 46(1):1–22
5. Gertheiss J, Tutz G (2009) Penalized regression with ordinal predictors. Int Stat Rev 77:345–365
6. Huang L, Hang W, Chao Y (2020) High-dimensional regression with ordered multiple categorical predictors. Stat Med 39:294–309
7. Ohishi M, Okamura K, Itoh Y, Yanagihara H (2021) Optimizations for categorizations of explanatory variables in linear regression via generalized fused lasso. In: Intelligent decision technologies: proceedings of the 13th KES-IDT 2021 conference. Springer, Singapore, pp 457–467
8. Fukushige M (2024) Variable selection for ordered categorical data in regression analysis: information criteria vs. lasso. Res Stat 2(1):2382484
9. Andersen EB (1994) The statistical analysis of categorical data, 3rd edn. Springer, New York
10. Andersen EB (1997) Introduction to the statistical analysis of categorical data. Springer, New York
11.
12. Yan X, Su X (2009) Linear regression analysis: theory and computing. World Scientific, Singapore
13. Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc B 68:49–67
14. Detmer FJ, Cebral J, Slawski M (2020) A note on coding and standardization of categorical variables in (sparse) group lasso regression. J Stat Plan Inference 206:1–11
15. Wang H, Leng C (2008) A note on adaptive group lasso. Comput Stat Data Anal 52:5277–5286
16. Burnham K, Anderson D (2002) Model selection and multi-model inference: a practical information-theoretic approach, 2nd edn. Springer, New York
17. Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101:1418–1429
18. Hebiri M, Lederer J (2012) How correlations influence lasso prediction. IEEE Trans Inf Theory 59:1846–1854
19. Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Petrov BN, Csaki F (eds) Proceedings of the 2nd international symposium on information theory. Akademiai Kiado, Budapest, pp 267–281
20. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19:716–723
21. Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
22. Tian H, Huang L, Cheng CY, Zhang L (2018) Regression models with ordered multiple categorical predictors. J Stat Comput Simul 88(16):3164–3178