
Open Access 01-03-2015

Relative efficiency appraisal of discrete choice modeling algorithms using small-scale maximum likelihood estimator through empirically tailored computing environment

Authors: Hyuk-Jae Roh, Prasanta K. Sahu, Ata M. Khan, Satish Sharma

Published in: Railway Engineering Science | Issue 1/2015


Abstract

Discrete choice models are widely used in sectors such as transportation, health, energy, and marketing, where model estimation is usually carried out using commercial software. Nonetheless, tailored computer codes offer modellers greater flexibility and control over unique modelling situations. Aligned with an empirically tailored computing environment, this research discusses the relative performance of six different algorithms for estimating a discrete choice model using three key performance measures: convergence time, number of iterations, and iteration time. The computer codes are developed in Visual Basic for Applications (VBA). The maximum likelihood function (MLF) is formulated, and the mathematical relationships for the gradient and Hessian matrix are derived analytically to carry out the estimation process. The estimated parameter values clearly suggest that the convergence criterion and the initial guess of the parameters are the two critical factors determining the overall estimation performance of a custom-built discrete choice model.

1 Introduction

It is common practice to carry out mathematical modeling for transportation planning, design, and operation using readily available commercial software packages. Generally, the estimation processes are embedded in the software when the package is developed. This constrains users, who lack information on the underlying analytical procedure for model parameter estimation; the same applies to discrete choice model parameter estimation. There has been a general consensus that transportation modellers do not need to put in the extra effort to develop tailored computer codes dedicated solely to their unique modelling needs. This may work in most cases; however, it may fail for very specific ones. The lack of flexibility in the computing environment can be exemplified by a situation where the researcher chooses a ‘non-linear-in-parameters’ specification: commercial packages cannot handle such a specification, and the researcher might be forced to write a specific computer code dedicated only to that work [1, 2].
Advances in statistical computing languages are readily available and can be applied systematically as a modeling base for all estimation procedures. A number of computing languages suitable for model estimation have improved significantly over the past years, from a time when reliance was almost exclusively on commercial software. As statistical computing languages have evolved, more sophisticated modeling procedures can now be tailored by researchers to the level of the transport modeling analysis. To relax the limitations of packaged software, researchers are strongly encouraged to use their own modeling computer codes [3]. With a clear understanding of the analytical procedures that take place repeatedly during estimation, researchers can use different estimation algorithms and empirically modify or improve portions of the estimation process. Eventually, with diverse numerical experiments, this repetitive task will help in proposing new, improved estimation procedures based on chosen performance criteria.
Additionally, transportation researchers have not paid much attention to going beyond the limitations of commercial packages by carrying out entire modelling procedures in tailored computing environments [4]. This purpose is pursued in this research with a small-scale discrete choice model, namely the binary logit model. The main objective of this research is to implement all analytical procedures involved in discrete choice modelling in Visual Basic for Applications (VBA) and to present the results of numerical experiments conducted solely through such tailored modeling. The other objectives are (a) to identify the factors affecting model estimation performance and (b) to suggest potential research topics related to custom-built modeling.

2 Derivation of multinomial logit

This section presents the binary logit choice function derived from the multinomial logit (MNL) choice function; the binary logit model is the target choice function for experimentation in this research. The derivation procedure adopted here helps in understanding not only the model’s limitations but also the conceptual background on which the model is built. Consequently, researchers can differentiate MNL from other types of discrete choice models with clarity and can attempt a variety of experimental model estimation trials. Following Train [3] and Ben-Akiva and Lerman [5], the algebraic manipulation leading to the final logit choice function is described here in detail. The MNL model, a prototype of discrete choice models, is formulated on the random utility concept. The probability that an alternative in C n is chosen by the decision maker can be formulated as in Eq. (1), where C n is the actual choice set that decision maker n considers in the given choice situation.
$$ P_{n} \left( i \right) = P(U_{ni} > U_{nj} ,\quad\forall j \in C_{n} ,i \ne j), $$
(1)
where P n (i) is the probability that decision maker n chooses alternative i; U ni and U nj are the utilities of alternatives i and j for decision maker n; and ∀j ∊ C n , i ≠ j indicates that alternative i is compared with every other alternative in the choice set. By assumption, the utility can be divided into systematic and random parts (Eq. 2).
$$ P_{n} \left( i \right) = P\left( {S_{ni} + R_{ni} > S_{nj} + R_{nj} ,\forall j \in C_{n} , i \ne j} \right), $$
(2)
$$ = P\left( {R_{nj} < S_{ni} - S_{nj} + R_{ni} ,\forall j \in C_{n} , i \ne j} \right), $$
(3)
where S ni is the systematic (deterministic, constant) part of the representative utility of alternative i faced by decision maker n, which is observable to the researcher, and R ni is the random (error) part, which is unobservable to the researcher. Various discrete choice models are derived from Eq. (3), depending on the distribution (e.g., Gumbel Type I or Weibull) that the random components (R ni , R nj ) are assumed to follow. Using the Gumbel Type I (or Weibull) density function, the density of each random part is shown in Eq. (4) and its cumulative distribution in Eq. (5).
$$ f\left( {R_{na} } \right) = e^{{ - R_{na} }} e^{{ - e^{{ - R_{na} }} }} , $$
(4)
$$ F\left( {R_{na} } \right) = e^{{ - e^{{ - R_{na} }} }} . $$
(5)
In Eq. (3), if R ni is taken as given, the choice probability P n (i) is the cumulative distribution of Eq. (5) evaluated at S ni − S nj  + R ni , as shown in Eq. (6).
$$ F\left( {R_{nj} } \right) = e^{{ - e^{{ - (S_{ni} - S_{nj} + R_{ni} )}} }} , $$
(6)
where R nj is distributed independently across all alternatives j ≠ i. Hence, Eq. (3) implies that the conditional choice probability is the product of the individual cumulative distribution values obtained from Eq. (6); Eq. (7) presents this product.
$$ P_{n} \left( i \right)|R_{ni} = \prod\nolimits_{{_{j \ne i} }} {e^{{ - e^{{ - (S_{ni} - S_{nj} + R_{ni} )}} }} } . $$
(7)
As mentioned earlier, R ni is unknown and is assumed to follow the same distribution. Integrating P n (i)|R ni over the density of R ni determines P n (i), as shown in Eq. (8).
$$ P_{n} \left( i \right) = \mathop \int \nolimits \left( {\prod\nolimits_{j \ne i} {e^{{ - e^{{ - \left( {S_{ni} - S_{nj} + R_{ni} } \right)}} }} } } \right)(e^{{ - R_{ni} }} e^{{ - e^{{ - R_{ni} }} }} ) {\text d}R_{ni} . $$
(8)
Some further algebraic manipulation is needed to reach the final form of the logit choice function. The remaining steps are given below:
$$ P_{n} \left( i \right) = \int_{{R_{ni = - \infty } }}^{\infty } {\left( {\prod\nolimits_{j \ne i} {e^{{ - e^{{ - \left( {S_{ni} - S_{nj} + R_{ni} } \right)}} }} } } \right)\left( {e^{{ - R_{ni} }} e^{{ - e^{{ - R_{ni} }} }} } \right) {\text d}R_{ni} } , $$
(9)
$$ \,\,\,= \int_{{R_{ni = - \infty } }}^{\infty } {\left( {\prod\nolimits_{j \ne i} {e^{{ - e^{{ - \left( {S_{ni} - S_{nj} + R_{ni} } \right)}} }} } } \right)\left( {e^{{ - R_{ni} }} e^{{ - e^{{ - (S_{ni} - S_{ni} + R_{ni} )}} }} } \right) {\text d}R_{ni} } , $$
(10)
$$ \,\,\,= \int_{{R_{ni = - \infty } }}^{\infty } {\left( {\prod\nolimits_{j} {e^{{ - e^{{ - \left( {S_{ni} - S_{nj} + R_{ni} } \right)}} }} } } \right)\left( {e^{{ - R_{ni} }} } \right) {\text d}R_{ni} } , $$
(11)
$$ = \int_{{R_{ni = - \infty } }}^{\infty } {\left( {e^{{ - e^{{ - \left( {S_{ni} - S_{n1} + R_{ni} } \right)}} }} e^{{ - e^{{ - \left( {S_{ni} - S_{n2} + R_{ni} } \right)}} }} \cdots e^{{ - e^{{ - \left( {S_{ni} - S_{ni} + R_{ni} } \right)}} }} } \right)\left( {e^{{ - R_{ni} }} } \right) {\text d}R_{ni} } , $$
(12)
$$ = \int_{{R_{ni = - \infty } }}^{\infty } {\left( {e^{{ - e^{{ - \left( {S_{ni} - S_{n1} + R_{ni} } \right)}} - e^{{ - \left( {S_{ni} - S_{n2} + R_{ni} } \right)}} \cdots - e^{{ - \left( {S_{ni} - S_{ni} + R_{ni} } \right)}} }} } \right)\left( {e^{{ - R_{ni} }} } \right) {\text d}R_{ni} .} $$
(13)
Equation (13) can be simplified as follows:
$$ P_{n} \left( i \right) = \int_{{R_{ni = - \infty } }}^{\infty } {\left( {e^{{ - \sum\nolimits_{j} {e^{{ - \left( {S_{ni} - S_{nj} + R_{ni} } \right)}} } }} } \right)\left( {e^{{ - R_{ni} }} } \right) {\text d}R_{ni} } , $$
(14)
$$\,\,\, = \int_{{R_{ni = - \infty } }}^{\infty } {\left( {e^{{ - e^{{ - R_{ni} }} \sum\nolimits_{j} {e^{{ - \left( {S_{ni} - S_{nj} } \right)}} } }} } \right)\left( {e^{{ - R_{ni} }} } \right) {\text d}R_{ni} } . $$
(15)
Let \( e^{{ - R_{ni} }} = Z \); differentiating with respect to R ni gives \( - e^{{ - R_{ni} }} \text{d}R_{ni} = \text{d}Z \). Note that if R ni  → ∞ then Z → 0, and if R ni  → −∞ then Z → ∞. Substituting these terms into Eq. (15), with the sign change absorbed by reversing the limits of integration, the final logit choice function is obtained as shown in Eqs. (16)–(18).
$$ P_{n} \left( i \right) = \int_{Z = 0}^{\infty } {e^{{ - Z\sum\nolimits_{j} {e^{{ - \left( {S_{ni} - S_{nj} } \right)}} } }} {\text d}Z} , $$
(16)
$$\,\,\, = \left[ {\frac{{e^{{ - Z\mathop \sum \nolimits_{j}^{{}} e^{{ - \left( {S_{ni} - S_{nj} } \right)}} }} }}{{ - \mathop \sum \nolimits_{j}^{{}} e^{{ - \left( {S_{ni} - S_{nj} } \right)}} }}} \right]_{0}^{\infty } , $$
(17)
$$\,\,\, = \frac{{e^{{S_{ni} }} }}{{\mathop \sum \nolimits_{j}^{{}} e^{{S_{nj} }} }}. $$
(18)
Evaluating Eq. (17) at its limits, the upper limit contributes zero and the lower limit contributes \( 1/\sum\nolimits_{j} {e^{{ - (S_{ni} - S_{nj} )}} } \), which reduces to Eq. (18). The final form of the probabilistic MNL model consists of two parts: the denominator represents the total utility of all alternatives considered in the choice set, and the numerator gives the utility of the particular alternative. The binary logit model, used as the numerical experimental base in this research, is obtained from Eq. (18) by assuming that only two alternatives are available in the mode choice market. Taking these two alternatives to be auto and transit, Eq. (18) simplifies into the binary logit choice formulation shown in Eq. (19), where S na and S nt are the systematic parts of the representative utilities of auto and transit, respectively.
$$ P_{n} \left( {{\text{auto or transit}}} \right) = \frac{{e^{{S_{{n{\text{a}}}} }} \;({\text{or}}\;e^{{S_{{n{\text{t}}}} }} )}}{{e^{{S_{{n{\text{a}}}} }} + e^{{S_{{n{\text{t}}}} }} }} . $$
(19)
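As an illustration of Eq. (19), the binary choice probabilities translate almost directly into code. The following VBA functions are a minimal sketch, not the authors' original implementation; the names are illustrative.

' Binary logit choice probabilities of Eq. (19).
' sA and sT are the systematic utilities of auto and transit.
Function BinaryLogitPAuto(sA As Double, sT As Double) As Double
    BinaryLogitPAuto = Exp(sA) / (Exp(sA) + Exp(sT))
End Function

Function BinaryLogitPTransit(sA As Double, sT As Double) As Double
    ' Equivalently 1 - BinaryLogitPAuto(sA, sT)
    BinaryLogitPTransit = Exp(sT) / (Exp(sA) + Exp(sT))
End Function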

3 Methodology

3.1 Data, utility, likelihood, and log-likelihood function

Numerical experiments in this research are carried out using a simple database that was successfully applied by Ben-Akiva and Lerman [5] for parameter estimation of a given model specification. The reasoning is that a proven database provides assurance that the experimental model runs will converge. Additionally, the parameter estimates generated through the experimental runs can be compared with the true values, so it can be concluded that all runs are conducted on the right estimation track. Plugging the data into the model specification within a tailored computing environment allows researchers to keep track of computational effort, such as the log-likelihood values generated for each observation until model convergence is achieved.
The model specification exemplified in Ref. [5] is adopted in this research so that models can be compared under exactly the same conditions. Two variables are included in the specification. The first is a generic variable belonging to both modes (i.e., auto and transit), and the second is a mode-specific constant for the auto mode (the transit mode being the reference mode). In other words, T A and T T are included as service-level attributes in the utility functions of the two modes. The final model specifications are shown in Eqs. (20) and (21),
$$ V_{An} = \beta_{1} + \beta_{2} T_{A} , $$
(20)
$$ V_{Tn} = \beta_{2} T_{T} . $$
(21)
where V An and V Tn are the utility functions of auto and transit, respectively, and \( \beta_{1} \;\text{and}\;\beta_{2} \) are the model parameters to be estimated in the computation process.
The next step concerns the estimation method, one of the core parts of custom-built modeling for gaining an in-depth understanding of how the data enter discrete choice analysis. The maximum likelihood approach is applied throughout the research: in developing the computer code and in deriving the formulas for the elements of the gradient vector and Hessian matrix. The Newton–Raphson estimation algorithm is employed mainly because of its excellent convergence characteristics and, more importantly, the higher accuracy of its parameter estimates. The data used in this research are assumed to be selected at random from a population, so the likelihood of the entire sample is the product of the likelihoods of the individual observations. The resulting likelihood function specified for this research is given in Eq. (22); for ease of computation, the log-likelihood function in Eq. (23) is used. The binary logit choice probabilities tailored to the two modes (auto, transit) are given in Eq. (24).
$$ L\left( {\beta_{1} ,\beta_{2} } \right) = \prod\nolimits_{n = 1}^{21} {P_{n} (A)^{{y_{An} }} P_{n} (T)^{{y_{Tn} }} ,} $$
(22)
$$ LL\left( {\beta_{1} ,\beta_{2} } \right) = \sum\nolimits_{n = 1}^{21} {[y_{An} lnP_{n} \left( A \right) + y_{Tn} lnP_{n} \left( T \right)],} $$
(23)
$$ P_{n} \left( A \right) = \frac{{e^{{V_{An} }} }}{{e^{{V_{An} }} + e^{{V_{Tn} }} }},\quad P_{n} \left( T \right) = \frac{{e^{{V_{Tn} }} }}{{e^{{V_{An} }} + e^{{V_{Tn} }} }}. $$
(24)
Substituting Eqs. (20), (21) and (24) into Eq. (23), the final form of the log-likelihood function specified for this experimental study is presented in Eq. (25). This equation is the base from which the various formulas embodied in the tailored computer code are obtained.
$$ LL\left( {\beta_{1} ,\beta_{2} } \right) = \sum\nolimits_{n = 1}^{21} {\left[ {y_{An} \ln \left( {\frac{{e^{{\beta_{1} + \beta_{2} T_{A} }} }}{{e^{{\beta_{1} + \beta_{2} T_{A} }} + e^{{\beta_{2} T_{T} }} }}} \right) + y_{Tn} \ln \left( {\frac{{e^{{\beta_{2} T_{T} }} }}{{e^{{\beta_{1} + \beta_{2} T_{A} }} + e^{{\beta_{2} T_{T} }} }}} \right)} \right]} . $$
(25)
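To indicate how Eq. (25) maps onto code, a sketch of the log-likelihood evaluation is given below. It is illustrative only (not the authors' original code) and assumes the 21 observations are held in arrays yA(), tA(), tT(), where yA(n) is 1 if observation n chose auto and 0 otherwise, and tA(n), tT(n) hold the service-level attributes T A and T T. Note that Log in VBA is the natural logarithm.

' Log-likelihood of Eq. (25), accumulated over the sample.
Function LogLikelihood(b1 As Double, b2 As Double, _
                       yA() As Double, tA() As Double, tT() As Double) As Double
    Dim n As Long, vA As Double, vT As Double, pA As Double, LL As Double
    For n = LBound(yA) To UBound(yA)
        vA = b1 + b2 * tA(n)               ' Eq. (20)
        vT = b2 * tT(n)                    ' Eq. (21)
        pA = Exp(vA) / (Exp(vA) + Exp(vT)) ' Eq. (24)
        LL = LL + yA(n) * Log(pA) + (1 - yA(n)) * Log(1 - pA)
    Next n
    LogLikelihood = LL
End Function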

3.2 Derivation of gradient and Hessian formulas

The gradient of the log-likelihood function, a vector of the first derivatives of Eq. (25), plays an important role in determining the search direction for parameter estimation. Knowing its importance for developing the computer code, we derived the first-order partial derivatives of Eq. (25). The most critical part of developing an estimation program is understanding the relationship between the data and the derived formulas. In particular, although the functional form of a discrete choice model looks simple, defining the operational mapping between the formulas and the data set is vital; moreover, the expansion of the log-likelihood function is computationally extensive and costly [4]. This section presents only the final formulation of the gradient vector, derived analytically by partial differentiation. Its elements are given by Eqs. (26) and (27).
$$ \frac{{\partial LL(\beta_{1} ,\beta_{2} )}}{{\partial \beta_{1} }} = \sum\nolimits_{n = 1}^{21} {[y_{An} P_{n} \left( T \right) - y_{Tn} P_{n} \left( A \right)]} , $$
(26)
$$ \frac{{\partial LL(\beta_{1} ,\beta_{2} )}}{{\partial \beta_{2} }} = \sum\nolimits_{n = 1}^{21} {[y_{An} P_{n} \left( T \right)(T_{An} - T_{Tn} ) - y_{Tn} P_{n} \left( A \right)(T_{An} - T_{Tn} )]} . $$
(27)
The real gradient values are calculated during the estimation process and are constantly updated until the iteration process terminates with the final parameter estimates. The gradient of the log-likelihood function can be summarized as a column vector, as shown in Eq. (28); in this research, the vector has two elements, one for each parameter.
$$ G_{t} = \frac{{\partial LL\left( {\beta_{1} ,\beta_{2} } \right)}}{{\partial \beta_{k} }} = \left[ {\begin{array}{*{20}c} {\frac{{\partial LL\left( {\beta_{1} ,\beta_{2} } \right)}}{{\partial \beta_{1} }}} \\ {\frac{{\partial LL\left( {\beta_{1} ,\beta_{2} } \right)}}{{\partial \beta_{2} }}} \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {\sum\nolimits_{n = 1}^{21} {\left[ {y_{An} P_{n} \left( T \right) - y_{Tn} P_{n} \left( A \right)} \right]} } \\ {\sum\nolimits_{n = 1}^{21} {\left[ {y_{An} P_{n} \left( T \right)\left( {T_{An} - T_{Tn} } \right) - y_{Tn} P_{n} \left( A \right)\left( {T_{An} - T_{Tn} } \right)} \right]} } \\ \end{array} } \right]. $$
(28)
Finally, the mathematical expression of the gradient vector is translated into computer code based on the expansion above, and its numerical value changes continuously during the iterations until the process terminates.
To employ the Newton–Raphson algorithm as the iteration method in custom-built discrete choice modeling, the second-order partial derivatives of the log-likelihood function must be derived for the given choice situation. It therefore falls to the researcher (or modeller) to derive analytically the second-order partial derivatives of the log-likelihood function [4], which contain the full information associated with the sample used in the analysis. The process is admittedly troublesome and computationally intensive; however, no commercial software offers the same level of understanding to researchers, and practical experience is gained by reproducing analytically all procedures associated with model estimation. In this context, the Newton–Raphson algorithm combined with the real Hessian matrix shows the best estimation performance in discrete mode choice modeling compared with algorithms operating on an approximate Hessian matrix, such as BHHH, BHHH-2, and BFGS, to name a few [3]. Because the Hessian matrix is symmetric, only one of its two off-diagonal elements needs to be derived. The remainder of this section presents the formulas resulting from the second-order partial derivatives of the log-likelihood function: the derivatives with respect to β 1 β 1, β 1 β 2 (or β 2 β 1), and β 2 β 2 are given by Eqs. (29), (30), and (31), respectively.
$$ \frac{{\partial^{2} LL(\beta_{1} ,\beta_{2} )}}{{\partial \beta_{1} \partial \beta_{1} }} = \sum\nolimits_{n = 1}^{21} {[ - y_{An} P_{n} \left( T \right)P_{n} \left( A \right) - y_{Tn} P_{n} \left( A \right)P_{n} \left( T \right)]} , $$
(29)
$$ \frac{{\partial^{2} LL(\beta_{1} ,\beta_{2} )}}{{\partial \beta_{1} \partial \beta_{2} \;({\text{or}}\;\partial \beta_{2} \partial \beta_{1} )}} = \sum\nolimits_{n = 1}^{21} {[y_{An} P_{n} \left( T \right)P_{n} \left( A \right)\left( {T_{Tn} - T_{An} } \right) - y_{Tn} P_{n} \left( T \right)P_{n} \left( A \right)\left( {T_{An} - T_{Tn} } \right)]} , $$
(30)
$$ \frac{{\partial^{2} LL(\beta_{1} ,\beta_{2} )}}{{\partial \beta_{2} \partial \beta_{2} }} = \sum\nolimits_{n = 1}^{21} {[ - y_{An} P_{n} \left( T \right)P_{n} \left( A \right)\left( {T_{An} - T_{Tn} } \right)^{2} - y_{Tn} P_{n} \left( T \right)P_{n} \left( A \right)\left( {T_{An} - T_{Tn} } \right)^{2} ]} . $$
(31)
These three distinct elements make up the Hessian matrix given in Eq. (32).
$$ H_{t} = \left[ {\begin{array}{*{20}c} {\frac{{\partial^{2} LL(\beta_{1} ,\beta_{2} )}}{{\partial \beta_{1} \partial \beta_{1} }}} & {\frac{{\partial^{2} LL(\beta_{1} ,\beta_{2} )}}{{\partial \beta_{1} \partial \beta_{2} }}} \\ {\frac{{\partial^{2} LL(\beta_{1} ,\beta_{2} )}}{{\partial \beta_{2} \partial \beta_{1} }}} & {\frac{{\partial^{2} LL(\beta_{1} ,\beta_{2} )}}{{\partial \beta_{2} \partial \beta_{2} }}} \\ \end{array} } \right]. $$
(32)
With these formulas, all the prerequisites for the custom-built discrete choice modeling process are in place.
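As a sketch of how Eqs. (26)–(32) translate into code, the gradient vector and Hessian matrix can be accumulated observation by observation. As before, this is illustrative only, with assumed array names rather than the authors' original code.

' Gradient (Eq. 28) and Hessian (Eq. 32) of the log-likelihood,
' accumulated over the sample into g(1 To 2) and h(1 To 2, 1 To 2).
Sub GradientAndHessian(b1 As Double, b2 As Double, _
                       yA() As Double, tA() As Double, tT() As Double, _
                       g() As Double, h() As Double)
    Dim n As Long, vA As Double, vT As Double
    Dim pA As Double, pT As Double, dT As Double
    g(1) = 0: g(2) = 0
    h(1, 1) = 0: h(1, 2) = 0: h(2, 2) = 0
    For n = LBound(yA) To UBound(yA)
        vA = b1 + b2 * tA(n)
        vT = b2 * tT(n)
        pA = Exp(vA) / (Exp(vA) + Exp(vT))
        pT = 1 - pA
        dT = tA(n) - tT(n)
        ' Eqs. (26) and (27); y_Tn = 1 - y_An in the binary case
        g(1) = g(1) + yA(n) * pT - (1 - yA(n)) * pA
        g(2) = g(2) + (yA(n) * pT - (1 - yA(n)) * pA) * dT
        ' Eqs. (29)-(31)
        h(1, 1) = h(1, 1) - pA * pT
        h(1, 2) = h(1, 2) - pA * pT * dT
        h(2, 2) = h(2, 2) - pA * pT * dT * dT
    Next n
    h(2, 1) = h(1, 2)   ' symmetry of the Hessian
End Sub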

3.3 Visual basic application (VBA)

Computer codes are written for the estimation algorithms and the maximum likelihood method explained in the previous sections. The VBA package in Microsoft Excel was utilized for this purpose. There are two critical parts that researchers must address when developing such codes: first, how the log-likelihood function is realized in computer code, and second, how the estimation algorithms are expressed in the language of the code. All estimation runs are conducted on a personal computer (Intel Pentium 4, CPU 2.80 GHz, 512 MB RAM).

4 Iteration algorithms and estimation factors

In off-the-shelf software for modeling, the package primarily enables a researcher to find the optimal set of parameter estimates under a given estimation algorithm, where the estimation factors are prefixed by the vendor. In custom-built modeling environments, however, finding the parameter estimates depends on the iteration method predefined in the chosen estimation algorithm. Furthermore, to make the estimation operational, the researcher must input initial estimation factors that affect the overall estimation performance from the initial stage; these factors include the step size and the convergence criterion. Six algorithms are applied to experimentally estimate the parameters of the given model specification in the binary logit choice model framework. It was noticed during the course of this research that the iteration methods of the six algorithms differ from one another mainly in the way the (approximate) Hessian is updated. The following paragraphs therefore discuss in detail the role of the estimation factors, using the estimation method of the Newton–Raphson (NR) algorithm.
The Newton–Raphson algorithm uses the real Hessian matrix in its estimation routine, employing true Hessian values generated from the second derivative of the log-likelihood function. The computations for obtaining the real Hessian matrix are expensive, and most of the work associated with the analytical derivation of the log-likelihood function is extensive; nevertheless, it is very useful [4]. In this research, the mathematical relationships for the real Hessian matrix are formulated by direct analytical differentiation of the given log-likelihood function (Eq. 25), and the final form of the matrix is provided through Eqs. (29)–(31). Translating the estimation procedures into computer code involves (1) understanding the fundamental iteration rule of the Newton–Raphson algorithm and (2) separating the functionality of the estimation procedures into independent, interchangeable modules. The estimation method of Newton–Raphson employed in this research can be derived by taking the second-order Taylor approximation of LL(β t + 1) around LL(β t ), as follows. The Taylor series expansion of f(x) around f(x 0) can be written as below [4].
$$ f\left( x \right) =\, \frac{{f(x_{0} )}}{(0)!} + \frac{{f^{\left( 1 \right)} \left( {x_{0} } \right)}}{\left( 1 \right)!}\left( {x - x_{0} } \right) +\, \frac{{f^{\left( 2 \right)} \left( {x_{0} } \right)}}{\left( 2 \right)!}\left( {x - x_{0} } \right)^{2} + \cdots + \frac{{f^{\left( n \right)} \left( {x_{0} } \right)}}{\left( n \right)!}(x - x_{0} )^{n} . $$
(33)
The above function can be shortened by taking the second-order approximation of Taylor’s series as shown below:
$$ LL\left( {\beta_{t + 1} } \right) \cong \frac{{LL(\beta_{t} )}}{(0)!} + \frac{{LL^{(1)} \left( {\beta_{t} } \right)}}{\left( 1 \right)!}\left( {\beta_{t + 1} - \beta_{t} } \right) + \frac{{LL^{(2)} \left( {\beta_{t} } \right)}}{\left( 2 \right)!}\left( {\beta_{t + 1} - \beta_{t} } \right)^{2} , $$
(34)
$$ \,\,\,\,= LL(\beta_{t} ) + G_{t} \left( {\beta_{t + 1} - \beta_{t} } \right) + \frac{1}{2}H_{t} \left( {\beta_{t + 1} - \beta_{t} } \right)^{2} . $$
(35)
It should be noted that f(x) corresponds to LL(β t + 1) and f(x 0) to LL(β t ) in Eq. (34). Based on Eq. (35), the value of β t + 1 that maximizes the approximation is found by setting \( \frac{{\partial LL(\beta_{t + 1} )}}{{\partial \beta_{t + 1} }} = 0 \). Taking the first derivative with respect to β t + 1 results in
$$ \frac{{\partial LL\left( {\beta_{t + 1} } \right)}}{{\partial \beta_{t + 1} }} = G_{t} + H_{t} (\beta_{t + 1} - \beta_{t} ). $$
(36)
Consequently, solving Eq. (36) for β t + 1 gives the iteration rule of Newton–Raphson:
$$ \beta_{t + 1} = \beta_{t} + \lambda ( - H_{tReal}^{NR} )^{ - 1} G_{t} , $$
(37)
where β t is the vector of parameter estimates after t iterations, β t + 1 is the vector after t + 1 iterations, λ is the step size, ( −H tReal NR )−1 is the inverse of the negative real Hessian matrix in iteration t, G t is the gradient vector in iteration t, and ( −H tReal NR )−1 G t is the search direction of the iteration process. Stated simply, the (t + 1)th parameter estimate equals the tth estimate plus the negative inverse Hessian multiplied by the gradient, scaled by the step size. The estimation methods of the other five algorithms considered in this research are described in detail by Roh and Khan [4], Train [3], and Greene [6].
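For the two-parameter model used here, the update in Eq. (37) only requires inverting a 2 × 2 matrix, so one NR step can be written in closed form. The following sketch is a simplified illustration under the assumptions above, not the authors' code.

' One Newton-Raphson step of Eq. (37):
' beta(t+1) = beta(t) + lambda * (-H)^(-1) * G.
Sub NewtonRaphsonStep(b1 As Double, b2 As Double, lambda As Double, _
                      g() As Double, h() As Double)
    Dim det As Double, i11 As Double, i12 As Double, i22 As Double
    ' Closed-form inverse of -H for the 2 x 2 case;
    ' note det(-H) = det(H) for a 2 x 2 matrix.
    det = h(1, 1) * h(2, 2) - h(1, 2) * h(2, 1)
    i11 = -h(2, 2) / det
    i12 = h(1, 2) / det
    i22 = -h(1, 1) / det
    ' b1, b2 are passed ByRef (the VBA default), so the caller
    ' receives the updated estimates.
    b1 = b1 + lambda * (i11 * g(1) + i12 * g(2))
    b2 = b2 + lambda * (i12 * g(1) + i22 * g(2))
End Sub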
In this study, both H tReal NR and G t are sample-average values of the Hessian and gradient; they are calculated at each step and continuously updated until convergence. Based on the fact that the log-likelihood function of the logit probability is always globally concave [7], the following paragraphs explain how the parameters are searched during the iteration process, from the initial guesses to the final converged estimates.
The “direction to follow” concept of the iteration method is shown in Fig. 1. G t is the slope of the log-likelihood function, and H tReal NR is the change in that slope. For a globally concave function, H tReal NR is always negative, and so is its inverse (H tReal NR )−1; the negative inverse, −(H tReal NR )−1, is therefore positive. Hence the direction of the step −(H tReal NR )−1 G t depends entirely on the sign of G t : if G t is positive, β t moves to the right (Fig. 1a); otherwise, β t moves to the left (Fig. 1b), following the iteration rule in Eq. (37). The step size λ is set by the researcher at the initial stage of the iteration, and the step magnitude is given by the term λ( −H tReal NR )−1 G t in the iteration rule. In either case, β t moves towards the value that maximizes the log-likelihood function.
The step implied by λ( −H tReal NR )−1 G t is smaller when the curvature of the log-likelihood function, −H tReal NR , is large in magnitude (Fig. 2a); otherwise, the step is larger (Fig. 2b).
To understand “step size” in the context of the iteration method, it is useful to consider the maximization of a quadratic log-likelihood function, LL(β) = A + Bβ + Cβ 2, and its convergence to the maximum. For a quadratic function, the optimum is reached in one iteration [3]. However, most LL(β) are not quadratic in nature and require more than one iteration to attain the optimum. This problem can be addressed by choosing an appropriate initial step size, thereby minimizing the number of iterations and the convergence time. The step size determination is explained in the following two cases.

4.1 Case–I: LL (β) is not a quadratic function

Figure 3 shows two different functions represented by solid and dashed lines: the solid line is the actual LL(β) and the dashed line is its quadratic approximation. After one iteration (λ = 1), if the function were quadratic, β t + 1 would reach the maximum point. Because the actual LL(β) is not quadratic, β t + 1 overshoots the maximum, with the result that LL(β t + 1) < LL(β t ). To guarantee an increasing LL(β), one can decrease λ to 1/2 and then compare LL(β t + 1) with LL(β t ). If LL(β t + 1) < LL(β t ) still holds, λ is decreased further to 1/4, then 1/8, 1/16, and so on, until LL(β t + 1) > LL(β t ) is satisfied at some λ.

4.2 Case–II: LL(β) is a quadratic function

The second case is the opposite of the first. As shown in Fig. 4, the solid line is the actual LL(β) and the dashed line its quadratic approximation. At the first iteration with λ = 1, β t + 1 reaches the maximum of the quadratic, but the maximum of the actual LL(β) lies farther away, so λ should be increased (a larger step taken) in order to satisfy LL(β t + 1) > LL(β t ). In Fig. 4, λ = 2 is the best step size for the estimation process. A sketch of this step-size adjustment rule is given below.
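Both cases reduce to a simple rule wrapped around the NR step: halve λ until the log-likelihood increases (Case I), or keep enlarging λ while it continues to increase (Case II). The sketch below implements the halving rule of Sect. 4.1, reusing the hypothetical LogLikelihood and NewtonRaphsonStep routines sketched earlier; it is illustrative, not the authors' code.

' Step-size search for one iteration (Case I): starting from
' lambda = 1, halve lambda until LL(beta_new) > LL(beta_old).
Sub BacktrackStep(b1 As Double, b2 As Double, _
                  yA() As Double, tA() As Double, tT() As Double, _
                  g() As Double, h() As Double)
    Dim lambda As Double, llOld As Double, llNew As Double
    Dim b1New As Double, b2New As Double
    lambda = 1
    llOld = LogLikelihood(b1, b2, yA, tA, tT)
    Do
        b1New = b1: b2New = b2
        NewtonRaphsonStep b1New, b2New, lambda, g, h
        llNew = LogLikelihood(b1New, b2New, yA, tA, tT)
        If llNew > llOld Then Exit Do    ' LL increased: accept lambda
        lambda = lambda / 2              ' try 1/2, 1/4, 1/8, ...
    Loop While lambda > 0.000001         ' guard against endless halving
    b1 = b1New: b2 = b2New
    ' For Case II, one would instead double lambda while LL keeps rising.
End Sub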
Subsequently, the tailored VBA codes developed earlier are used to compare the convergence behaviour of the six estimation algorithms in terms of the performance measures. The following section presents and discusses the experimental results.

5 Experimental results

The computer codes developed for this research worked successfully within the simulated discrete choice model estimation environment. The parameter estimates of the model are compared with the true values [5], and the estimates are found to be close in magnitude across the six algorithms, which means that the computer codes developed for this empirical estimation worked on the right estimation track. The parameter estimates for the six algorithms are presented in Table 1, each with the step size λ that showed the best performance under the given convergence criterion (CR = 10−4). All estimation procedures involved in discrete choice modelling are thus completely translated into tailored computer estimation codes. The purpose of this research is to gain an in-depth understanding of the calculation mechanism underlying discrete choice modeling, not to develop a model; therefore, statistical interpretation of the parameters is not within the scope of this research.
Table 1
Parameters estimated from six algorithms (CR = 10−4)

Variables        NR (λ = 1)    BHHH (λ = 1/2)   BHHH-2 (λ = 1/2)   SA (λ = 16)   DFP (λ = 16)   BFGS (λ = 8)
\( \beta_{1} \)  −0.237575     −0.237462        −0.237428          −0.237588     −0.237575      −0.237576
\( \beta_{2} \)  −3.186590     −3.186410        −3.186355          −3.186671     −3.186590      −3.186590

NR Newton–Raphson; BHHH Berndt, Hall, Hall, and Hausman; SA steepest ascent; DFP Davidon, Fletcher, and Powell; BFGS Broyden, Fletcher, Goldfarb, and Shanno
The relative performances of the algorithms are compared using the three performance measures: number of iterations, iteration time, and convergence time. In the first case, different convergence criteria are chosen; the second case involves different initial guesses of the parameter values. After applying various step sizes together with the iteration rule, the experimental runs that converged successfully are selected and summarized.
Before presenting the experimental results, the overall convergence characteristics of the six algorithms are shown using the 3D convergence surfaces of Fig. 5; only the best-performing run of each algorithm is depicted. The log-likelihood values of all observations in each iteration are saved continuously until the estimation process terminates, so the unique convergence behaviour determined by the Hessian-updating mechanism prefixed in each algorithm can be clearly demonstrated. Log-likelihood values are plotted on the Z-axis, the number of iterations on the X-axis, and the observation numbers of the estimation sample on the Y-axis (Fig. 5b). In Fig. 5a, the log-likelihood value calculated for each observation at the first iteration is the same (−0.693147181); this value results from the assumption that all parameter values are zero at the initial stage of estimation. NR, as shown in Fig. 5a, exhibits the simplest convergence behaviour in terms of the small number of iterations (X-axis) and hence a short convergence time: its total log-likelihood started at −14.55609079 and converged to −6.166042212 at the 6th iteration. The performances of DFP and BFGS are shown in Fig. 5e, f, respectively; their convergence patterns are similar in terms of both the number of iterations and the convergence time. BHHH and BHHH-2, presented in Fig. 5b, c, respectively, are also very similar in their convergence behaviour, a result of their almost identical iterative processes for updating the approximate Hessian matrix. SA, shown in Fig. 5d, performed similarly to BHHH and BHHH-2.

5.1 Numerical experiment 1

The two levels of convergence criteria used in this experiment are taken from earlier studies [5, 8, 9]. The iteration is completed with β t + 1,k as the solution when the prefixed stopping precision is satisfied. The mathematical expressions of the convergence criteria are shown below. Numerical experiment 1 uses β 0 = [β 01 = 0, β 02 = 0] as the initial guess of the starting point.
$$ \left[\frac{1}{K} \sum\nolimits_{k = 1}^{K} ({\beta_{t + 1,k}} - {\beta_{t,k}})^{2}\right]^{1/2} < CR = {10^{-4}}, $$
(38)
$$ \left[\frac{1}{K} \sum\nolimits_{k = 1}^{K} ({\beta_{t + 1,k}} - {\beta_{t,k}})^{2}\right]^{1/2} < CR = {10^{-6}}, $$
(39)
where K is the number of parameters (here K = 2).
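In code, the stopping rules of Eqs. (38) and (39) amount to a root-mean-square change in the parameter vector. A sketch with illustrative names follows; cr would be set to 0.0001 or 0.000001 for the two experiments.

' Convergence test of Eqs. (38)/(39): the root of the mean squared
' change of the K parameters must fall below the criterion cr.
Function HasConverged(bOld() As Double, bNew() As Double, cr As Double) As Boolean
    Dim k As Long, sumSq As Double, nParam As Long
    nParam = UBound(bOld) - LBound(bOld) + 1
    For k = LBound(bOld) To UBound(bOld)
        sumSq = sumSq + (bNew(k) - bOld(k)) ^ 2
    Next k
    HasConverged = (Sqr(sumSq / nParam) < cr)
End Function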

5.1.1 Experimental run with CR = 10−4

The results of the experimental runs (trials that converged) under convergence criterion CR = 10−4 for the six algorithms are summarized graphically in Figs. 6 and 7. From these figures, the NR algorithm is clearly the most appropriate in terms of all three performance measures, despite the mathematical difficulty of computing the real Hessian of the log-likelihood function. The time per iteration is found to be less than 1 s for all algorithms because of the simple model specification. Varied step sizes were tried for all six algorithms to obtain the maximum value of the log-likelihood function; the step size can be chosen at random at the initial stage of estimation. With step size λ = 1, NR showed better performance (converging after 6 iterations) than the other algorithms as far as the number of iterations is concerned.
The convergence time can be used as a sign of the operational efficiency of an algorithm because it directly reflects the relative convergence speed. In terms of the number of iterations, there are differences across step sizes within the same algorithm as well as between the six algorithms (Fig. 6); the SA algorithm showed the worst performance. The convergence times reveal notable differences between the algorithms: the NR algorithm performed best for all step sizes, followed by BFGS, DFP, BHHH, BHHH-2, and SA, in that order. In particular, with step size λ = 1, the NR algorithm needs just 1 s to finish the calculation, whereas the SA algorithm takes 2 min 39 s with step size λ = 1/32.

5.1.2 Experimental run with CR = 10−6

In the experiment evaluating performance under the tighter convergence criterion (CR = 10−6), a different precision level with the same mathematical expression, shown in Eq. (39), is used. The comparative results are presented graphically in Fig. 8. For all algorithms, step sizes, and performance measures, the estimation performance deteriorated relative to the experiment with CR = 10−4. For example, for the SA algorithm (λ = 1/32), the convergence time under CR = 10−6 increased significantly (from 0:02:39 to 0:09:43), as did the number of iterations (from 2,320 to 7,033). The pattern of deterioration is broadly the same for all six algorithms. The NR algorithm, however, required only seven iterations with step size 1, just one iteration more than in the run with CR = 10−4. Otherwise, the convergence behaviour is identical to the experiment with CR = 10−4, apart from slight numerical differences in the two performance measures.

5.2 Numerical experiment 2

The second numerical experiment examines the variation in algorithm performance that may arise from different starting points for the parameters. It is initiated to understand how the choice of starting points affects estimation performance; finding a new method for making a good initial guess is not within the scope of this experiment. The focus here is on whether changing the initial starting points makes a difference in estimation performance, particularly in a custom-built environment. The outcome will help in suggesting a methodical approach to guessing starting points for better operational performance, which may be considered future research arising from this study. In this investigation, another parameter vector is introduced, β 0 = [β 01 = −0.1, β 02 = −0.1], in which the two parameters are changed from 0 to −0.1 to reflect the case where an analyst has prior knowledge of the parameter estimates and can assign a negative (−) sign to the two parameters. The convergence criterion remains the same as in Sect. 5.1.1. Detailed results for all algorithms are presented in Fig. 9. Overall, small improvements were observed in performance, but in a few cases deterioration also occurred. For example, DFP runs with step sizes 1/32 and 1/16 did not converge from the new starting points, although the same runs converged in Experiment 1 (\( CR = 10^{ - 4} \;\text{and}\;10^{ - 6} \)). The rationale for this dissimilar convergence behaviour could not be established definitively in this study.

6 Conclusion and discussion

Discrete choice analysis has been of interest to researchers in many diverse disciplines for predicting consumers' choice behaviour. In the transportation domain in particular, discrete choice modeling techniques have been applied many times to predict the behaviour of users of transportation systems and services. However, almost all research associated with discrete choice modeling focuses on the interpretation of the estimated results or on finding policy implications in the model parameters to support decision making, and almost all discrete choice models are estimated using commercial statistical packages. In the earlier generation of discrete choice modeling, it was common practice to follow the instructions of a selected computer package. As choice model specifications become more complicated and custom-built modeling is increasingly in demand, researchers working in the modeling domain are challenged to go beyond first-generation modeling practices. Nonetheless, there is little literature on estimation algorithms and their important components. Although some literature on this topic is available in the econometric field [8–10], it is not directly related to the concerns of transportation modellers; these studies also have a different theoretical base, which constrains writing computer codes for transportation modeling applications.
Several factors that dominate estimation performance are identified through this experimental model estimation: (a) the convergence criterion and (b) the initial guess of the starting points. In this research, the convergence criterion \( \left[ {\frac{1}{K}\sum\nolimits_{k = 1}^{K} {(\beta_{t + 1,k} - \beta_{t,k} )^{2} } } \right]^{1/2} \) is used, which is a generally accepted criterion in the literature [8, 9]. Based on the results, some important aspects might be considered in more elaborate research on custom-built modeling. Potential research extensions are noted below.
(a) Different mathematical formulations of the stopping criterion may be tried, and different levels of stopping accuracy, such as \( \left[ {\frac{1}{K}\sum\nolimits_{k = 1}^{K} {(\beta_{t + 1,k} - \beta_{t,k} )^{2} } } \right]^{1/2} < 10^{ - 6} \) or 10−7, can be adopted to see whether they make a significant difference in model estimation performance. This issue should be revisited in future work.
(b) When researchers have to guess the initial starting points of the parameters in custom-built modeling, it saves time if the initial guess is fairly close to the true values on the first trial. However, the initial guesses chosen by researchers are almost never the actual parameter estimates of the given model. In this context, it is a challenge for researchers to be more systematic in choosing the starting parameter vector, which may be a critical subject in custom-built modeling: the initial guess can affect model estimation performance to a large extent in terms of the number of iterations, convergence time, and so on. A related study was reported by Liu and Mahmassani [11], who use a genetic algorithm to generate initial guesses in an empirical estimation of multinomial probit model parameters; they demonstrated that the initial guess of the parameters can have a significant effect on overall estimation performance. All the issues addressed here are noteworthy for future extensions of the research reported in this paper.
Open Access This article is distributed under the terms of the Creative Commons Attribution License, which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
Literature
1.
Roh HJ (2013) Mode choice behavior of various airport user groups for ground airport access. Open Transp J 7:43–55
2.
Roh HJ (2013) Empirical application of four fundamental algorithms for parameter estimates of multinomial logit discrete choice model. In: CSCE conference, Montreal
3.
Train K (2003) Discrete choice methods with simulation. Cambridge University Press, Cambridge
4.
Roh HJ, Khan AM (2013) Enhancing algorithmic base for discrete choice modelling. KSCE J Civ Eng 17(7):1798–1809
5.
Ben-Akiva M, Lerman SR (1985) Discrete choice analysis: theory and application to travel demand. MIT Press, Cambridge
6.
Greene WH (2003) Econometric analysis. Pearson Education, Upper Saddle River
7.
McFadden D (1974) Conditional logit analysis of qualitative choice behaviour. In: Frontiers in econometrics. Academic Press, New York
8.
Belsley DA (1979) On the computational competitiveness of full-information maximum likelihood and three-stage least-squares in the estimation of nonlinear, simultaneous-equations models. J Econom 9:315–342
9.
Belsley DA (1980) On the efficient computation of the nonlinear full-information maximum-likelihood estimator. J Econom 14:203–225
10.
Bunch DS (1988) A comparison of algorithms for maximum likelihood estimation of choice models. J Econom 38:145–167
11.
Liu YH, Mahmassani HS (2000) Global maximum likelihood estimation procedure for multinomial probit (MNP) model parameters. Transp Res Part B 34:419–449