A copula-based approach to accommodate residential self-selection effects in travel behavior modeling

https://doi.org/10.1016/j.trb.2009.02.001

Abstract

The dominant approach in the literature to dealing with sample selection is to impose a bivariate normality assumption directly on the error terms, or on transformed error terms, in the discrete and continuous equations. Such an assumption can be restrictive and inappropriate, since it implies a linear and symmetrical dependency structure between the error terms. In this paper, we introduce and apply a flexible approach to sample selection in the context of built environment effects on travel behavior. The approach is based on the concept of a “copula”, which is a multivariate functional form for the joint distribution of random variables derived purely from pre-specified parametric marginal distributions of each random variable. The copula concept has been recognized in the statistics field for several decades now, but it is only recently that it has been explicitly recognized and employed in the econometrics field. The copula-based approach retains a parametric specification for the bivariate dependency, but allows testing of several parametric structures to characterize the dependency. The empirical context in the current paper is a model of residential neighborhood choice and daily household vehicle miles of travel (VMT), using the 2000 San Francisco Bay Area Household Travel Survey (BATS). The sample selection hypothesis is that households select their residence locations based on their travel needs, which implies that observed VMT differences between households residing in neo-urbanist and conventional neighborhoods cannot be attributed entirely to the built environment variations between the two neighborhood types. The results indicate that, in the empirical context of the current study, the VMT differences between households in different neighborhood types may be attributed to both built environment effects and residential self-selection effects. Just as importantly, the study indicates that use of a traditional Gaussian bivariate distribution to characterize the relationship in errors between residential choice and VMT can lead to misleading implications about built environment effects.

Introduction

There has been considerable interest in the land use-transportation connection in the past decade, motivated by the possibility that land-use and urban form design policies can be used to control, manage, and shape individual traveler behavior and aggregate travel demand. A central issue in this regard is the debate whether any effect of the built environment on travel demand is causal or merely associative (or some combination of the two; see Bhat and Guo, 2007). To explicate this, consider a cross-sectional sample of households, some of whom live in a neo-urbanist neighborhood and others of whom live in a conventional neighborhood. A neo-urbanist neighborhood is one with high population density, high bicycle lane and roadway street density, good land-use mix, and good transit and non-motorized mode accessibility/facilities. A conventional neighborhood is one with relatively low population density, low bicycle lane and roadway street density, primarily single use residential land use, and auto-dependent urban design. Assume that the vehicle miles of travel (VMT) of households living in conventional neighborhoods is higher than the VMT of households residing in neo-urbanist neighborhoods. The question is whether this difference in VMT between households in conventional and neo-urbanist neighborhoods is due to “true” effects of the built environment, or due to households self-selecting themselves into neighborhoods based on their VMT desires. For instance, it is at least possible (if not likely) that unobserved factors that increase the propensity or desire of a household to reside in a conventional neighborhood (such as overall auto inclination, a predisposition to enjoying travel, and safety and security concerns regarding non-auto travel) also lead to the household putting more vehicle miles of travel on personal vehicles. If this self-selection is not accounted for, the difference in VMT attributed directly to the variation in the built environment between conventional and neo-urbanist neighborhoods can be mis-estimated. On the other hand, accommodating such self-selection effects can aid in identifying the “true” causal effect of the built environment on VMT.

The situation just discussed can be cast in the form of Roy’s (1951) endogenous switching model system (see Maddala, 1983, Chapter 9), which takes the following form:

$$
\begin{aligned}
r_q^* &= \beta' x_q + \varepsilon_q, & r_q &= 1 \ \text{if} \ r_q^* > 0, \quad r_q = 0 \ \text{if} \ r_q^* \le 0, \\
m_{q0}^* &= \alpha' z_q + \eta_q, & m_{q0} &= 1[r_q = 0]\, m_{q0}^*, \\
m_{q1}^* &= \gamma' w_q + \xi_q, & m_{q1} &= 1[r_q = 1]\, m_{q1}^*.
\end{aligned}
\tag{1}
$$

The notation $1[r_q = 0]$ represents an indicator function taking the value 1 if $r_q = 0$ and 0 otherwise, while the notation $1[r_q = 1]$ represents an indicator function taking the value 1 if $r_q = 1$ and 0 otherwise. The first selection equation represents a binary discrete decision of households to reside in a neo-urbanist built environment neighborhood or a conventional built environment neighborhood. $r_q^*$ in Eq. (1) is the unobserved propensity to reside in a conventional neighborhood relative to a neo-urbanist neighborhood, which is a function of an $(M \times 1)$-column vector $x_q$ of household attributes (including a constant). $\beta$ represents a corresponding $(M \times 1)$-column vector of household attribute effects on the unobserved propensity to reside in a conventional neighborhood relative to a neo-urbanist neighborhood. In the usual structure of a binary choice model, the unobserved propensity $r_q^*$ gets reflected in the actual observed choice $r_q$ ($r_q = 1$ if the $q$th household chooses to reside in a conventional neighborhood, and $r_q = 0$ if the $q$th household decides to reside in a neo-urbanist neighborhood). $\varepsilon_q$ is usually a standard normal or logistic error term capturing the effects of unobserved factors on the residential choice decision.

The second and third equations of the system in Eq. (1) represent the continuous outcome variables of log(vehicle miles of travel) in our empirical context. $m_{q0}^*$ is a latent variable representing the logarithm of miles of travel if a random household $q$ were to reside in a neo-urbanist neighborhood, and $m_{q1}^*$ is the corresponding variable if the household $q$ were to reside in a conventional neighborhood. These are related to vectors of household attributes $z_q$ and $w_q$, respectively, in the usual linear regression fashion, with $\eta_q$ and $\xi_q$ being random error terms. Of course, we observe $m_{q0}^*$ in the form of $m_{q0}$ only if household $q$ in the sample is observed to live in a neo-urbanist neighborhood. Similarly, we observe $m_{q1}^*$ in the form of $m_{q1}$ only if household $q$ in the sample is observed to live in a conventional neighborhood.
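The switching system above can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: all coefficient values, the sample size, the single selection covariate, and the use of jointly normal errors are hypothetical choices made for simplicity, and the outcome equations contain only a constant. It shows how a naive comparison of observed mean log-VMT across the two neighborhood types can differ from the average built environment effect when the selection error is correlated with the outcome errors.

```python
import random
from statistics import mean


def simulate(n=20000, rho=0.5, seed=1):
    """Simulate a Roy-type switching system with hypothetical parameters.

    rho is the (assumed) correlation between the selection error eps and each
    outcome error (eta, xi). Returns (naive_difference, average_true_effect).
    """
    rng = random.Random(seed)
    true_gap = 0.3                      # hypothetical built environment effect on log-VMT
    scale = (1.0 - rho ** 2) ** 0.5     # keeps eta and xi at unit variance
    observed_neo, observed_conv, effects = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                      # one household attribute (assumed)
        eps = rng.gauss(0.0, 1.0)                    # selection error
        eta = rho * eps + scale * rng.gauss(0.0, 1.0)  # error in neo-urbanist regime
        xi = rho * eps + scale * rng.gauss(0.0, 1.0)   # error in conventional regime
        r_star = 1.0 * x + eps                       # latent propensity for conventional
        m0_star = 3.0 + eta                          # latent log-VMT if neo-urbanist
        m1_star = 3.0 + true_gap + xi                # latent log-VMT if conventional
        effects.append(m1_star - m0_star)
        if r_star > 0:                               # r_q = 1: only m1 is observed
            observed_conv.append(m1_star)
        else:                                        # r_q = 0: only m0 is observed
            observed_neo.append(m0_star)
    naive = mean(observed_conv) - mean(observed_neo)
    return naive, mean(effects)


naive, true_effect = simulate(rho=0.5)
print(f"naive difference in observed means: {naive:.3f}")
print(f"average true effect:                {true_effect:.3f}")
```

With the correlation set to zero, the two quantities should roughly coincide; with a positive correlation, the naive difference overstates the effect because households with a high unobserved propensity for conventional neighborhoods also tend to have higher unobserved VMT.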

The potential dependence between the error pairs $(\varepsilon_q, \eta_q)$ and $(\varepsilon_q, \xi_q)$ has to be expressly recognized in the above system, as discussed earlier from an intuitive standpoint. The classic econometric estimation approach proceeds by using Heckman’s or Lee’s approaches or their variants (Heckman, 1974, 1976, 1979, 2001; Greene, 1981; Lee, 1982, 1983; Dubin and McFadden, 1984). Heckman’s (1974) original approach used a full information maximum likelihood method with bivariate normal distribution assumptions for $(\varepsilon_q, \eta_q)$ and $(\varepsilon_q, \xi_q)$. Lee (1983) generalized Heckman’s approach by allowing the univariate error terms $\varepsilon_q$, $\eta_q$, and $\xi_q$ to be non-normal, using a technique to transform non-normal variables into normal variates, and then adopting a bivariate normal distribution to couple the transformed normal variables. Thus, while maintaining an efficient full-information likelihood approach, Lee’s method relaxes the normality assumption on the marginals but still imposes a bivariate normal coupling. In addition to these full-information likelihood methods, there are also two-step and more robust parametric approaches that impose a specific form of linearity between the error term in the discrete choice and the continuous outcome (rather than a pre-specified bivariate joint distribution). These approaches are based on the Heckman method for the binary choice case, which was generalized by Hay (1980) and Dubin and McFadden (1984) for the multinomial case. The approach involves the first step estimation of the discrete choice equation given distributional assumptions on the choice model error terms, followed by the second step estimation of the continuous equation after the introduction of a correction term that is an estimate of the expected value of the continuous equation error term given the discrete choice. However, these two-step methods do not perform well when there is a high degree of collinearity between the explanatory variables in the choice equation and the continuous outcome equation, as is usually the case in empirical applications. This is because the correction term in the second step involves a non-linear function of the discrete choice explanatory variables. But this non-linear function is effectively a linear function for a substantial range, causing identification problems when the set of discrete choice explanatory variables and continuous outcome explanatory variables are about the same. The net result is that the two-step approach can lead to unreliable estimates for the outcome equation (see Leung and Yu, 2000; Puhani, 2000).
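The near-linearity of the correction term can be checked numerically. The sketch below assumes a probit first step, for which the standard correction term for the selected sample is the inverse Mills ratio φ(v)/Φ(v) evaluated at the choice index v. The index range of [-1, 1] is a hypothetical choice for illustration, and the helper names are not from the original paper.

```python
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()  # standard normal distribution


def inverse_mills(v):
    """Correction term phi(v) / Phi(v) for a probit choice index v."""
    return STD_NORMAL.pdf(v) / STD_NORMAL.cdf(v)


def r_squared_of_linear_fit(xs, ys):
    """R^2 from a simple regression of ys on a constant and xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


# Evenly spaced index values on an assumed range of [-1, 1].
index_values = [-1.0 + 0.01 * i for i in range(201)]
correction = [inverse_mills(v) for v in index_values]
r2 = r_squared_of_linear_fit(index_values, correction)
print(f"R^2 of a straight-line fit to the correction term: {r2:.3f}")
```

A value of R^2 close to one on this range means that the correction term is almost a linear function of the choice index, so it adds little independent variation when the outcome equation contains nearly the same explanatory variables.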

Overall, Lee’s full information maximum likelihood approach has seen more application in the literature relative to the other approaches just described because of its simple structure, ease of estimation using a maximum likelihood approach, and its lower vulnerability to the collinearity problem of two-step methods. But Lee’s approach is also critically predicated on the bivariate normality assumption on the transformed normal variates in the discrete and continuous equations, which imposes the restriction that the dependence between the transformed discrete and continuous choice error terms is linear and symmetric. There are two ways that one can relax this joint bivariate normal coupling used in Lee’s approach. One is to use semi-parametric or non-parametric approaches to characterize the relationship between the discrete and continuous error terms, and the second is to test alternative copula-based bivariate distributional assumptions to couple error terms. Each of these approaches is discussed in turn next.
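Lee’s transformation and coupling can be written compactly as follows. This is a sketch in notation that is assumed here rather than taken from the original text: $F_{\varepsilon}$ and $F_{\eta}$ denote the marginal distribution functions of the two error terms, $\Phi$ the standard normal distribution function, $\Phi_2(\cdot,\cdot;\rho)$ the standard bivariate normal distribution function, and $\rho$ the correlation between the transformed variates. The same construction applies to the pair $(\varepsilon_q, \xi_q)$.

```latex
\begin{align*}
\varepsilon_q^{\dagger} &= \Phi^{-1}\!\left[F_{\varepsilon}(\varepsilon_q)\right], \qquad
\eta_q^{\dagger} = \Phi^{-1}\!\left[F_{\eta}(\eta_q)\right], \\
\Pr(\varepsilon_q \le a,\ \eta_q \le b)
  &= \Phi_2\!\left(\Phi^{-1}\!\left[F_{\varepsilon}(a)\right],\
                   \Phi^{-1}\!\left[F_{\eta}(b)\right];\ \rho\right).
\end{align*}
```

The marginals can be arbitrary, but all dependence enters through the single correlation parameter of a bivariate normal, which is why the implied dependence between the transformed errors is linear and symmetric.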

The potential econometric estimation problems associated with Lee’s parametric distribution approach have spawned a whole set of semi-parametric and non-parametric two-step estimation methods to handle sample selection, apparently having beginnings in the semi-parametric work of Heckman and Robb (1985). The general approach in these methods is to first estimate the discrete choice model in a semi-parametric or non-parametric fashion using methods developed by, among others, Cosslett (1983), Ichimura (1993), Matzkin (1992, 1993), and Briesch et al. (2002). These estimates then form the basis to develop an index function to generate a correction term in the continuous equation that is an estimate of the expected value of the continuous equation error term given the discrete choice. In the two-step parametric methods, the index function is defined based on the assumed marginal and joint distributional assumptions, or on an assumed marginal distribution for the discrete choice along with a specific linear form of relationship between the discrete and continuous equation error terms. In the semi- and non-parametric approaches, by contrast, the index function is approximated by a flexible function of parameters such as the polynomial, Hermitian, or Fourier series expansion methods (see Vella, 1998 and Bourguignon et al., 2007 for good reviews). But, of course, there are “no free lunches”. The semi-parametric and non-parametric approaches involve a large number of parameters to estimate, are relatively inefficient from an econometric estimation standpoint, typically do not allow the testing and inclusion of a rich set of explanatory variables with the usual range of sample sizes available in empirical contexts, and are difficult to implement. Further, the computation of the covariance matrix of parameters for inference is anything but simple in the semi- and non-parametric approaches. The net result is that the semi- and non-parametric approaches have been largely confined to the academic realm and have seen little use in actual empirical application.

The turn toward semi-parametric and non-parametric approaches to dealing with sample selection was ostensibly because of a sense that replacing Lee’s parametric bivariate normal coupling with alternative bivariate couplings would lead to substantial computational burden. However, an approach referred to as the “Copula” approach has recently revived interest in maintaining a Lee-like sample selection framework, while generalizing Lee’s framework to adopt and test a whole set of alternative bivariate couplings that can allow non-linear and asymmetric dependencies. A copula is essentially a multivariate functional form for the joint distribution of random variables derived purely from pre-specified parametric marginal distributions of each random variable. The reasons for the interest in the copula approach for sample selection models are several. First, the copula approach does not entail any more computational burden than Lee’s approach. Second, the approach allows the analyst to stay within the familiar maximum likelihood framework for estimation and inference, and does not entail any kind of numerical integration or simulation machinery. Third, the approach allows the marginal distributions in the discrete and continuous equations to take on any parametric distribution, just as in Lee’s method. Finally, under the copula approach, Lee’s coupling method is but one of a suite of different types of couplings that can be tested.
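To make the idea of building a joint distribution from pre-specified marginals concrete, the sketch below evaluates two well-known bivariate copula families, Frank and Clayton, and couples a logistic marginal with a standard normal marginal. It is an illustration only: the function names, the choice of these two families, the marginals, and the parameter values are assumptions made here, not the specification estimated in the paper.

```python
import math
from statistics import NormalDist


def frank_copula(u, v, theta):
    """Frank copula C(u, v); theta != 0, allows positive or negative dependence."""
    numerator = math.expm1(-theta * u) * math.expm1(-theta * v)
    return -math.log1p(numerator / math.expm1(-theta)) / theta


def clayton_copula(u, v, theta):
    """Clayton copula C(u, v) for theta > 0 and u, v in (0, 1]."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def logistic_cdf(x):
    """Standard logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-x))


def joint_cdf(x, y, copula, theta):
    """Joint CDF obtained by coupling a logistic and a standard normal marginal."""
    return copula(logistic_cdf(x), NormalDist().cdf(y), theta)


# Same marginals, two different couplings (theta values are hypothetical).
for name, copula, theta in [("Frank", frank_copula, 3.0),
                            ("Clayton", clayton_copula, 2.0)]:
    print(name, round(joint_cdf(-1.0, -1.0, copula, theta), 4))
```

Swapping the copula function changes the dependence structure while leaving both marginal distributions untouched, which is the property that lets alternative couplings be compared within one likelihood framework.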

In this paper, we apply the copula approach to examine built environment effects on vehicle miles of travel (VMT). The rest of this paper is structured as follows. The next section provides a theoretical overview of the copula approach, and presents several important copula structures. Section 3 discusses the use of copulas in sample selection models. Section 4 provides an overview of the data sources and sample used for the empirical application. Section 5 presents and discusses the modeling results. The final section concludes the paper by highlighting paper findings and summarizing implications.

Section snippets

Background

The incorporation of dependency effects in econometric models can be greatly facilitated by using a copula approach for modeling joint distributions, so that the resulting model can be in closed-form and can be estimated using direct maximum likelihood techniques (the reader is referred to Trivedi and Zimmer (2007) or Nelsen (2006) for extensive reviews of copula theory, approaches, and benefits). The word copula itself was coined by Sklar (1959) and is derived from the Latin word “copulare”,

Model estimation and measurement of treatment effects

In the current paper, we introduce copula methods to accommodate residential self-selection in the context of assessing built environment effects on travel choices. To our knowledge, this is the first consideration and application of the copula approach in the urban planning and transportation literature (see Prieger, 2002 and Schmidt, 2003 for the application of copulas in the economics literature). In the next section, we discuss the maximum likelihood estimation approach for estimating the

Data sources

The data used for this analysis is drawn from the 2000 San Francisco Bay Area Household Travel Survey (BATS) designed and administered by MORPACE International Inc. for the Bay Area Metropolitan Transportation Commission (MTC). In addition to the 2000 BATS data, several other secondary data sources were used to derive spatial variables characterizing the activity-travel and built environment in the region. These included: (1) zonal-level land-use/demographic coverage data, obtained from the

Variables considered

Several categories of variables were considered in the analysis, including household demographics, employment characteristics, and neighborhood characteristics. The neighborhood characteristics considered include population density, employment density, Hansen-type accessibility measures (such as accessibility to employment and accessibility to shopping; see Bhat and Guo, 2007 for the precise functional form), population by ethnicity in the neighborhood, presence/number of schools and physically

Conclusions and implications

In the current study, we apply a copula-based approach to model residential neighborhood choice and daily household vehicle miles of travel (VMT) using the 2000 San Francisco Bay Area Household Travel Survey (BATS). The self-selection hypothesis in the current empirical context is that households select their residence locations based on their travel needs, which implies that observed VMT differences between households residing in neo-urbanist and conventional neighborhoods cannot be attributed

Acknowledgments

This research was funded in part by Environmental Protection Agency Grant R831837. The authors are grateful to Lisa Macias for her help in formatting this document. Two anonymous referees provided valuable comments on an earlier version of this paper.

References (66)

  • S. Bourguignon et al., 2007. A sparsity-based method for the estimation of spectral lines from irregularly sampled data. IEEE Journal of Selected Topics in Signal Processing.
  • Boyer, B., Gibson, M., Loretan, M., 1999. Pitfalls in tests for changes in correlation. International Finance...
  • R.A. Briesch et al., 2002. Semiparametric estimation of brand choice behavior. Journal of the American Statistical Association.
  • A.C. Cameron et al., 2004. Modelling the differences in counted outcomes using bivariate copula models with application to mismeasured counts. The Econometrics Journal.
  • U. Cherubini et al., 2004. Copula Methods in Finance.
  • D.G. Clayton, 1978. A model for association in bivariate life tables and its application in epidemiological studies of family tendency in chronic disease incidence. Biometrika.
  • Conway, D.A., 1979. Multivariate distributions with specified marginals. Technical Report #145, Department of...
  • S.R. Cosslett, 1983. Distribution-free maximum likelihood estimation of the binary choice model. Econometrica.
  • J.A. Dubin et al., 1984. An econometric analysis of residential electric appliance holdings and consumption. Econometrica.
  • P. Embrechts et al. Correlation and dependence in risk management: properties and pitfalls.
  • D.J.G. Farlie, 1960. The performance of some correlation coefficients for a general bivariate distribution. Biometrika.
  • M.J. Frank, 1979. On the simultaneous associativity of F(x, y) and x + y − F(x, y). Aequationes Mathematicae.
  • E.W. Frees et al., 2005. Credibility using copulas. North American Actuarial Journal.
  • C. Genest et al., 2007. Everything you always wanted to know about copula modeling but were afraid to ask. Journal of Hydrologic Engineering.
  • C. Genest et al., 1986. Copules archimediennes et familles de lois bidimensionnelles dont les marges sont donnees. The Canadian Journal of Statistics.
  • M. Genius et al., 2008. Applying the copula approach to sample selection modeling. Applied Economics.
  • W. Greene, 1981. Sample selection bias as a specification error: a comment. Econometrica.
  • E.J. Gumbel, 1960. Bivariate exponential distributions. Journal of the American Statistical Association.
  • Hay, J.W., 1980. Occupational choice and occupational earnings: Selectivity bias in a simultaneous logit-OLS model....
  • J. Heckman, 1974. Shadow prices, market wages and labor supply. Econometrica.
  • J. Heckman, 1976. The common structure of statistical models of truncation, sample selection, and limited dependent variables and a simple estimator for such models. The Annals of Economic and Social Measurement.
  • J.J. Heckman, 1979. Sample selection bias as a specification error. Econometrica.
  • J.J. Heckman, 2001. Microdata, heterogeneity and the evaluation of public policy. Journal of Political Economy.
1 Tel.: +1 512 471 4535; fax: +1 512 475 8744.
